October 21, 2002
Ms. Jennifer J. Johnson
Secretary, Board of Governors of the Federal Reserve System
20th Street and Constitution Ave.
NW, Washington, D.C. 20551
Re: Docket No. R-1128 Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System
Dear Ms. Johnson,
IBM appreciates the opportunity to submit its letter of comment in response to the Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System. Subject matter experts from IBM's Business Continuity and Recovery Services, Integrated Technology Services National Practice, Resilient Business and Infrastructure Solutions National Practice, Product Research and Development and Financial Services Sector contributed to this effort.
The response consists of two sections. Section One presents IBM's general comments and initial thoughts regarding the business continuity, people, process, and technology issues associated with developing future guidance. Section Two contains IBM's response to the specific questions raised in Section V of the white paper.
IBM agrees with the agencies' broad consensus on the three business continuity objectives that have special importance after September 11:
- Rapid recovery and timely resumption of critical operations following a wide-scale regional disruption.
- Rapid recovery and timely resumption of critical operations following the loss or inaccessibility of staff in at least one major operating location.
- A high level of confidence, through ongoing use or robust testing, that critical internal & external continuity arrangements are effective and compatible.
IBM supports the agencies' goal of identifying methods to develop a more robust business continuity plan that strengthens the overall resilience of the U.S. financial system. To realize this breadth and depth of systemic resiliency, all core clearing and settlement organizations must be engaged to protect the relationships that each maintains in ensuring the vitality of the system. The failure of one or more firms to ensure sufficient resiliency in their own domains will diminish the effectiveness of those that have complied with the guidelines and significantly weaken the resiliency of the system as a whole.
Very truly yours,
Howard A. Fields
Resilient Business &
& Recovery Services
Section One - General Comments and Issues
- IBM believes the guidelines proposed by the agencies are a rational step towards strengthening the resilience of the U.S. Financial System.
- The agencies should strive to maintain the right balance between providing guidance and crafting regulations. These guidelines should lead, rather than drive, individual members of the U.S. Financial System to an understanding that increased consistency in how they respond to a severe disaster strengthens the entire system.
- The resilience of the U.S. Financial System could be further strengthened by considering other types of stresses and strains beyond severe regional disasters.
At the most basic level, disaster planning focuses on three primary questions. First, what are the expected disruptions? Second, what is the anticipated probability of their occurrence? And third, to what extent will an occurrence cause harm? The guidelines, and our response, focus on mitigating severe regional disruptions as a primary design point. We believe that discussing this topic within the context of Business Continuity, People, Process and Technology is the best assessment approach and have organized our response accordingly.
Several variables should be taken into account when striving to achieve systemic resiliency. Foremost, to be in a position to address the potential of a debilitating event impacting the financial system, a detailed business flow and process mapping of how firms interrelate must be developed and validated. Once these flows and processes are established, key dependencies, critical components and tolerance for outages can be assessed and alternatives presented for enabling continued operations.
Levels of data protection must be in place to ensure information access and data integrity in the event that an interruption does occur. Decisions regarding data should follow from the identification of critical business processes - at both the firm and system levels - and the deployment of appropriate technology solutions designed to maintain availability that supports continued business operations across the financial system.
A common approach to identifying, quantifying, mitigating and controlling risk should be considered with the primary focus at the firm level and a secondary focus at the system level. This 'check and balance' approach to risk management will alleviate the current misalignment in firm-specific continuity planning approaches and enable cross-industry consistency to facilitate systemic resiliency planning and measurement. That said, it should also be noted that this approach could become a double-edged sword. Although it is unlikely, if all enterprises were to use similar processes and technology, the commonality could increase the potential for someone, regardless of intent, to cause systemic harm by exploiting it.
A consistent method, coupled with leading and lagging metrics, should be developed to validate and test the viability of the cross-industry resilience, continuity and recovery plans. Understanding system variables, identifying accurate levels of component protection (data, applications, servers, physical infrastructure, networks, etc.) and recommending common risk management criteria at both the firm and system levels should result in an improved level of business continuity planning that enhances systemic resiliency across the U.S. Financial System.
The white paper indicates that a severe regional disruption would have sufficient impact such that operating staff in the primary location would be unable to relocate and perform in a secondary location. This statement appears to imply that a duplicate staff would need to be maintained at the secondary location. Given that implication, the cost of maintaining staff would increase for firms without generating additional revenue and would thus reduce profitability. This could potentially place some firms at a competitive disadvantage when compared to those organizations that will not be governed by the guidelines. To avoid this, there are at least two options to consider. First, a more detailed level of resource planning could be put in place to support adequate redundant operating capability for geographically dispersed centers without measurably increasing staff. Second, there are many sources of outsourcing and business continuity services that financial services firms can select. Both of these options could reduce the need for each enterprise to maintain redundant facilities and/or support staff.
The concentrated nature of the financial services industry creates a ready pool of industry-specific skills that freely move between firms, providing an ongoing injection of new ideas. Support organizations (e.g. industry associations) also tend to locate where a critical mass of individuals is located. Given this, the concept of separation of primary and secondary locations is likely to create new hiring challenges for firms that must attract appropriate, non-diluted skills to the secondary location.
Maintaining ongoing communications between groups is a key factor in recovering business processes and continuing operations. While this has proven to be a challenge when relocating existing staff to a new location, it is anticipated that switching over operations to a remote staff would increase complexity and could potentially jeopardize both the recovery and reconciliation processes. Detailed crisis management processes should be put in place to facilitate communications - both inter- and intra-enterprise - to ensure that everyone is being updated as to recovery progress, availability status and timing for business resumption.
A wide-scale, regional disruption will likely have a greater impact on business processes than on Information Technology (IT) processes. While increasingly sophisticated "hand-off" processes within IT permit a rapid recovery scenario, the uniqueness of many business processes makes them much more difficult to manage and monitor. To be most effective, the design of the overall recovery capability must reflect the needs of these business processes.
Due to the dynamic nature of the financial services industry, there is a continual transformation that takes place on the business side to improve service, satisfy compliance, and reduce cost. Currently, knowledge transfer processes within this environment are often fragmented, ad hoc or people-specific. To enable remote site recovery, process documentation and accessibility will need to become more formalized and should be self-validated more often - at both the firm and industry levels - to ensure applicability, availability and scalability.
For various operational reasons, the industry relies on paper documents as part of its business processes. Transferring required documents to a remote site might introduce greater logistical challenges. Addressing this will require extensive changes in current vital records management processes for non-IT related information. Furthermore, business processes at one firm are often integrated with the processes of others. Moving an operation to a remote site will require reconnecting those processes and reconciling their flow and validation.
Technical capabilities and constraints will continue to play a critical role in determining recovery requirements and implications for business continuity. A primary concern with synchronous disk mirroring technology is that it limits the distance between a secondary site and a primary processing site. Extending that distance introduces new data synchronization concerns, availability vs. performance trade-offs and an increased probability of data loss.
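As a rough illustration of why distance constrains synchronous mirroring, the sketch below estimates the propagation delay added to each mirrored write as the fiber distance between sites grows. It assumes that light travels through optical fiber at roughly 200,000 km per second and that each write waits for one round trip to the secondary site before it is acknowledged. The function name, the single-round-trip simplification and the sample distances are assumptions made for illustration; actual products may require additional round trips and add equipment latency that is not modeled here.

```python
# Assumption: signal speed in optical fiber is about 200,000 km/s
# (roughly two thirds of the speed of light in a vacuum).
FIBER_SPEED_KM_PER_S = 200_000.0


def sync_write_delay_ms(fiber_km: float, round_trips: int = 1) -> float:
    """Estimate the propagation delay, in milliseconds, added to one
    synchronously mirrored write.

    fiber_km    -- fiber route distance between primary and secondary sites
    round_trips -- hypothetical number of site-to-site round trips the
                   mirroring protocol needs per write (assumed to be 1)
    """
    if fiber_km < 0 or round_trips < 1:
        raise ValueError("fiber_km must be >= 0 and round_trips must be >= 1")
    one_way_seconds = fiber_km / FIBER_SPEED_KM_PER_S
    # A write is acknowledged only after the signal travels out and back.
    return 2 * one_way_seconds * round_trips * 1000.0


if __name__ == "__main__":
    # 40 km and 100 km are the proximities cited in this response;
    # 320 km and 480 km approximate the 200 - 300 miles raised in the white paper.
    for km in (40, 100, 320, 480):
        print(f"{km:>4} km: about {sync_write_delay_ms(km):.1f} ms added per write")
```

Under these assumptions the added delay grows linearly with distance, from about 0.4 ms per write at 40 km to several milliseconds at a few hundred kilometers, which is one way to view the availability versus performance trade-off described above.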
Emerging technology, which is currently in its design/development phase, will enable an organization to largely mask outage events ranging from the loss of a disk storage subsystem to the loss of an entire IT data center. For example, a mainframe technology enhancement might include a processor hyperswap function that mirrors across two sites separated by up to 40 fiber kilometers and masks disk subsystem outages, with the inherent capability being extended to mask data center outages. However, this is also becoming a prerequisite for using synchronous data mirroring, which implies distance limitations for the reasons cited above. Currently, against Recovery Time Objectives of 2-4 hours, organizations can achieve recovery times of 30-60 minutes with a Recovery Point Objective of zero. Technology vendors will continue to drive enhanced capabilities and will look to reduce recovery time objectives even further.
Networks are the common element across the industry and are critical in sustaining operations. Limitations or weaknesses may hamper the ability of significant market participants to communicate amongst themselves from remote sites and introduce longer-than-anticipated recovery processes. The evolution of industry-led standards designed to make it easier for members of the U.S. Financial System to communicate should be anticipated. Efforts could take the form of standardization to enable diversification across the numerous, disparate telecommunication infrastructure designs that are currently employed. The lack of a committed telecommunications infrastructure that ties together the various components of the financial system may degrade or offset the resiliency activities completed by the financial services industry.
The ability to integrate cross-platform data formats will remain an issue in both defining and cost-justifying mirrored solutions in the distributed computing environment. This complexity may indirectly drive firms to revisit these systems and potentially re-architect their applications to build continuity and recovery capability into the production environment. Such a decision could also facilitate the day-to-day data management, backup and recovery capability and provide greater flexibility in remote data recovery solutions.
Feedback IBM has received from clients in the industry indicates that given the current economic environment, firms are seeking to reduce IT expenditures and consolidate operations where possible to generate cost savings. The white paper implies that these firms will be compelled to reverse their focus by substantially increasing the amount of Information Technology spending that will be required to modify current recovery designs and implement the systemic solutions.
Section Two - Response to Questions
Scope of Application
- Have the agencies excluded any critical markets?
Yes. While retail and equity transactions are not directly considered in this proposal, they could represent a key component of any systemic contingency planning effort.
- Have the agencies sufficiently defined the term "core clearing and settlement organizations" for such organizations to identify themselves?
No response provided at this time.
- Have the agencies provided sufficient guidance for firms to determine whether they play "significant roles in critical financial markets?"
No. Greater detail could be provided in identifying the criteria against which an organization may be assessed to qualify as being a part of this segment. While most of the major contributors can identify with the agencies' definition, those that might be in the next tier may require additional guidance on what specific roles are viewed as critical so as to allow minimal room for interpretation. While the size of the firm directly affects the impact to the industry, smaller firms may have equally critical roles and should likewise be held to similar standards to protect system integrity.
- Are there other measures or additional facts or circumstances that should be used to determine whether a firm plays a significant role or acts as a core clearing organization?
Yes. A more extensive and definable set of measures and facts could be used to define both the criticality of an organization and the level of guidance with which it may align.
- Should the agencies establish an average daily dollar volume (e.g., $20 billion, $50 billion, $150 billion or some larger amount) or a market share test (e.g., 3, 5, 7, 10 percent market share or some larger amount) as a benchmark for either or both of these categories?
Yes. This requires finding the right balance between the additional work required and the incremental benefit to the system. There could be more detail in the guidelines, based on transaction volume and dollar volume in critical markets and how firms have traditionally positioned themselves in the specific market segments. Criteria associated with both "average daily dollar volume" and "market share" could represent two of several criteria that may be used to segment the market.
- Should such benchmarks differ by market or activity?
Yes. Benchmarks could differ by market and activity. Additionally, the benchmarks could reflect a relative priority.
- In some market segments, there are geographic concentrations of primary and back-up facilities of firms with relatively small market shares. Should sound practices take into consideration the geographic concentration of the back-up sites of firms that as a group could play a significant role in critical markets?
Yes. Because the agencies are taking a systemic look at a systemic issue, and there are many inter-dependencies among market participants both large and small, it may be important for the agencies to take into consideration all facilities that comprise that system. Many of the smaller firms may in fact contribute inputs to key processes or sub-processes that larger organizations rely on, and their failure may impact the overall health of the system.
- One of the reasons core clearing organizations are expected to recover and resume is that there are no effective substitutes that can assume their critical activities; is this also true for some or all firms that play significant roles in critical markets?
No. For those firms that play significant roles in critical markets, there is additional capacity and other firms could potentially add transactions to their process and cover the absence of a limited number of firms that play significant roles in critical markets.
- Should any firms that play significant roles in critical markets be required to meet an intra-day standard for recovery and resumption because of the size of their market share or volume, or the significance of the services they perform for other firms (e.g. as a correspondent bank or clearing broker) in clearing and settling material amounts of transactions and large-value payments?
Yes. Organizations, regardless of the market segment in which they are categorized, could be required to meet both an intra-day recovery and resumption objective.
- Does the paper's definition of a "wide-scale, regional disruption" provide sufficient guidance for planning for wide-scale, regional disruptions?
Yes. The paper's definition of a "wide-scale, regional disruption" provides sufficient guidance for planning for wide-scale, regional disruptions. However, the definition may have greater value if it is not based solely on geographic distance. Additional factors such as the following may also be considered: workforce dispersion, skills availability and concentration of financial activities within a defined area.
- Is there a need to provide some sense of duration of a wide-scale, regional disruption? If so, what should it be?
No. For planning purposes, duration could be considered indefinite.
Recovery and Resumption of Critical Activities
- Have the agencies identified the critical activities needed to recover and resume operation in critical markets? Is there a need to define the term "material" in this context? If so, what should be used?
Yes. The agencies have identified the critical activities needed to recover operations in critical markets.
Yes. The dictionary definition of the term material leaves market participants open to interpretation as to the size of transactions that should be completed and the degree to which open items, cash and positions need to be reconciled.
- Sound practice seems to require firms that play significant roles in critical markets to establish recovery targets of four hours after an event for their critical activities. Is this a realistic and achievable recovery-time objective for firms that play significant roles in critical markets? If not, what would be?
Yes. It could be realistic from a business standpoint and viewed as achievable given existing technology capabilities and constraints. Whether it is cost justifiable and practicable may be best answered by the firms that have extensive operations that would be impacted and segmented from both an IT and Operations perspective. Mobilizing personnel in a remote site could be a limiting factor. However, the outsourcing option remains a viable alternative.
- Similarly, sound practice seems to require core clearing and settlement organizations to establish recovery and resumption targets of two hours for critical activities. Is this a realistic and achievable resumption-time objective for core clearing and settlement organizations?
No. While existing technology is capable of providing near two-hour mainframe recoverability, several other factors should be considered prior to adopting this recovery and resumption target. Near two-hour recovery and resumption implies hot backup, duplicate connectivity and complete operations support capabilities. It may also require the cooperation of all core clearing and settlement organizations to coordinate this effort across the industry.
Given the various recovery approaches in place today, this goal may be considered to be an aggressive one at this time. However, it is something that could be considered as a long-term goal as technical capabilities increase, costs decline and greater hot backup solution parity is achieved across the industry. A near-term objective may be for next day recovery when processing is suspended for the current day. This would enable impacted firm(s) to re-establish cross-platform and business processing functionality in a more orderly fashion.
- Should recovery- and resumption-time objectives differ according to critical markets?
Yes. Both recovery and resumption time objectives could vary, according to priority and sensitivity, based on market segment. Markets that have the most significant impact on domestic money supply or overall liquidity could be given the highest priority regarding those objectives. Additionally, objectives aligned with Recovery Point Objectives, Recovery Communication Objectives and Data Integrity could also be considered.
- Have the agencies sufficiently described expectations regarding out-of-region back-up resources?
No. Expectations regarding out-of-region back-up resources could be described in greater detail if the terms "region," "resource" and "management of those resources" were also defined.
- Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200 - 300 miles between primary and back-up sites)?
No. The value of the guidance as to the distance from the primary site(s) to its back-up site(s) may be compromised if it were to be viewed in absolute terms. While geographic region is an important factor used to provide criteria in developing the guidance, additional measures should be considered that take into account the concentration of business in a particular area and how that concentration affects the industry as a whole.
For example, if an organization's IT facilities are located in a geography that is prone to natural or manmade disasters, a greater distance between sites may be required; conversely, in regions where the probability of wide-scale threats is lower, a smaller distance may suffice. As mentioned in Section One, while new technology will provide continuous availability (including a site disaster), this will only be available for enterprises with two sites within defined proximities (currently 40 km; eventually up to 100 km).
- What factors should be used to identify such a minimum distance?
The following represents a non-exhaustive list of potential criteria that could be considered when defining a minimum distance: business concentration, power grid, water supply, transportation availability, prevailing weather patterns, availability of IT skills outside the minimum area and limitations on existing communications technology to provide support.
- Should the agencies specify other requirements (e.g., back-up sites not be dependent on the same labor pools or infrastructure components, including power grid, water supply and transportation systems)?
Yes. The more specifically the agencies can define the foundation and the spirit from which the guidelines are developed, and translate them into measurable criteria, the better positioned the market will be to transform itself from an economic, competitive and capability perspective.
- Are there alternative arrangements (i.e., within a region) that would provide sufficient resilience in a wide-scale, regional disruption? What are they?
Yes. However, to better define those alternatives and the associated trade-offs, it may be necessary to gain a clear understanding both of the term "sufficient" and how it relates to the criteria that are used to define resiliency and resiliency levels. The following represents a non-exhaustive list of potential criteria that could be considered when defining alternative arrangements: capacity sharing between firms, outsourcing of IT and business processes and staff lending.
- Are there other arrangements that core clearing and settlement organizations should consider, such as common communication protocols that would provide greater assurance that critical activities will be recovered and resumed?
Yes. As one of many potential examples, a standard communication protocol would simplify the recovery process by removing unnecessary complexities and operational management overhead.
Timetable for Implementation
- To ensure that enhanced business continuity plans are sufficiently coordinated among participants in critical markets, should specific implementation timeframes be considered?
Yes. Developing a systemic plan that is supported by interdependent milestones and phases, which can be validated and revised, could improve the agencies' ability to address the systemic environment in a way that, at the very least, may strengthen the "weaker link(s)."
- Is it reasonable to expect firms that play significant roles in critical financial markets to achieve sound practices within the next few years?
Yes. If the agencies mandate the practices and can provide clear guidance that creates the environment in which the organizations can successfully transform themselves, this could be an attainable goal.
- Should the agencies specify an outside date (e.g. 2007) for achieving sound practices to accommodate those firms that may require more time to adopt sound practices in a cost-effective manner?
Yes. Such an approach could provide the foundation for monitoring both the point-in-time capabilities and exposures that exist in the end-to-end system and facilitate the initiation of gap-closing initiatives to remedy the 'weaker links.' However, there may be additional value in considering other factors, while continuing to include a cost-effective aspect, when defining what the timeframes and milestones should be. Additionally, care should be taken to prevent organizations that have more distant time frames from realizing a competitive advantage by delaying near-term investments relative to those firms with closer time frames.
- Would such distant dates communicate a sufficient sense of urgency for addressing the risk of a wide-scale, regional disruption?
Potentially. The agencies could prioritize the critical markets to be addressed for sound practices and mandate the timeframe within which they must address a wide-scale, regional disruption, without differentiating by the role that the organization plays. The agencies could also set specific milestones that would enable a certain level of additional resilience to be achieved in a nearer timeframe.