QinetiQ Trusted Information Management
Comments Relating to the
Draft Interagency White Paper on
Sound Practices to Strengthen the
Resilience of the U.S. Financial System

Date: October 21, 2002
To: Jennifer J. Johnson, Secretary, Board of Governors of the Federal Reserve System [Docket No. R-1128]

Office of the Comptroller of the Currency [Docket No. 02-13]

Jonathan G. Katz, Secretary, Securities and Exchange Commission [File No. S7-32-02]

From: Carl B. Jackson, Vice President, QinetiQ Trusted Information Management, Inc. (QinetiQ-TIM)
RE: Comments - Draft Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System

Ladies and Gentlemen:

QinetiQ Trusted Information Management appreciates this opportunity to comment on the "Interagency White Paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System" (the "White Paper"). QinetiQ Trusted Information Management, Inc. ("QinetiQ-TIM") is a provider of information security and continuity planning professional services with international clients in the core clearing and settlement organization sector.

My management asked that I contribute comments on the White Paper because of my experience as a former commissioned National Bank Examiner with the OCC and as the Continuity Planning Service Line Leader for a Big Four accounting firm. The comments below include a background description and tables that contain selected sections of the White Paper together with some additional background materials on QinetiQ-TIM and company management.

We appreciate the opportunity to present our views and are committed to working with the Agencies and the industry to reinforce strengths of the existing structure and to bring about changes that will benefit the industry and its participants. Should you have questions or comments, please feel free to contact Carl Jackson at 281-802-8206 or by email at cbjackson@qinetiq-tim.com.

Background:

The Federal Reserve, the Office of the Comptroller of the Currency, the Securities and Exchange Commission and the New York State Banking Department (the agencies) have been meeting with industry participants to analyze the lessons learned from the events of September 11, with a view towards strengthening the overall resilience of the U.S. financial system in the event of a wide-scale, regional disruption. Ensuring the resilience of critical financial markets requires that core clearing and settlement organizations and other firms that play significant roles in critical financial markets, many of which enjoy the benefits of operating out of major financial centers, will be able to perform their critical activities even in the event of a wide-scale, regional disruption.

Based on in-depth discussions with industry representatives, the agencies have reached certain conclusions regarding the necessity to assure the resilience of critical U.S. financial markets in the face of wide-scale, regional disruptions and identified a number of sound practices to strengthen the resiliency of the overall U.S. financial system and the respective U.S. financial centers. The paper discusses the views of the agencies on sound practices based on discussions with industry representatives on how the events surrounding September 11, 2001, have altered business recovery and resumption expectations for purposes of ensuring the resilience of the U.S. financial system and seeks comments on those views. Based on this extensive dialogue, the agencies have reached certain preliminary conclusions with respect to the factors affecting the resilience of critical markets and activities in the U.S. financial system; sound practices to strengthen financial system resilience; and an appropriate timetable for implementing these sound practices.

The Federal Reserve, the Office of the Comptroller of the Currency, and the Securities and Exchange Commission are publishing this draft white paper on Sound Practices to Strengthen the Resilience of the U.S. Financial System for comment. The New York State Banking Department and the Federal Reserve Bank of New York also participated in drafting the paper. The agencies are seeking comment on the sound practices discussed below. Comments have been invited and are due to be received within 45 days of publication in the Federal Register. Comments are to be delivered to:

    1) The Board of Governors of the Federal Reserve System: Please direct all comments concerning this paper to: Jennifer J. Johnson, Secretary, Board of Governors of the Federal Reserve System, 20th Street and Constitution Avenue, NW, Washington, D.C. 20551, or mailed electronically to regs.comments@federalreserve.gov. [Docket No R-1128]

    2) OCC: Please direct all comments concerning this paper to: Office of the Comptroller of the Currency, 250 E Street, SW, Public Information Room, Mail Stop 1-5, Washington, DC 20219, Attention: Docket No. 02-13; fax number (202) 874-4448; or Internet address: regs.comments@occ.treas.gov. [Docket No. 02-13]

    3) SEC: All comments concerning the paper should be submitted in triplicate to Jonathan G. Katz, Secretary, Securities and Exchange Commission, 450 5th Street, NW, Washington, DC 20549-0609. Comments can be submitted electronically at the following E-mail address: rule-comments@sec.gov. All comment letters should refer to File No S7-32-02; this file number should be included on the subject line if E-mail is used.

QinetiQ-TIM comments:

Overall Comments  
  • Overall Opinion
We consider the Draft White Paper to be well thought out and executed. Its overall scope and breadth are appropriate and, in fact, we consider it a long-overdue set of requirements for financial institutions' continuity planning.

We have three additional comments that deal with the following: (1) emphasizing the need for financial institutions to take a business process approach (as opposed to a technological focus) to continuity planning; (2) defining and using the term `time-critical' as opposed to `critical'; and (3) establishing an appropriate set of metrics to measure the long-term health and vitality of the institution's continuity planning business process.

  • Emphasize utilization of a business process approach to continuity planning
SUGGESTION: We suggest that financial institutions be encouraged to utilize business process models and mapping when prioritizing `time-critical' business processes and the resources that support them.

EXPLANATION: We consider it essential that the enterprise approach to continuity planning be process based. That is to say, the methodological approach to continuity planning should be business process (mega process, major process, sub-process) focused and include: (1) current state analysis of existing enterprise continuity planning components, business impact analysis, and risk management reviews; (2) mapping of time-critical business processes to support resources (i.e., IT infrastructure, communications networks, facilities, external partners, people, etc. that support the identified processes); (3) analysis of the most appropriate recovery alternatives given the time-critical resource mapping; (4) continuity and crisis management plan development, and development of short- and long-term testing, maintenance, training, and measurement processes; and (5) deployment of the planning and maintenance processes that were designed in (4) above.

Failure to correctly identify, name, and prioritize business processes, with an emphasis on those that are time-critical, will lead to inefficiencies and eventual collapse of the overall continuity planning business process within the enterprise. See the Carl Jackson article for Auerbach, entitled "Reengineering the Continuity Planning Process," for more detail on this concept.

  • Define and Use the Term `Time-Critical'
SUGGESTION: We suggest changing the traditionally utilized term `critical' to `time-critical' in the Objective and Scope statement, as well as in several other places within the White Paper.

EXPLANATION: The concept of prioritizing `time-critical' business processes and the resources that support them, including IT infrastructure, communications networks (both voice and data), facilities, and external partners (trading, vendor, customer, outsourcers, public, etc.), must be emphasized. Defining and using the term `time-critical' is very useful to those who are attempting to determine which parts of the enterprise should receive continuity planning attention, and in what order. The term `time-critical' can easily be differentiated from `mission-critical' or simply `critical' functions. To illustrate, it can be said that all time-critical processes are mission-critical, but not all mission-critical processes are time-critical.

Our experience is that between one-third and one-half of the business processes of an enterprise are truly time-critical. Narrowing the focus to time-critical processes and support resources streamlines the continuity planning process, making it more efficient to develop, test, maintain, and measure in the long run. Focusing attention on `time-critical' versus simply `critical' processes can spell the difference in the long-term success of the continuity planning processes.

  • Establish Business Continuity process metrics and measurement techniques and processes
SUGGESTION: We suggest that the Agencies emphasize the development of both quantitative and qualitative measurement processes to be deployed along with the continuity planning infrastructures.

EXPLANATION: The reality is that many executive management groups have difficulty understanding the overall value that continuity planning processes add within their organizations. This has led to the cyclical process exemplified by on-again, off-again continuity planning projects. What degree of value does continuity planning add to the enterprise's people, processes, technology and mission? Great question. It is sometimes difficult to get beyond the financial justification barrier. There is no question that justification of investment in continuity planning business processes based upon financial criteria is important, but it is not usually the financial metrics that drive recovery time windows (recovery time objectives). Continuity planning process metrics must be both quantitative and qualitative. It is the `customer service and customer confidence' issues that drive short recovery timeframes. Short recovery timeframes are typically the most expensive to implement because of the resource commitments involved in securing short-term recovery capabilities. Financial measurements do not always support short recovery windows.

Implementation of an appropriate measurement system is crucial to success. Companies must measure not only the financial metrics, but also how the continuity planning business process adds value to the organization's people, processes, technologies, and mission. These metrics must be both quantitative and qualitative. Focusing on financial measures alone has led to the on-again, off-again planning referred to earlier, and does not take into consideration the business interruption (customer service or lack of confidence) impacts of a disruption.

The following table outlines comments to specific questions within the White Paper:

Specific Comments  
Summary Data Comments
The agencies invite comments on the appropriate scope and application of the sound practices and implementation timetable discussed above, as well as other issues relevant to strengthening the resilience of the financial system in the face of wide-scale regional disasters. In particular the agencies invite comment in the following areas:  
Scope of application.  
  • Have the agencies excluded any critical markets?
Our view is that the Agencies have included all relevant markets within the breadth of the White Paper.
  • Have the agencies sufficiently defined the term "core clearing and settlement organizations" for such organizations to identify themselves?
The definition of core clearing and settlement organizations is clear and should not require further explanation.
  • Have the agencies provided sufficient guidance for firms to determine whether they play "significant roles in critical financial markets?"
Yes.
  • Are there other measures or additional facts or circumstances that should be used to determine whether a firm plays a significant role or acts as a core clearing organization?
While difficult to define in Agency guidelines, there will likely be a large number of facts and/or circumstances that will need to be used to determine whether a firm plays a significant role as a core clearing organization. This information will only be derived following an appropriately conducted Business Impact Assessment (BIA) for each core enterprise. As part of the time-critical process mapping to support resources, including mapping of support provided by external partners, numerous organizations and organizational components will be identified as being time-critical and will therefore cause that organization to fall under the definition of core clearing organization or as a supporter of such operations.
  • Should the agencies establish an average daily dollar volume (e.g., $20 billion, $50 billion, $150 billion or some larger amount) or a market share test (e.g., 3, 5, 7, 10 percent market share or some larger amount) as a benchmark for either or both of these categories?
It is our opinion that basing the benchmark on a percentage of market share makes the most sense. The percentage approach will also relieve the Agencies of the need to reissue guidelines as market conditions change.
  • Should such benchmarks differ by market or activity?
At this point, we feel the benchmark should be applied uniformly across the industry group.
In some market segments, there are geographic concentrations of primary and back-up facilities of firms with relatively small market shares.  
  • Should sound practices take into consideration the geographic concentration of the back-up sites of firms that as a group could play a significant role in critical markets?
Yes. Fortunately, the commercial hotsite vendors are geographically dispersed so that they can offer a minimum level of diversity for backup support for those smaller firms that are geographically concentrated. The eventual Agency Guideline may even cause the commercial vendors to diversify even more than they are presently. As an aside, the major firms in this industry group tend to be multi-national companies with several locations around the world where recovery operations could be organized. The issue will be, for them, the cost of planning for and acquiring appropriate backup resource support (i.e., communications circuits, hardware, personnel, facilities, etc.).
  • One of the reasons core clearing organizations are expected to recover and resume is that there are no effective substitutes that can assume their critical activities; is this also true for some or all firms that play significant roles in critical markets?
This is a difficult question to answer by way of making sweeping generalizations. We feel that as the Agency Guideline goes into effect, many of the firms that play significant roles in critical markets will have to consider acquiring `hot' backup capabilities or consider shifting operations to partners, affiliates, etc., in the short-term following a significant disaster or disruption. Unfortunately, to accurately answer this question, much depends upon the BIA process that must take place within each firm.
  • Should any firms that play significant roles in critical markets be required to meet an intra-day standard for recovery and resumption because of the size of their market share or volume, or the significance of the services they perform for other firms (e.g. as a correspondent bank or clearing broker) in clearing and settling material amounts of transactions and large-value payments?
In order for the eventual Agency Guideline to be effective, we believe that firms that play significant roles in critical markets should be required to meet standards for continuity of operations. This will be an unpopular mandate, but this requirement really begins to get to the main point and reason for the Agency Guideline.
  • Does the paper's definition of a "wide-scale, regional disruption" provide sufficient guidance for planning for wide-scale, regional disruptions?
Yes.
  • Is there a need to provide some sense of duration of a wide-scale, regional disruption?
No. For continuity planning purposes, a `disaster' is declared just as soon as it is determined that the resources that support time-critical processes will be `down' longer than the recovery time objective (RTO), as defined during the BIA process. Once it is determined that downtime will be longer than the RTO, then a `disaster' is declared and recovery activities and tasks are initiated. The anticipated length of the outage, beyond the RTO, is irrelevant. The focus should be on recovery of minimum time-critical operations within the recovery window and to continue to support those operations until the primary functionality is fully restored, no matter the length of time.
  • Have the agencies identified the critical activities needed to recover and resume operation in critical markets?
Two answers here. Yes from a mega-process standpoint. But each of these mega-processes (referred to as critical activities) has a number of major and sub-processes. Some of these major and sub-processes are time-critical and should be subject to continuity planning, and some are not. It would be impossible for the Agencies to accurately identify every time-critical mega, major and sub-process. This can only be done as part of the BIA process within each firm. The Agencies should therefore require that a business process BIA be conducted in order to ensure that each firm has identified all the time-critical activities that support operations.
  • Is there a need to define the term "material" in this context?
Defining materiality is mandatory in understanding how best to prioritize activities for recovery. However, given the size differential of the firms involved as well as in the `mission' of the firms (some have corporate earnings goals, others may have different goals) it is tricky to set one criterion across the industry. Perhaps establishing a framework for determining materiality would be a better approach.
Sound practice seems to require firms that play significant roles in critical markets to establish recovery targets of four hours after an event for their critical activities.  
  • Is this a realistic and achievable recovery-time objective for firms that play significant roles in critical markets?
Yes. RTOs of four hours or less require companies to make more substantial investments in continuity planning arrangements (i.e., communications circuits, hardware, facilities, software, management systems, etc.). For automated operations, mirrored processing using RAID technologies, failover processes, or even fully mirrored processing sites is really the only way to achieve recoverability in less than four hours. This raises the question of the requirement for `Continuous or High Availability' systems and processes.

Evolving with the birth of the web and web-based businesses is the requirement for 24x7 uptime. Traditional RTOs have disappeared for certain business processes and the resources that support the organization's web-based infrastructure. Unfortunately, simply preparing web-based applications for sustained 24x7 uptime is not the only answer. There is no question that application availability issues must be addressed, but it is also important to address the reliability and availability of other web-based infrastructure components, such as computer hardware, web-based networks, database file systems, web servers, and file and print servers, and to prepare for the physical, environmental, and information security concerns relative to each of these (See RMR above).

One other point here is that non-automated operations (i.e., mail room, certain back-office activities, etc.) should receive the same degree of care as automated processes. One of the lessons of 9/11 was that many firms had prepared recovery plans for computerized processes, but neglected manual processes.

Similarly, sound practice seems to require core clearing and settlement organizations to establish recovery and resumption targets of two hours for critical activities.  
  • Is this a realistic and achievable resumption-time objective for core clearing and settlement organizations?
When considering recovery for automated applications and processes, RTOs of less than eight hours, that is to say RTOs from one hour up to eight hours, tend to require substantial effort to achieve appropriate recovery alternative solutions. So the answer to this question is yes. As with the firms that play significant roles in critical markets, RTOs of two hours or less likewise require companies to make even more substantial investments in continuity planning arrangements (i.e., communications circuits, hardware, facilities, software, management systems, etc.). For automated operations, mirrored processing using RAID technologies, failover processes, or even fully mirrored processing sites is really the only way to achieve recoverability in less than four hours, so in these cases some sort of automated failover capability would be required. This raises the question of the requirement for `Continuous or High Availability' systems and processes. See discussion above for further thoughts on Continuous Availability.

Also relevant to this segment, non-automated time-critical processes should have the same level of attention as do automated processes.

  • Should recovery- and resumption-time objectives differ according to critical markets?
As a practical matter, RTOs will differ. RTOs for each of the firms within critical markets or for firms that play significant roles in critical markets will differ according to their current market circumstance, automated configurations, outsourcers, telecommunications vendors/suppliers, etc. There are numerous factors that would affect the RTO of each and every firm.
  • Have the agencies sufficiently described expectations regarding out-of-region back-up resources?
Yes. The out-of-region expectations as presented in the White Paper appear appropriate. The challenge with mandating this type of requirement is that there will always be exceptional circumstances where the expectations are simply not appropriate, and when this occurs it will unfortunately detract from other important components of the Agencies' Guideline. It would seem better to present broad-based out-of-region guidelines within which the firms have the flexibility to select the most appropriate backup resources.
  • Should some minimum distance from primary sites be specified for back-up facilities for core clearing and settlement organizations and firms that play significant roles in critical markets (e.g., 200 -- 300 miles between primary and back-up sites)?
Because the Agency Guideline is intended to address regional disruptions, it would seem logical that a minimum distance be set forth. The challenge is to make such a minimum distance requirement relevant for firms that are located in different geographic locations (e.g., New York City, Atlanta, Los Angeles, Anchorage, Honolulu, etc.). This is where setting a fixed minimum distance will most likely become contentious. Our experience is that there is no single fixed minimum distance, and that each firm must decide based upon several factors, including location of firm-occupied sites, location of hotsite vendors, distance that employees could comfortably or practically commute, etc. Our opinion is that a minimum distance cannot be effectively mandated. Guidelines for setting minimum distances so that affected firms could make informed decisions would be helpful, however.
  • What factors should be used to identify such a minimum distance?
The factors needed to make such a decision include: location of firm-occupied sites, location of hotsite vendors, distance that employees could comfortably or practically commute, costs of maintaining and operating backup facilities, operational efficiencies or inefficiencies of offsite backup facilities, and hardware/software/telecommunications resource requirements, to name just a few.
  • Should the agencies specify other requirements (e.g., back-up sites not be dependent on the same labor pools or infrastructure components, including power grid, water supply and transportation systems)?
The Agencies should suggest what appropriate or acceptable practices should be; however, in our opinion it would be very difficult to mandate and then enforce hard requirements.
  • Are there alternative arrangements (i.e., within a region) that would provide sufficient resilience in a wide-scale, regional disruption?
Certainly. Even when considering the 9/11 event, operations that were affected in Manhattan could have been recovered in New Jersey, Boston, Philadelphia, or other relatively nearby localities. In the South and East the primary concern has always been hurricanes or tropical storms. In the West the primary regional concern is seismic in nature. Given these realities, many companies have prepared for out-of-region recovery, so this is nothing new. The challenge has been that no event of the magnitude of 9/11 had ever really stressed the system. The point is that out-of-region or close-to-out-of-region alternative arrangements will continue to be viable under most sets of circumstances. Even with the 9/11 disruption, the hotsite vendors were all successful in helping those firms that needed assistance. On the other hand, should terrorists succeed in detonating a region-affecting nuclear or bio-chemical weapon(s), then close-to-out-of-region alternatives may well prove ineffective.
  • Are there other arrangements that core clearing and settlement organizations should consider, such as common communication protocols that would provide greater assurance that critical activities will be recovered and resumed?
Without getting specific, and from a continuity planning perspective, it is always preferable to try to recover to as close to an identical configuration as possible. The more the recovery capability looks like the primary or original operation, the more smoothly the recovery will likely proceed. Therefore, standardization of communications protocols and other mechanisms that would make recovery as transparent as possible is very desirable. Any encouragement the Agencies could provide manufacturers and/or industry groups in this area would help tremendously.
  • To ensure that enhanced business continuity plans are sufficiently coordinated among participants in critical markets, should specific implementation timeframes be considered?
Yes. It is our opinion that the timeframe for implementation should be coordinated among the firms and carried out quickly, given an appropriate upfront preparation time. Our experience is that once a medium to large organization (Fortune 500) decides to implement continuity planning for selected time-critical operations, the BIA and Current State Assessment activities take between three and four months. Recovery alternative decisions and plans can be written within just a few weeks (up to two to three months), with initial walk-through testing of these preparations beginning immediately following plan and continuity planning process deployment. Giving companies more time than this usually results in them slowing the implementation to fit the timeframe and often leads to failure, as other company priorities pop up to take attention away from the usually unexciting continuity planning efforts. Our opinion is that the Agencies should give all affected firms six months' notification to begin, and another twelve months to come into full compliance. Additional time will usually not make efforts more efficient, and may indeed distract many from the immediacy of the effort.
  • Is it reasonable to expect firms that play significant roles in critical financial markets to achieve sound practices within the next few years?
Yes. See comment above. We believe that the less time allowed the better. From notification by the Agencies to compliance (demonstrated by `meaningful testing') by the firms should be no longer than eighteen months, unless very special circumstances call for additional time. We can also observe that there is `never' a good time to perform continuity planning. Why? Because there is always a systems conversion coming, or a reorganization, or a personnel change, or some other event off into the future where it would seem logical to postpone continuity planning until it is completed. The fact is, that time really never seems to come, so postponement pending some future event should not be an option unless that event is exceptional or spectacular.
  • Should the agencies specify an outside date (e.g. 2007) for achieving sound practices to accommodate those firms that may require more time to adopt sound practices in a cost-effective manner?
Yes. See our comments above. We would suggest no more than eighteen months from notification of intent to initiation of meaningful testing. The term `meaningful testing' should be clearly defined. Exceptions should be made only in extreme cases and only for those firms who can prove that the additional time is really needed. The exception process should be rigorous enough so as to discourage application for waiver of the 18-month timeframe for frivolous reasons.
  • Would such distant dates communicate a sufficient sense of urgency for addressing the risk of a wide-scale, regional disruption?
No, and again we recommend the eighteen-month window suggested above, with the rigorous exception process for unusual circumstances.
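
As an illustration of the declaration rule described in our answer on the duration of a disruption, the following is a minimal sketch only. The class and function names and the sample resource are hypothetical and are not taken from the White Paper; the two-hour figure simply reuses the resumption target discussed above. The sketch shows that the decision depends only on whether the estimated outage exceeds the RTO set during the BIA, not on how far beyond the RTO the outage is expected to run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeCriticalResource:
    """A resource supporting a time-critical process (hypothetical model)."""
    name: str
    rto_hours: float  # recovery time objective, as defined during the BIA


def should_declare_disaster(resource: TimeCriticalResource,
                            estimated_downtime_hours: float) -> bool:
    """Return True when the estimated outage exceeds the resource's RTO.

    The length of the outage beyond the RTO plays no part in the decision.
    """
    return estimated_downtime_hours > resource.rto_hours


# Hypothetical example: a settlement platform with a two-hour RTO.
settlement = TimeCriticalResource(name="settlement platform", rto_hours=2.0)

short_outage = should_declare_disaster(settlement, 1.0)    # within the RTO
long_outage = should_declare_disaster(settlement, 6.0)     # beyond the RTO
longer_outage = should_declare_disaster(settlement, 72.0)  # far beyond the RTO
```

In this sketch a six-hour outage and a seventy-two-hour outage lead to the same declaration, which is why we see no need for the Agencies to specify a duration.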

Specific Comments from White Paper High-Level Outline (Other)  
Summary Data Comments
  • Rapid recovery and timely resumption of critical operations following a wide scale, regional disruption;
SUGGESTION: We suggest an alteration in the wording of this Objective to `time-critical' versus the traditionally utilized term `critical.'

EXPLANATION: The concept of `time-critical' business processes and the resources that support them, including IT infrastructure, communications networks (both voice and data), facilities, external partners (trading, vendor, customer, outsourcers, public, etc.) should be emphasized. Defining and using the term `time-critical' is very useful to those who are attempting to visualize which parts of the enterprise should receive continuity planning attention and in what order. The term `time-critical' can easily be differentiated from `mission critical' or simply `critical' functions. To illustrate, it can be said that all time-critical processes are mission critical, but not all mission-critical processes are time-critical. Our experience is that between one-third and one-half of the business processes of an enterprise are truly time-critical. Narrowing the focus to time-critical processes and support resources streamlines the continuity planning process making it more efficient to develop, test, maintain, and measure in the long run. Focusing attention on `time-critical' versus simply `critical' processes can spell the difference in the long term success of the continuity planning processes.

  • Rapid recovery and timely resumption of critical operations following the loss or inaccessibility of staff in at least one major operating location; and
(Note: Same comment as above pertaining to terminology - critical versus time-critical)
  • A high level of confidence, through ongoing use or robust testing, that critical internal and external continuity arrangements are effective and compatible.
(Note: Same comment as above pertaining to terminology - critical versus time-critical)
The agencies view these sound practices as being most applicable to organizations that present a type of systemic risk should they be unable to recover or resume critical activities that support critical markets. In this context, "systemic risk" includes the risk that the failure of one participant in a transfer system or financial market to meet its required obligations will cause other participants to be unable to meet their obligations when due, causing significant liquidity or credit problems and threatening the stability of financial markets.

The organizations that could present such systemic risk should they be unable to recover (i.e., complete) and resume (i.e., carry on) critical activities consist of core clearing and settlement organizations.

Considered appropriate, no further comment.
Other firms that play a significant role in critical financial markets also could contribute to systemic risk in payment and settlement systems should they be unable to recover critical activities. These organizations and key terms are described more fully below.
Considered appropriate, no further comment.
  • Critical markets provide the means for banks, securities firms, and other financial institutions to adjust their key cash and securities positions and those of their customers in order to manage significant liquidity, market, and other risks to their organizations.
Considered appropriate
  • Certain markets such as the Federal funds and government securities markets also support the implementation of monetary policy.
Considered appropriate
  • Federal funds, foreign exchange and commercial paper; Government, corporate, and mortgage-backed securities.
  • "Core clearing and settlement organizations" consist of market utilities that provide critical clearing and settlement services for financial markets and large value payment system operators.
Considered appropriate
  • Core clearing and settlement organizations also consist of firms that provide similar critical clearing and settlement services for critical financial markets in sufficient volume or value to present systemic risk in their sudden absence, and for whom there are no viable immediate substitutes.
Considered appropriate
"Firms that play significant roles in critical financial markets" are those that participate in sufficient volume or value such that their failure to perform critical activities by the end of the business day could present systemic risk. The agencies believe that many if not most of the 15-20 major banks and the 5-10 major securities firms, and possibly others, play at least one significant role in at least one critical market. In the context of these sound practices, the agencies are considering the benefit of providing additional guidance (e.g., in terms of market-share or dollar-value thresholds) to help firms identify the category into which they fall for the specific activities they perform.
Considered appropriate
For purposes of these sound practices, a "wide scale, regional disruption" is one that causes a severe disruption of transportation, telecommunications, power, or other critical infrastructure components across a metropolitan or other geographic area and its adjacent communities that are economically integrated with it; or that results in a wide scale evacuation or inaccessibility of the population within normal commuting range of the disruption's origin.
Considered appropriate
A. Resilience of Critical Markets and Activities in U.S. Financial System Critical Markets.  
The resilience of the U.S. financial system in the event of a wide-scale, regional disruption rests on the rapid recovery and resumption of critical financial markets defined above and the activities that support them.
Considered appropriate
The rapid restoration of critical financial markets, and the avoidance of potential systemic risk, requires firms that play significant roles in those markets to recover business processes and functions sufficient to complete critical activities by the end of each business day.
Considered appropriate
These critical activities are:

    a) Completing pending large-value payment instructions;

    b) Clearing and settling material pending transactions;

    c) Meeting material end-of-day funding and collateral obligations necessary to assure the performance of items a) and b) above;

    d) Managing material open firm and customer risk positions, as appropriate and necessary to assure the performance of items a) through c) above;

    e) Communicating firm and customer positions necessary to assure the performance of items a) through d) above, reconciling the day's records, and safeguarding firm and customer assets; and

    f) Performing all support and related functions that are integral to the above critical activities.

Considered appropriate
The rapid resumption of critical financial markets requires that core clearing and settlement organizations are able to recover and resume within the business day the critical activities they perform that support the recovery of critical markets.
Considered appropriate
B.

    a) Processing new large-value payment instructions;

    b) Clearing and settling material new transactions;

    c) Managing material ongoing funding and collateral requirements necessary to assure the performance of items a) and b) above;

    d) Managing material ongoing firm and customer risk positions, as appropriate and necessary to assure the performance of items a) through c) above;

    e) Communicating changes in firm and customer positions necessary to assure the performance of items a) through d) above, reconciling the day's records, and safeguarding firm and customer assets; and

    f) Performing all support and related functions that are integral to the above critical activities.

Considered appropriate
Sound Practices to Strengthen U.S. Financial System Resilience
The agencies have identified the following sound practices for core clearing and settlement organizations and other firms that play significant roles in critical financial markets.
Considered appropriate
The sound practices address the risks of a wide-scale, regional disruption and strengthen the resilience of the financial system.
Considered appropriate
They also reduce the potential for a regional disruption to have an undue impact on one or more critical markets because primary and back-up processing facilities and staffs are concentrated in a particular geographic region.
Considered appropriate
Core clearing and settlement organizations and other firms that play significant roles in critical financial markets should identify all the critical activities they perform in support of critical markets.
Considered appropriate
2. Determine the appropriate recovery and resumption objectives.
Considered appropriate
  • Firms that play significant roles in critical financial markets should, at a minimum, plan to recover on the same business day the critical activities they perform that support the recovery of critical markets.
Considered appropriate
  • In fact, an emerging industry objective appears to be for firms that play significant roles in critical financial markets generally to set a recovery time target of no later than four hours after the event.
Considered appropriate
  • Core clearing and settlement organizations should plan both to recover and to resume fully within the day their critical activities that support critical financial markets.
Considered appropriate
  • An emerging industry objective appears to be for such organizations generally to set a resumption-time target no later than two hours after the event.
Considered appropriate
3. Maintain sufficient out-of-region resources to meet recovery and resumption objectives.
Considered appropriate
  • Firms that play significant roles in critical markets, at a minimum, should have back-up arrangements with sufficient out-of-region staff, equipment, and data to recover their critical activities within their recovery-time objectives.
  • These arrangements can range from a firm establishing its own out-of-region back-up facility for data and operations, to arranging for the use of remote outsourced facilities.
Considered appropriate
  • The objective is to minimize the risk that a primary and a back-up site, and their respective labor pools, could both be impaired by a single wide-scale, regional disruption, including one centered somewhere in between them.
Considered appropriate
  • Core clearing and settlement organizations should have sufficient out-of-region resources both to recover and to resume fully their critical activities within their recovery and resumption-time objectives.
Considered appropriate
  • Although there may be a variety of approaches that could be effective, out-of-region back-up locations should not be dependent on the same labor pool or infrastructure components used by the primary site, and their respective labor pools should not both be vulnerable to simultaneous evacuation or inaccessibility.
Considered appropriate
  • Infrastructure components include transportation, telecommunications, water supply and electric power.
Considered appropriate
4. Routinely use or test recovery and resumption arrangements.  
  • Firms that play significant roles in critical financial markets and core clearing and settlement organizations should routinely use or test their individual internal recovery and resumption arrangements for required connectivity, functionality, and volume capacity.
Considered appropriate
  • Such institutions should also work cooperatively to design and to schedule appropriate cross-organization tests to assure the compatibility of individual recovery and resumption strategies within and across critical markets.
Considered appropriate
  • There are many important business and internal control reasons for having processing sites near financial markets and firms' headquarters.
Considered appropriate
  • It is the separation between primary and alternative processing sites that is important in promoting resilience.
Considered appropriate
  • Firms should be enhancing their business continuity plans to address wide-scale, regional disruptions; including adoption of implementation plans to achieve these sound practices.
Considered appropriate
  • To the extent that these sound practices require revisions of the plans, they should be completed as soon as possible and no later than 180 days after the agencies issue their final views.
Considered appropriate
  • The agencies recognize that firms that play significant roles in critical financial markets are in different stages of their planning and investment cycles regarding new facilities, technology, staffing, and business processes.
Considered appropriate
  • Furthermore, some have built, or are in the process of establishing, back-up sites or other arrangements that, while improving resilience, may not be fully consistent with these sound practices.
Considered appropriate
  • Given their different circumstances, it may take some firms longer than others to implement all of these sound practices in a cost-effective manner.
Considered appropriate
  • Accordingly, while the agencies recognize the need for some flexibility in implementation timetables, firms nevertheless should strive to achieve these sound practices as soon as practicable.
Considered appropriate
  • All core clearing and settlement organizations, however, should begin to implement plans to establish out-of-region back-up resources within the next year.
Considered appropriate
  • The events of September 11 underscored the fact that the financial system operates as a network of interrelated markets and participants.
Considered appropriate
  • The behavior of an individual participant can have a wide-ranging effect beyond its immediate counterparties.
Considered appropriate
  • Firms agreed that all participants in the financial system should strive to incorporate the three business continuity objectives into their plans; however, they also made clear that "one size does not fit all."
Considered appropriate
  • There was agreement that some critical activities, including safeguarding and transferring funds and financial assets, are so vital to the operation of the financial system that they should continue with minimal disruption, even in the event of a wide-scale, regional disruption.
Considered appropriate
  • All firms recognize the importance of critical financial markets to their own operations and to the financial system overall in the event of a wide-scale, regional disruption.
Considered appropriate
  • Core clearing and settlement organizations play a particularly crucial role in permitting firms and markets that are affected by the event to recover and resume operations as well as in permitting firms and markets that are unaffected to continue to operate.
Considered appropriate
  • For example, in order for firms affected by a disruption to recover critical activities by the end of the day, including clearing and settling pending transactions, clearing and settlement organizations must themselves be able to recover and resume operations within the day.
Considered appropriate
  • In addition, if some firms are unaffected by the disruption and are able to support the continued operation of critical markets to some degree, clearing and settlement organizations must be able to conduct operations.
Considered appropriate
  • If clearing and settlement organizations are not able to operate in such circumstances, they likely will contribute to the amplification of potential systemic risks.
Considered appropriate
  • For core clearing and settlement organizations, the dimensions of this systemic risk would likely be national and even international.
Considered appropriate
  • As a result of these considerations, core clearing and settlement organizations recognize that in the event of a wide-scale, regional disruption they must be able to both recover and fully resume critical activities within the day, and typically within a very limited period of time.
Considered appropriate
  • Firms that play significant roles in critical financial markets also should meet high recovery standards.
Considered appropriate
  • The agencies have found that industry participants generally recognize their respective roles in improving the overall resilience of the financial system and have made it a priority to complete internal preparations, share information and coordinate efforts.
Considered appropriate
  • Firms indicated that economic trade-offs and competitive considerations exist in making strategic decisions about business continuity that require the continuing leadership of senior management and should not be left to the discretion of individual business units.
Considered appropriate

B. Recovery of Critical Activities
  • Business continuity plans address a variety of issues, including emergency response procedures assuring the safety of personnel, effective internal and external communications, and implementation of business recovery and business resumption strategies.
Considered appropriate
  • The business continuity planning process involves a careful enterprise-wide analysis, including an assessment of the impact of an unexpected disruption of business processes and associated risks.
Considered appropriate
  • Among other things, plans are designed to manage those risks by arranging for the recovery of critical activities to permit an orderly resolution of outstanding obligations.
Considered appropriate
  • Firms also are expected to monitor their business continuity risks by testing and updating plans periodically.5 Business recovery preparations enable a firm to recover the operation of a disrupted business process or function in order to manage firm and customer risks.6 At a minimum this includes recovery of those "critical activities" necessary to permit the clearance and settlement of pending transactions; management and reconcilement of firm and customer positions; completion of the day's large value payments; and arranging for collateral or end-of-day funding.
Considered appropriate
  • This also includes recovery of activities or systems that support or are integrally related to the performance of these critical business processes or functions.
Considered appropriate
  • Business recovery preparations related to these critical activities are crucial to the smooth operation of the financial system.
Considered appropriate
  • The goal of business resumption is the effecting and processing of new transactions after old transactions have been completed. disruption experienced by a few firms will cascade into market-wide inefficiencies and liquidity dislocations.7 All firms recognize that business recovery is a core element of more comprehensive business continuity plans.
Considered appropriate
  • In discussions with industry members, firms often stated that the financial system is only as strong as its "weakest link."
Considered appropriate
  • Each firm has to ensure that its business continuity plans provide robust business recovery arrangements for the activities it performs that are critical to the smooth functioning of the financial system: wholesale payments processing, and clearance and settlement of money market instruments, government securities, foreign exchange, commercial paper and other corporate securities.
Considered appropriate
  • Industry participants also recognize that core clearing and settlement organizations represent potential single points of failure in the financial system and therefore have the greatest responsibility for ensuring that they can recover and fully resume those activities in a timely manner.
Considered appropriate
  • They also believe that firms that are significant participants in one or more critical markets or that effect a substantial volume or value of wholesale payments should develop robust recovery plans for critical activities in the event of a wide scale disruption when their primary sites and staffs may be inaccessible for some duration.
Considered appropriate
  • Once a firm identifies its critical business functions and processes, it must establish recovery-time targets sufficient to ensure that it can carry out those functions and processes in a manner that will result in minimal disruption to the financial system.
Considered appropriate
  • This facilitates the compatibility of recovery plans across firms and helps assure firms are able to participate in the financial system in times of wide-scale, regional disruptions.
Considered appropriate
  • A number of firms stated that current technology permits recovery-time targets of between one to four hours for many critical activities, even when factoring in the possibility of needing to reconstruct lost data.
Substitute the term `recovery-time targets' with `recovery time objectives' (RTOs).
  • In establishing recovery targets for critical activities, firms are coordinating their plans with the expectations of their respective core clearing and settlement organizations and peers.
Considered appropriate
  • Some payment systems already have established robust recovery targets.
Considered appropriate
  • Core clearing and settlement organizations are holding themselves to an intra-day recovery target -- generally a few hours -- and it is expected that technology will continue to improve upon those recovery times.
Considered appropriate
  • Some also have, or are establishing, recovery times for their participants and, in such cases, suggest that firms establish no later than end-of-day recovery targets.
Considered appropriate
  • For example, wholesale payment systems have typically required participants to recover from a disruption in less than four hours, and many firms, including the payment systems themselves, are now able to achieve recovery times of substantially less than two hours.
Considered appropriate
  • liquidity dislocations of the type experienced immediately after September 11 could be seriously compounded.
Considered appropriate
  • Industry members generally agree that recovery of critical activities and processes during a wide-scale, regional disruption requires establishment of some level of out-of-region arrangements for critical operations and the personnel and data that support them.
Considered appropriate
  • The objective of establishing out-of-region arrangements is to minimize the risk that a primary site and a back-up site, and their respective labor pools could be impaired by a single, wide-scale, regional disruption.
Considered appropriate
  • Although there may be other approaches that could be effective, firms generally agree that out-of-region locations should not be dependent on the same labor pool or infrastructure components used by the primary site and should not be affected by a wide-scale evacuation or the inaccessibility of the region's population.
Considered appropriate
  • Examples of such arrangements include a fully operational out-of-region back-up facility for data and operations, and utilizing outsourced facilities in which equipment, software and data are stored for staff to activate.
Considered appropriate
  • With this in mind, certain core clearing and settlement organizations, which are widely expected to recover and resume operations at full capacity indefinitely, and other firms that play significant roles in critical financial markets are establishing remote back-up facilities, in some cases hundreds or even thousands of miles away from the primary site.
Considered appropriate
  • Some firms that already have a national or multi-region presence are planning to utilize out-of-region offices to establish back-up sites.
Considered appropriate
  • Many are finding that there is the potential to achieve out-of-region staffing and system efficiencies by cross training staff or utilizing underused systems to share or shift loads.
Considered appropriate
  • Other firms that play significant roles in markets or in effecting payments also are developing remote arrangements to ensure that they can recover critical data and operations during a wide-scale outage within expected recovery time targets.
Considered appropriate
  • A number of firms in the process of identifying appropriate recovery arrangements stated that the events of September 11 have underscored the importance of building recovery strategies and capacities into their basic business processes.
Considered appropriate
  • Recovery plans must anticipate the need to have sufficient trained staff located at or near the back-up site to meet recovery objectives and plans for resuming a critical function at normal volumes for an extended duration.
 
  • Firms are staffing remote back-up sites in a variety of practical and cost-effective ways.
Considered appropriate
  • For example, firms operating active back-up sites often have full-time staffs that regularly perform the critical activities.
Considered appropriate
  • "software necessary to perform critical business functions and provide access to replicated data."
Considered appropriate
  • This approach allows a firm to recover a function in minutes to a few hours depending on the integrity of the data.
Considered appropriate
  • and other infrastructure providers, and the current limitations on an individual firm's ability to obtain verifiable redundancy of service from such carriers.
Considered appropriate
  • Firms that have out-of-region facilities obtain additional diversity in their telecommunications and other infrastructure services that provide additional resilience in ensuring recovery of critical operations.
Considered appropriate
  • Individual financial firms are also launching industry-wide efforts to explore common infrastructure issues and approaches.
Considered appropriate
  • Other firms plan to cross-train staff already located at remote sites so that they are able to assume responsibility for performing more critical back-up operations during an outage at the primary site.
Considered appropriate
  • Firms that outsource their business resumption facilities to an out-of-region facility may have some staff located there.
Considered appropriate
  • In general, firms that establish out-of-region facilities recognize that relocating employees is useful during the start-up/training period of developing a facility; however, it may be necessary to develop and maintain "local talent" to operate these facilities in the event of an extended outage and loss or inaccessibility of staff at the primary site.
Considered appropriate
  • Some firms do not have sufficient volumes to warrant establishing geographically remote back-up facilities capable of providing full resumption over the near term.
Considered appropriate
  • Nevertheless, many are taking steps to provide for the out-of-region recovery of transactional data and other resources to complete critical activities within target recovery times.
Considered appropriate
  • Ensuring that back-up facilities have access to current data is a critical component of business recovery.
Considered appropriate
  • Firms recognize that out-of-region facilities fall beyond the current distance capacity of some high-volume, synchronous mirrored disk back-up technology,10 and those establishing such facilities are taking a number of steps to minimize the potential for losing data in transit.
Considered appropriate
  • For example, a number of firms are transmitting data continuously to local and remote back-up data centers resulting in multiple back-up databases.
Considered appropriate
  • Others are sending more frequent batches to their remote back-up sites or to data storage locations electronically.
Considered appropriate
  • Some firms maintain multiple replicas of their databases at various locations that can be accessed for production and other uses.
Considered appropriate
  • In addition, a number of firms are establishing active back-up arrangements that permit the primary site automatically to shift production with little or no staff involvement, providing a very rapid recovery capability.
Considered appropriate
  • These steps can significantly reduce the amount of time it takes to recover lost transactions and improve the ability of a firm to recover the function or process.
Considered appropriate
  • Technology is evolving rapidly in this area; for example, software and hardware innovations are expected to provide the ability to maintain synchronous databases at even longer distances.
Considered appropriate
  • Some firms are establishing systems and business strategies that permit the use of continued improvements in technology to achieve the greatest geographical diversity practicable.
Considered appropriate
  • Sound planning includes developing flexible plans that incorporate alternative recovery and resumption arrangements.
Considered appropriate
  • These plans often can be activated to respond to more commonly experienced contingencies that affect fairly small geographic areas and were the subject of most plans before September 11.
Considered appropriate
  • For example, some firms that require real-time data back up have or are establishing in-region back-up sites that employ synchronous technology and are easily accessible in situations that do not involve a wide area disruption.
Considered appropriate
  • 60 -- 100 km. each day; dividing employees into shifts over a 24 hour period; and modifying information systems security access protocols to permit access to desk tops and data from home (virtual offices).
Considered appropriate
  • These measures provide additional resilience in responding to a disruption in an appropriate and practical manner.
Considered appropriate
Confidence in Recovery and Resumption Plans through Use or Testing
  • In responding to the events of September 11, many firms used plans developed during Year 2000 preparations.
Considered appropriate
  • Although these plans worked well, some found that backup databases, facilities, contact information and other aspects of their plans were not sufficiently up-to-date.
Considered appropriate
  • As a result, firms expressed a renewed commitment to ensure that critical internal and external business recovery and resumption arrangements are effective, communicated and rehearsed by all staff on a regular basis.
Considered appropriate
  • Some firms report that they are achieving a high level of confidence through the continuous use of two sites (i.e., active--active model), or by switching over to alternate facilities on a regular basis.
Considered appropriate
  • Periodic testing is an important and long-standing component of the business continuity planning process.
Considered appropriate
  • Firms typically stage tests of particular systems, processes (e.g., communications facilities) or business lines to limit risks inherent in tests utilizing production workloads.
Considered appropriate
  • Sound practice includes designing tests to simulate high impact scenarios, e.g., through switch or fail over to back-up facilities with no advance warning.
Considered appropriate
  • One of the lessons learned during September 11 is that testing of internal systems alone is no longer sufficient.
Considered appropriate
  • It also is critical to test back-up facilities with the primary and back-up facilities of markets, core clearing and settlement organizations and service providers to ensure connectivity, capacity and the integrity of data transmission.
Considered appropriate
  • Moreover, firms are planning to share back-up contact information and test arrangements with counterparties and important customers.
Considered appropriate
  • A number of firms and trade associations also have expressed a willingness to participate in or sponsor industry-wide testing.
Considered appropriate
  • As firms successfully complete the more limited testing discussed above, appropriately scaled industry-wide testing could prove beneficial.
Considered appropriate
  • Discussions within the industry on possible approaches are ongoing, and the prospect provides an incentive for firms to complete internal preparations so that there can be maximum participation.
Considered appropriate
  • One possibility may be to take a staged approach by organizing respective tests with the core clearing and settlement organizations.
Considered appropriate
  • As confidence grows, end-to-end tests could be organized.
Considered appropriate
  • After September 11, financial firms naturally initiated a lessons learned process with a view towards strengthening their business continuity plans.
Considered appropriate
  • Industry meetings with the agencies in February 2002 and throughout the spring confirmed that this process is nearing completion at many firms.
Considered appropriate
  • First, firms are taking immediate steps to ensure that they address obvious gaps and refine plans to address near-term risks.
Considered appropriate
  • Many are participating in industry initiatives aimed at improving private sector coordination and identifying sound practices with the intent of assuring that their plans are compatible with their peers.
Considered appropriate
  • Some of these steps include sharing contact information; procuring alternative telecommunications facilities; and meeting with disaster recovery authorities to determine the availability of resources to facilitate business recovery activities.
Considered appropriate
  • Second, firms are well along in reviewing and strengthening long-term strategic plans for business recovery and continuity of operations.
Considered appropriate
  • A number of firms already are discussing alternative solutions at the most senior level to ensure that final plans are consistent with overall business objectives, risk management strategies and financial resources.
Considered appropriate
  • Most firms indicate that they will complete their strategic plans and implementation timetables by year-end or shortly thereafter.
Considered appropriate
  • Some core clearing and settlement organizations already are in the process of establishing out-of-region, fully staffed and operational back-up facilities and expect to be operational within the next year.
Considered appropriate
  • Sound practice for all firms includes implementing long-range plans as soon as practicable in order to protect and enhance their franchise and promote confidence in the strength of the financial system.
Considered appropriate
  • It also is important for firms that play significant roles in the financial markets and payments systems to ensure that their implementation plans are consistent with the expectations of those markets, systems and peers.
Considered appropriate
  • Financial industry participants, and in particular those firms that were affected directly or indirectly by the September 11 attacks, are committed to ensuring the continued viability of the U.S. financial system by strengthening their own business continuity plans and improving the resilience of domestic markets and payments systems in the event of a wide-scale, regional disruption.
Considered appropriate
  • Many firms are taking steps to integrate the broader objectives discussed above into their business continuity plans while balancing the costs associated with achieving same-day recovery capabilities for critical activities.
Considered appropriate
  • Core clearing organizations are exploring their intra-day business resumption capabilities.
Considered appropriate
  • It is important to ensure that plans are flexible enough to incorporate evolving technologies that provide greater resilience of critical business functions and processes.
Considered appropriate
  • the implementation of business recovery and resumption arrangements to their utilities and others who are dependent upon the strength of their business continuity arrangements for critical activities, including customers, counterparties and vendors. The agencies believe that the lessons of September 11 are relevant to all financial system participants.
Considered appropriate
  • Accordingly, it is incumbent upon all firms to determine the extent to which it would be practicable to achieve the broader business recovery objectives for critical activities in the near future.
Considered appropriate
  • To the extent that these sound practices require revisions of the plans, firms should largely complete the planning process, including adoption of implementation plans, no later than 180 days after issuance of the agencies' final views and implement them as soon as practicable.
Considered appropriate
  • The agencies recognize that firms that play significant roles in critical financial markets are in different stages of their planning and investment cycles regarding new facilities, technology, staffing, and business processes.
Considered appropriate
  • Furthermore, some have built, or are in the process of establishing, back-up sites or other arrangements that, while improving resilience, may not be fully consistent with these sound practices.
Considered appropriate
  • Given their different circumstances, it may take some firms longer than others to implement all of these sound practices in a cost-effective manner.
Considered appropriate
  • Accordingly, while the agencies recognize the need for some flexibility in implementation timetables, firms that play significant roles in critical markets nevertheless should strive to achieve these sound practices as soon as practicable.
Considered appropriate
  • All core clearing and settlement organizations, however, should begin to implement plans to establish out-of-region back-up resources within the next year.
Considered appropriate
  • Meeting these planning and implementation goals will require the continued oversight and commitment of senior management.
Considered appropriate
  • The agencies will expect core clearing and settlement organizations and other financial firms that play a significant role in critical financial markets to adopt the sound practices outlined in this paper.
Considered appropriate
  • Furthermore, the agencies intend to incorporate these sound practices into supervisory expectations or other forms of guidance for purposes of reviewing the overall adequacy of those portions of business continuity plans that address the recovery of critical activities necessary to ensure the resilience of the financial system.
Considered appropriate
  • Firms can expect the agencies to review plans for their reasonableness and to take a keen interest in the appropriateness of plans to address risk relative to the firm's position in a critical market or in effecting large value payments.
Considered appropriate
  • This will include consideration of the probable effects a disruption of a firm's activities would have on the financial system.
Considered appropriate
  • As part of their ongoing review process, the agencies will consider how firms identify their critical activities, the appropriateness of the recovery and resumption objectives they set, and the adequacy of their plans for achieving those objectives.
Considered appropriate
  • The agencies will include consideration of whether recovery-time and resumption-time targets and implementation schedules are consistent with market and peer expectations.
Considered appropriate
  • Finally, the agencies will review the firm's assessment of test plans and results to confirm that the firm is appropriately able to manage its business risks should a wide-scale, regional disruption occur.
Considered appropriate

About QinetiQ Trusted Information Management

QinetiQ Trusted Information Management has 250 highly qualified professionals with an average of over 10 years of experience per person who bring to you the most important reference of all - a record of achievement and success all over the world.

QinetiQ Trusted Information Management understands the context for information security in your business and delivers assurance of vendor independence, global resources and extensive security expertise. We can leverage our trust relationship with you in your business by enabling you to meet standards and demonstrate best practice in protecting your information. We continue to work on that trust relationship by helping you manage the resulting security infrastructure and response capability.

QinetiQ TIM services are designed to give our clients innovative and sustainable solutions that meet their ongoing information security needs. Our exciting mix of Managed Security Services, Professional Consulting and Forensics expertise, plus an Education Program underpinned by world leading research teams will help clients achieve a durable and proven competitive advantage, now and in the future.

Our Professional Consulting Services range from Risk Assessment, Security Architecture, and Policy Development through to world-leading Penetration Testing and Vulnerability Analysis. Our Secure Operations Centers in both the UK and the US are protected to unparalleled standards that are recognized as military-strength, and our Research Facility is one of the largest and most successful in the world. With a long pedigree of managing information security at the highest level and providing leading-edge research into the future, QinetiQ Trusted Information Management provides a breadth and depth of Information Security Services unique in the industry.

Management Team

CEO, John Holland

John Holland heads up QinetiQ plc's Trusted Information Management business and has additional responsibility for Finance market contracts. Based at the UK Headquarters, he has over 25 years experience working in the computer industry. Prior to joining QinetiQ he worked for Symantec where he was the Vice President of Worldwide Security Services, responsible for Symantec's Professional, Education and Managed Security Service business. John joined Symantec from AXENT where he had been the Vice President for Europe, Middle East and Africa, responsible for building the business in these areas.

President, Mike Corby, CISSP, CCP

Mike Corby has been a practicing IT professional for more than 30 years specializing in systems technology management and computer security. Mike joins QinetiQ from Netigy Corporation where he was Vice President of Global Security Practice. As a Technology Specialist, Systems Manager and CIO for large international corporations, and as the Consulting Director of hundreds of Systems and Technology projects for several diverse companies, he has put many theories and creative ideas into practice. He has worked as Practice Director for the IT consulting branch of Ernst & Young and CIO for a division of Ashland Oil and the Bain & Company Consulting Group. He is a certified Information Systems Security Professional (CISSP) and Certified Computer Professional (CCP). In 1994 he was awarded a lifetime achievement award by the Computer Security Institute.

Vice President - Continuity Planning/QAR, Carl Jackson, CISSP

Carl B. Jackson is a Certified Information Systems Security Professional (CISSP) and brings more than 25 years of experience in the areas of business continuity planning, information security, and information technology internal control reviews and audits. As the QinetiQ Trusted Information Management, Inc. Vice President-Continuity Planning, he is responsible for the continued development and oversight of QinetiQ-TIM (US) methodologies and tools in the enterprise-wide business continuity planning arena including network and eBusiness availability and recovery. Before joining QinetiQ-TIM, Mr. Jackson served as the continuity planning practice leader and Partner with Ernst & Young LLP. Mr. Jackson has extensive consulting experience with numerous major organizations in multiple industries, including: manufacturing, financial services, transportation, healthcare, technology, pharmaceutical, retail, aerospace, insurance, and professional sports management. He also has extensive business continuity planning experience as an information security practitioner, manager in the field of information security and business continuity planning, and as a university-level professor.

EMEA Director Global Service Development, David Lynas

David Lynas joined QinetiQ from Netigy Corporation, where he was Director of Global Security Practice. He is an internationally renowned Information Security professional with nearly 20 years experience in the industry. In recent years David has specialized in designing strategic security architecture and has led many successful engagements for companies in the finance, healthcare, telecommunications, chemicals, manufacturing, and technology sectors all over the world, and for government in Europe and the USA. David continues to be in high demand as a presenter having delivered sessions and keynotes on more than thirty different aspects of security to international conferences on four continents. He is the founder and chair of the prestigious annual COSAC conference.

EMEA Director Education, Christine Cambridge

Christine Cambridge has a software engineering background and has contributed to various information security research projects at QinetiQ for over 9 years. Specializing in requirement management, Christine later moved on to project manage a multi-million pound Information Security research program, acting as program conduit for international collaboration with other governments, research establishments and industry. She has contributed to the writing of military IT standards and participated in related Information Security steering groups. Over the past four years Christine has built one of the largest, most successful commercial Penetration Testing teams within the UK.

EMEA Director Consulting, John Sherwood BSc MSc CEng FBCS CMC CISSP

John Sherwood is the Director of Professional Services (EMEA) within QinetiQ Trusted Information Management and is one of the key players transforming that company into a global world-class provider of Information Security Services. He has 31 years' experience as an information-systems professional, the last 16 of which have been as a specialist in security of business information systems. The great majority of this security experience is in the banking and finance industry, but it also covers aerospace, chemicals, oil & gas, telecommunications and government. Previous appointments include: Practice Director EMEA at Netigy (Feb 2001 - Sept 2001); Executive Director Architecture at Netigy (Jan 2000 - Feb 2001); Managing Director at Sherwood Associates Limited (Feb 1990 - Dec 1999); Managing Director at Computer Security Consultants Limited (Jan 1989 - Jan 1990); Systems Support Manager at Computer Security Limited (September 1985 - December 1988); Principal Lecturer, Software Engineering & Digital Communications Systems, De Montfort University, Leicester (July 1983 - August 1985). John is also a visiting lecturer and external examiner at Royal Holloway College, University of London, and has published and lectured extensively around the world on a broad range of topics in the information security domain.

EMEA Director Forensics & Incident Response, Dave Bacon

David has 13 years' experience working in the Metropolitan Police Force, where he was involved in a number of investigations including the Lockerbie bombing, the Marchioness sinking and the Mardi Gras bomber. In 1994 he was recruited to the Computer Crime Unit at Scotland Yard, where he investigated computer systems and telephone networks for major offences of computer misuse (hacking) and telephone misuse (phreaking). Other areas of investigation include murder, kidnapping, rape and robbery as well as specialized fraud investigations including Airline Ticket, Travel Agency and Mortgage fraud. He was recruited to DERA in 1998 and formed the Data Recovery & Computer Forensics Laboratory. He is currently Director of Digital Investigations Services for QinetiQ, responsible for all QinetiQ Incident Response, Computer Forensic and Data Recovery services, as well as Secure Data Deletion and Special Projects offerings.

EMEA Director Research, Andy Bates

Andy Bates has over 20 years' experience in research and development in the Internet and IT security area. In the early 1980s at the UK Royal Signals and Radar Establishment (RSRE), he was involved in the pioneering research to develop the Internet. This was followed by several years as an Internet consultant supporting many advanced technology projects. He became involved in information security in the late 1980s when he took on responsibility for the research and development of a state-of-the-art multi-level secure distributed system. In recent years, within QinetiQ, he has been responsible for the strategic direction, management and growth of one of the largest world-class trusted information management research and development teams.

US Director of Technology, Peter Stephenson CPE, PCE

Peter Stephenson has lectured and delivered consulting engagements in eleven countries plus the United States on network planning, implementation, technology and security, and has written, co-authored or contributed to 14 books and several hundred articles in major trade publications. He began his professional career in 1965. Prior to joining QinetiQ Trusted Information Management, Inc. as U.S. Director of Technology, he was the Director of Technology for the global security practice of Netigy Corporation. He operated his own information security consulting practice for over 15 years. He is the developer of the Intrusion Management model, the VAST method for vulnerability assessment, the S-TRAIS standards-based security requirements engineering method, and the End-to-End Digital Forensic Analysis technique for conducting digital investigations over large networks. Mr. Stephenson currently is a PhD candidate at Oxford Brookes University where his research involves intrusion detection in a forensic environment. He holds the professional designations Certified Professional Engineer (CPE) and Professional Computer Engineer (PCE) from the International Society of Professional Engineers.

Vice President Solution Sales & Marketing, Americas, Keith Franz

Mr. Franz joined QinetiQ Trusted Information Management in March 2002 as Vice President Solution Sales and Marketing. Mr. Franz has over 28 years of sales and marketing experience and is a frequent speaker on a variety of security topics. His experience includes the use of security tools such as firewalls, intrusion detection and monitoring; Internet, mid-tier and mainframe system security; and security practices. Prior to joining QinetiQ Trusted Information Management, he served as Vice President - Sales for RedSiren Corporation. Mr. Franz also has held executive sales positions at such companies as Axent Technologies, Tartan, Inc. and Ansoft Corporation and sales positions with IBM and Wang Laboratories.

Vice President & Contracts Officer, Marie Fogarty

As Vice President & Contracts Officer for QinetiQ Trusted Information Management, Inc., Ms. Fogarty brings over 15 years of legal expertise to her role, with a concentrated focus on representing professional services companies in the high tech marketplace. Ms. Fogarty was the Assistant General Counsel at Netigy Corporation, responsible for the direct legal support of the Eastern Region, Global Security Consulting Practice Group and the Channel/Alliances Organization. While at Netigy, Ms. Fogarty provided a full range of legal support and business counseling on all information technology contracts in the commercial and government context, including customized professional services engagements, systems integration, outsourcing, and packaged network security offerings. Prior to joining Netigy, Ms. Fogarty held senior level legal positions at several major IT vendors including Electronic Data Systems, Sun Microsystems and MCI Systemhouse. At MCI Systemhouse Ms. Fogarty was US Corporate Counsel in charge of a team of legal professionals and responsible for over $1 billion in business on an annual basis. Prior to entering in-house practice, Ms. Fogarty was associated with Shearman & Sterling and Cadwalader, Wickersham & Taft in New York, where her practice focused on the representation of financial institutions and mergers & acquisition transactions involving Fortune 500 companies.

Ms. Fogarty received her law degree with honors in 1987 from Cornell University Law School in Ithaca, New York, and graduated magna cum laude in 1984 with a degree in English and Sociology from the State University of New York at Binghamton.



Auerbach: Information Security Management
Reengineering the Continuity Planning Process
By: Carl B. Jackson
Vice President - Continuity Planning, QinetiQ-TIM

Foreword

The initial version of this chapter was written for the 1999 Edition of the "Handbook of Information Security Management." Since then, eCommerce has seized the spotlight, and Web-based technologies are the emerging solution for almost everything! The constant throughout these occurrences is that no matter what the climate, fundamental business processes have changed little. And, as always, the focus of any business impact assessment is to assess the time-critical priority of these business processes. With these more recent realities in mind, this chapter has been updated and is now offered for your consideration.

  • CP: Management Awareness High, Execution Effectiveness Low

The failure of organizations to accurately measure the contributions of the Continuity Planning (CP) process to their overall success has led to a downward spiraling cycle of the total business continuity program. The recurring downward spin or decomposition runs: planning, testing, maintenance, decline -> re-planning, testing, maintenance, decline -> re-planning, testing, maintenance, decline, and so on.

In the past, Contingency Planning & Management (CPM)/Ernst & Young Continuity Planning Benchmark surveys have repeatedly confirmed that CP is ranked as being either extremely important or very important to executive management. The most recent 2000-2001 CPM/KPMG Continuity Planning Survey1 clearly supports this observation. This study indicates that a growing number of CP professional positions are migrating from the IT infrastructure to corporate or general management positions; however, CP reporting within the IT organization is still the norm. Approximately 40 percent of CP professionals currently report to IT, while around 30 percent report to corporate positions.

Continuity Planning Measurements

While the trends of this survey are encouraging, there is a continuing indication of a disconnect between executive management's perceptions of CP objectives and the manner in which they measure its value. Traditionally, CP effectiveness was measured in terms of a pass/fail grade on a mainframe recovery test or on the perceived benefits of backup/recovery sites and redundant telecommunications weighed against the expense for these capabilities. The trouble with these types of metrics is that they only measure CP direct costs and/or indirect perceptions as to whether a test was effectively executed. These metrics do not indicate whether a test validates the appropriate infrastructure elements or even whether it is thorough enough to test a component until it fails, thereby extending the reach and usefulness of the test scenario.

So, what are the correct measures to use? While financial measurements do constitute one measure of the CP process, others measure the CP's contribution to the organization in terms of quality and effectiveness, which are not strictly weighed in monetary terms. The contributions that a well-run CP process can make to an organization include:

      (1) Sustaining growth and innovation;

      (2) Enhancing customer satisfaction;

      (3) Providing for people needs;

      (4) Improving overall mission critical process quality; and

      (5) Providing for practical financial metrics.

  • A Recipe for Radical Change: CP Process Improvement

Just prior to the millennium, experts in organizational management efficiency began introducing performance process improvement disciplines. These process improvement disciplines have been slowly adopted across many industries and companies for improvement of general manufacturing and administrative business processes. The basis of these and other improvement efforts was the concept that an organization's processes (for Process, see Definitions in Table 1) constituted the organization's fundamental lifeblood and, if made more effective and efficient, could dramatically decrease errors and increase organizational productivity.

An organization's processes are a series of successive activities, and when they are executed in the aggregate, they constitute the foundation of the organization's mission. These processes are intertwined throughout the organization's infrastructure (individual business units, divisions, plants, etc.) and are tied to the organization's supporting structures (data processing, communications networks, physical facilities, people, etc.).

A key concept of the Process Improvement and Reengineering movement revolves around identification of process enablers and barriers (see Definitions in Table 1). These enablers and barriers take many forms (people, technology, facilities, etc.) and must be understood and taken into consideration when introducing radical change into the organization.

The preceding narration provides the backdrop for the idea of focusing on continuity planning not as a project, but as a continuous process, that must be designed to support the other mission-critical processes of the organization. Therefore, the idea was born of adopting a continuous process approach to CP, along with understanding and addressing the people, technology, facility, etc. enablers and barriers. This constitutes a significant or even radical change in thinking from the manner in which we have traditionally viewed and executed recovery planning.

Radical Changes Mandated

High management awareness and low CP execution effectiveness, coupled with the lack of consistent and meaningful CP measurements, call for radical changes in the manner in which we execute recovery planning responsibilities. The techniques used to develop mainframe-oriented disaster recovery (DR) plans of the 1980s and 1990s consisted of five to seven distinct stages, depending upon whose methodology you were using, that required the recovery planner to:

      (1) Establish a project team and a supporting infrastructure to develop the plans;

      (2) Conduct a threat or risk management review to identify likely threat scenarios to be addressed in the recovery plans;

      (3) Conduct a business impact analysis (BIA) to identify and prioritize time-critical business applications/networks and determine maximum-tolerable-downtimes;

      (4) Select an appropriate recovery alternative that effectively addressed the recovery priorities and time-frames mandated by the BIA;

      (5) Document and implement the recovery plans; and

      (6) Establish and adopt an ongoing testing and maintenance strategy.

Shortcomings of the Traditional Disaster Recovery Planning Approach

The old approach worked well when disaster recovery of glass house mainframe infrastructures was the norm. It even worked fairly well when it came to integrating the evolving distributed/client-server systems into the overall recovery planning infrastructure. However, when organizations became concerned with business unit recovery planning, the traditional DR methodology was ineffective in designing and implementing business unit/function recovery plans. Of primary concern when attempting to implement enterprise-wide recovery plans was the issue of functional interdependencies. Recovery planners became obsessed with identification of interdependencies between business units and functions and the interdependencies between business units and the technological services supporting time-critical functions within these business units.

Losing Track of the Interdependencies

The ability to keep track of departmental interdependencies for CP purposes was extremely difficult and most methods for accomplishing this were ineffective. Numerous circumstances made consistent tracking of interdependencies difficult to achieve. Circumstances affecting interdependencies revolve around rapid rates of change that most modern organizations are going through. These include reorganization/restructuring, personnel relocation, changes in the competitive environment, and outsourcing. Every time an organizational structure changes, the CPs must change and the interdependencies must be reassessed, and the more rapid the change, the more daunting the CP reshuffling. Because many functional interdependencies could not be tracked, CP integrity was lost and the overall functionality of the CP was impaired. There seemed to be no easy answers to this dilemma.

Interdependencies Are Business Processes

Why are interdependencies of concern and what, typically, are the interdependencies? The answer is that, to a large degree, these interdependencies are the business processes of the organization and they are of concern because they must function in order to fulfill the organization's mission. Approaching recovery planning challenges with a business process viewpoint can, to a large extent, mitigate the problems associated with losing interdependencies, and also ensure that the focus of recovery planning efforts is on the most crucial components of the organization. Understanding how the organization's time-critical business processes are structured will assist the recovery planner in mapping the processes back to the business units/departments, supporting technological systems, networks, facilities, vital records, people, etc., and also will facilitate keeping track of the processes during reorganizations and/or during times of change.
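The mapping described above can be sketched in a few lines of Python. This is only a minimal illustration, not a prescribed method: the class name BusinessProcess, the helper functions, and the sample processes, business units, resources, and RTO values are all hypothetical. It assumes that each support resource should inherit the tightest recovery time objective (RTO) of any time-critical process that depends on it, and it keeps an index from business units to processes so that the mapping can be rebuilt after a reorganization.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BusinessProcess:
    """A time-critical business process and what it depends on (hypothetical model)."""
    name: str
    rto_hours: float  # recovery time objective assigned to the process
    business_units: List[str] = field(default_factory=list)
    resources: List[str] = field(default_factory=list)


def resource_rtos(processes: List[BusinessProcess]) -> Dict[str, float]:
    """Give each support resource the tightest RTO of any process that uses it."""
    rtos: Dict[str, float] = {}
    for process in processes:
        for resource in process.resources:
            current = rtos.get(resource, process.rto_hours)
            rtos[resource] = min(current, process.rto_hours)
    return rtos


def processes_by_unit(processes: List[BusinessProcess]) -> Dict[str, List[str]]:
    """Index processes by business unit so the map survives reorganizations."""
    index: Dict[str, List[str]] = {}
    for process in processes:
        for unit in process.business_units:
            index.setdefault(unit, []).append(process.name)
    return index


# Illustrative data only; real values would come from the organization's own analysis.
PROCESSES = [
    BusinessProcess("Settle trades", 4, ["Operations"],
                    ["Settlement system", "WAN link"]),
    BusinessProcess("Pay vendors", 72, ["Accounting", "Purchasing"],
                    ["ERP", "WAN link"]),
]

if __name__ == "__main__":
    for resource, hours in sorted(resource_rtos(PROCESSES).items()):
        print(f"{resource}: recover within {hours} hours")
    for unit, names in sorted(processes_by_unit(PROCESSES).items()):
        print(f"{unit}: {', '.join(names)}")
```

Because the process, rather than the department, is the unit of record, moving a process to a different business unit means editing one entry instead of re-deriving every departmental interdependency.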

  • The Process Approach to Continuity Planning

Traditional approaches to mainframe-focused disaster recovery planning emphasized the need to recover the organization's technological and communications platforms. Today, many companies have shifted away from technology recovery and toward continuity of prioritized business processes and the development of specific business process recovery plans. Many large corporations use the process reengineering/improvement disciplines to increase overall organizational productivity. CP itself should also be viewed as such a process. The following figure provides a graphical representation of how the enterprise-wide CP Process framework (Figure 1) should look:

Figure 1

This approach to Continuity Planning consolidates three traditional continuity-planning disciplines and adds a fourth, Continuous Availability, as follows:

  • IT Disaster Recovery Planning (DRP) - Traditional Disaster Recovery Planning addresses the continuity planning needs of the organizations' IT infrastructures, including centralized and decentralized IT capabilities and includes both voice and data communications network support services.

  • Business Operations Resumption Planning (BRP) - Traditional BRP addresses continuity of an organization's business operations (e.g., Accounting, Purchasing, etc.) should they lose access to their supporting resources (e.g., IT, communications network, facilities, external agent relationships, etc.).

  • Crisis Management Planning (CMP) - CMP focuses on helping the client organization develop an effective and efficient enterprise-wide emergency/disaster response capability. This response capability includes forming appropriate management teams and training their members in reacting to serious company emergency situations (e.g., hurricane, earthquake, flood, fire, serious hacker or virus damage, etc.). CMP also encompasses response to life-safety issues for personnel during a crisis or response to disaster.

  • Continuous Availability (CA) - In contrast to the other Continuity Planning components explained above, the recovery time objective (RTO) for infrastructure support resources in a 24x7 environment has diminished to zero. That is to say, the client organization cannot afford to lose operational capabilities for even a very short period of time without significant financial (revenue loss, extra expense) or operational (customer service, loss of confidence) impact. The CA service focuses on maintaining support-infrastructure uptime at 99 percent or higher.
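To make the uptime figure in the Continuous Availability discussion concrete, the short Python sketch below converts an availability percentage into the downtime it still permits over a year. The function name and the choice of a 365-day year are assumptions made for illustration; the arithmetic is simply the period length multiplied by the unavailable fraction.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime, in hours, that a given uptime percentage still permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative targets only; they are not taken from the text above.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime permits about {hours:.2f} hours of downtime per year")
```

At 99 percent, the allowance works out to roughly 87.6 hours a year, which suggests why an organization with a near-zero RTO would normally need a target well above that floor.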

  • Moving To A CP Process Improvement Environment

Route Map Profile and High-Level CP Process Approach

A practical, high-level approach to CP Process Improvement is demonstrated by breaking down the CP process into individual sub-process components as shown in the following figure (Figure 2):

Figure 2

The six major processes of the Continuity Planning business process are described below:

  • Current State Assessment/Ongoing Assessment - Using the enterprise-wide approach to continuity planning illustrated in Figure 2 above, measure the 'health' of the Continuity Planning Process. During this process, existing continuity planning business sub-processes are assessed to gauge their overall effectiveness. It is sometimes useful to apply gap analysis techniques to understand the current state and the desired future state, and then to identify the people, process, and technology barriers and enablers that stand between them. An approach to co-development of current state/future state visioning sessions is illustrated in Figure 3.

Figure 3 - Current State/Future State Visioning Overview

    The Current State Assessment process also involves identifying and/or determining how the organization 'values' the CP process and measures its success (a step that is often overlooked, which often leads to the failure of the CP process). Also during this process, an organization's business processes are examined to determine the impact of loss or interruption of service on the overall business through performance of a business impact assessment (BIA). The goal of the BIA is to prioritize business processes and assign the recovery time objective (RTO) for their recovery as well as for the recovery of their support resources. An important outcome of this activity is the mapping of time-critical processes to their support resources (e.g., IT applications, networks, facilities, communities of interest, etc.).

  • Process Risk & Impact Baseline - During this process, potential risks and vulnerabilities are assessed and strategies and programs developed to mitigate or eliminate those risks. The stand-alone risk management review (RMR) commonly looks at the security of Physical, Environmental, and Information capabilities of the organization. In general, the RMR should identify or discuss the following basic areas:

    • Potential threats,

    • Physical and environmental security,

    • Information security,

    • Recoverability of time-critical support functions,

    • Single-points-of-failure,

    • Problem and change management,

    • Business interruption and extra expense insurance,

    • An Off-Site Storage Program, etc.

  • Strategy Development - This process involves facilitating a workshop or series of workshops designed to identify and document the most appropriate recovery alternative to CP challenges (e.g., determining if a 'hotsite' is needed for IT continuity purposes, determining if additional communications circuits should be installed in a networking environment, or determining if additional workspace is needed in a business operations environment). Using the information derived from the risk assessments above, design long-term testing, maintenance, awareness, training and measurement strategies.

  • Continuity Plan Infrastructure - During plan development, all policies, guidelines, continuity measures, and Continuity Plans are formally documented. Structure the CP environment to identify plan owners and project management teams, to ensure the successful development of the plan. In addition, tie the continuity plans to the overall IT continuity plan and Crisis Management Infrastructure.

  • Implementation - During this phase, the initial versions of the continuity and/or crisis management plans are implemented across the enterprise environment. Also during this phase, long-term testing, maintenance, awareness, training and measurement strategies are implemented.

  • Operate Environment - This phase involves the constant review and maintenance of the continuity and crisis management plans. In addition, this phase may entail maintenance of the ongoing viability of the overall continuity and crisis management business processes.
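The business impact assessment step described under Current State Assessment above (prioritizing processes by RTO and mapping them to support resources) can be sketched as follows. This is an illustrative sketch only: the process names, RTO values, and resource names are hypothetical, and the rule that a shared resource inherits the tightest RTO of any process depending on it is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BusinessProcess:
    """A business process with its recovery time objective and support resources."""
    name: str
    rto_hours: float
    resources: List[str] = field(default_factory=list)


def prioritize(processes: List[BusinessProcess]) -> List[BusinessProcess]:
    """Order processes so the most time-critical (smallest RTO) come first."""
    return sorted(processes, key=lambda p: p.rto_hours)


def resource_rtos(processes: List[BusinessProcess]) -> Dict[str, float]:
    """Assign each support resource the tightest RTO of any process that uses it."""
    rtos: Dict[str, float] = {}
    for process in processes:
        for resource in process.resources:
            rtos[resource] = min(rtos.get(resource, process.rto_hours),
                                 process.rto_hours)
    return rtos


# Hypothetical sample data for illustration only.
processes = [
    BusinessProcess("customer statements", 24.0, ["print facility", "core network"]),
    BusinessProcess("wire transfers", 2.0, ["payments application", "core network"]),
    BusinessProcess("payroll", 72.0, ["HR system", "core network"]),
]

if __name__ == "__main__":
    for process in prioritize(processes):
        print(f"{process.name}: RTO {process.rto_hours} hours")
    for resource, rto in sorted(resource_rtos(processes).items()):
        print(f"{resource}: must be recoverable within {rto} hours")
```

In this example the shared network inherits the two-hour RTO of the most time-critical process that depends on it, which is the kind of dependency the process-to-resource mapping is meant to expose.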

  • How Do We Get There? The Concept of the CP Value Journey

The CP Value Journey is a helpful mechanism for co-development of CP expectations by the organization's top management group and those responsible for recovery planning. In order to achieve a successful and measurable recovery planning process, the following checkpoints along the CP Value Journey should be considered and agreed upon. The checkpoints include:

    • Defining Success - Define what a successful CP implementation will look like. What is the Future State?

    • Aligning the CP with Business Strategy - Challenge objectives to ensure that the CP effort has a business-centric focus.

    • Charting an Improvement Strategy - Benchmark where the organization stands relative to its peers, set goals based upon that position, and identify which critical initiatives will help the organization achieve those goals.

    • Becoming an Accelerator - Accelerate the implementation of the organization's CP strategies and processes. In today's environment, speed is a critical success factor for most companies.

    • Creating a Winning Team - Build an internal/external team that can help lead the company through CP assessment, development, and implementation.

    • Assessing Business Needs - Assess the dependence of time-critical business processes on the supporting infrastructure.

    • Documenting the Plans - Develop continuity plans that focus on assuring that time-critical business processes will be available.

    • Enabling the People - Implement mechanisms that help enable rapid reaction and recovery in times of emergency, such as training programs, a clear organizational structure, and a detailed leadership and management plan.

    • Completing the Organization's CP Strategy - Position the organization to complete the operational and personnel related milestones necessary to ensure success.

    • Delivering Value - Focus on achieving the organization's goals while simultaneously envisioning the future and considering organizational change.

    • Renewing/Recreating - Challenge the new CP process structure and organizational management to continue to adapt and meet the challenges of demonstrating availability and recoverability.

The Value Journey Facilitates Meaningful Dialogue

This Value Journey technique for raising the awareness level of management helps to both facilitate meaningful discussions about the CP Process and to ensure that the resulting CP strategies truly add value. As will be discussed later, this value-added concept will also provide additional metrics by which the success of the overall CP process can be measured.

  • The Need for Organizational Change Management

In addition to the approaches of CP Process Improvement, and the CP Value Journey mentioned above, the need to introduce people-oriented Organizational Change Management (OCM) concepts is an important component in implementing a successful CP process.

Mr. H. James Harrington, et al., in their book Business Process Improvement Workbook [2], point out that applying process improvement approaches can often cause trouble unless the organization manages the change process. They state that, "Approaches like reengineering only succeed if we challenge and change our paradigms and our organization's culture. It is a fallacy to think that you can change the processes without changing the behavior patterns or the people who are responsible for operating these processes." [3]

Organizational change management concepts, including the identification of people enablers and barriers and the design of appropriate implementation plans which change behavior patterns, play an important role in shifting the CP project approach to one of CP Process Improvement. The authors also point out that, "There are a number of tools and techniques that are effective in managing the change process, such as pain management, change mapping, and synergy. The important thing is that every BPI (Business Process Improvement) program must have a very comprehensive change management plan built into it, and this plan must be effectively implemented." [4]

Therefore, it is incumbent on the recovery planner to ensure that, as the concept of the CP Process evolves within the organization, appropriate OCM techniques are considered and included as an integral component of the overall deployment effort.

  • How Do We Measure Success? Balanced Scorecard Concept [5]

A complement to the CP Process Improvement approach is the establishment of meaningful measures or metrics that the organization can use to weigh the success of the overall CP process. Traditional measures include:

  • How much money is spent on hotsites?

  • How many people are devoted to CP activities?

  • Was the hotsite test a success?

Instead, the focus should be on measuring the CP Process contribution to achieving the overall goals of the organization. This focus helps us to:

  • Identify agreed-upon CP development milestones

  • Establish a baseline for execution

  • Validate CP Process delivery

  • Establish a foundation for management satisfaction in order to successfully manage expectations


The CP Balanced Scorecard includes a definition of the:

  • Value Statement

  • Value Proposition

  • Metrics/Assumptions on reduction of CP risk

  • Implementation Protocols

  • Validation Methods

Figures 4 and 5 illustrate the Balanced Scorecard concept and show examples of the types of metrics that can be developed to measure the success of the implemented CP Process. Included in this Balanced Scorecard approach are the new metrics upon which the CP Process will be measured.

Following this Balanced Scorecard approach, the organization should define what the Future State of the CP Process should look like (see the preceding CP Value Journey discussion). This Future State definition should be co-developed by the organization's top management and those responsible for development of the CP Process infrastructure. Figure 3 illustrates the Current State/Future State Visioning Overview, a technique that can also be used for developing expectations for the Balanced Scorecard. Once the Future State is defined, the CP Process development group can outline the CP Process implementation critical success factors in the areas of:

  • Growth and innovation

  • Customer satisfaction

  • People

  • Process quality

  • Financial state

These measures must be uniquely developed based upon the specific organization's culture and environment.

Figure 4 - Balanced Scorecard Concept

Figure 5 - Continuity Process Score Card Example

  • What about Continuity Planning for Web-Based Applications?

Evolving with the birth of the web and web-based businesses is the requirement for 24x7 uptime. Traditional recovery time objectives have disappeared for certain business processes and for the resources that support the organization's web-based infrastructure. Unfortunately, simply preparing web-based applications for sustained 24x7 uptime is not the only answer. There is no question that application availability issues must be addressed, but it is also important to address the reliability and availability of other web-based infrastructure components, such as computer hardware, web-based networks, database file systems, web servers, and file and print servers, and to prepare for the physical, environmental, and information security concerns relative to each of these (see RMR above). Preparing the entirety of this infrastructure to remain available through major and minor disruptions is usually referred to as Continuous or High Availability.

Continuous Availability (CA) is not simply bought; it is planned for and implemented in phases. The key to a reliable and available web-based infrastructure is to ensure that each of the components of the infrastructure has a high degree of resiliency and robustness. To substantiate this statement, Gartner Research reports: "Replication of databases, hardware servers, web servers, application servers and integration brokers/suites helps increase availability of the application services. The best results, however, are achieved when, in addition to the reliance on the system's infrastructure, the design of the application itself incorporates considerations for continuous availability. Users looking to achieve continuous availability for their web applications should not rely on any one tool but should include the availability considerations systematically at every step of their application projects." [7]
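The effect of replication on availability, and the reason every component in the chain matters, can be illustrated with a short calculation. This is a simplified sketch that assumes component failures are statistically independent, which real infrastructures only approximate; the function names and availability figures are hypothetical.

```python
from math import prod
from typing import Iterable


def parallel_availability(availability: float, replicas: int) -> float:
    """Availability of a replicated component: it is down only if every replica is down.

    Assumes replicas fail independently of one another.
    """
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1.0 - (1.0 - availability) ** replicas


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a chain of components that must all be up at once."""
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical figures: each tier is 99% available on its own.
    single_chain = serial_availability([0.99, 0.99, 0.99])
    replicated_chain = serial_availability(
        [parallel_availability(0.99, 2)] * 3
    )
    print(f"three unreplicated tiers: {single_chain:.4%}")
    print(f"three tiers, two replicas each: {replicated_chain:.4%}")
```

Under these assumptions, chaining unreplicated tiers lowers end-to-end availability below that of any single tier, while replicating each tier raises it. This is consistent with the point that no single tool delivers continuous availability on its own.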

Implementing a Continuous Availability methodology provides an organized and methodical way to achieve 24x7 or near 24x7 availability. Begin this process by understanding business process needs and expectations and the vulnerabilities and risks of the network infrastructure (e.g., Internet, Intranet, Extranet), including undertaking single-points-of-failure analysis. As part of considering implementation of Continuous Availability, the organization should examine the resiliency of its network infrastructure and the components thereof, including the capability of its infrastructure management systems to handle network faults and network configuration and change, the ability to monitor network availability, and the ability of individual network components to handle capacity requirements. See Figure 6 for an example pictorial representation of this methodology:

Figure 6 - Continuous Availability Methodological Approach

The CA methodological approach is a systematic way to consider and move toward a continuously available web-based environment. A very high-level overview of this methodology is as follows:

  • Assessment/Planning - During this phase, the enterprise should endeavor to understand the current state of business process owner expectations/requirements and the components of the technological infrastructure that support web-based business processes. Utilizing both interview techniques (people to people) and existing automated system and network diagnostic tools will assist in understanding availability status and concerns.
  • Design - Given the results of the current state assessment, design the continuous availability strategy and implementation/migration plans. This would include developing a classification system for the web-based infrastructure, to be used in determining the governance processes for granting access to, and use of, web-based resources.
  • Implementation - Migrate existing infrastructures to the web-based environment according to design specifications as determined during the Design phase.
  • Operations/Monitoring - Establish operational monitoring techniques and processes for the ongoing administration of the web-based infrastructure.

Along these lines, in their book Blueprints for High Availability: Designing Resilient Distributed Systems [8], Marcus and Stern recommend several fundamental rules for maximizing system availability (paraphrased):

  • Spend Money...but Not Blindly - Since quality costs money, investing in an appropriate degree of resiliency is necessary.
  • Assume Nothing - Nothing comes bundled when it comes to continuous availability. End to end system availability requires up front planning and cannot simply be bought and dropped in place.
  • Remove Single-Points-of-Failure - If a single link in the chain breaks, regardless of how strong the other links are, the system is down. Identify and mitigate single-points-of-failure.
  • Maintain Tight Security - Provide for the physical, environmental and information security of your web-based infrastructure components.
  • Consolidate Your Servers - Consolidate the functionality of many small servers onto fewer, larger servers to facilitate operations and reduce complexity.
  • Automate Common Tasks - Automate the commonly performed systems tasks. Anything that can be done to reduce operational complexity will assist in maintaining high availability.
  • Document Everything - Do not discount the importance of system documentation. Documentation provides audit trails and instructions to present and future systems operators on the fundamental operational intricacies of the systems in question.
  • Establish Service Level Agreements (SLA) - It is most appropriate to define enterprise and service provider expectations ahead of time. SLAs should address system availability levels, hours of service, locations, priorities, and escalation policies.
  • Plan Ahead - Plan for emergencies and crises, including those that involve multiple failures, in advance of actual events.
  • Test Everything - Test all new applications, system software, and hardware modifications in a production-like environment prior to going live.
  • Maintain Separate Environments - Provide for separation of systems, when possible. This separation might include separate environments for the following functions: production, production mirror, quality assurance, development, laboratory, and disaster recovery/business continuity site.
  • Invest in Failure Isolation - Plan to isolate problems so that, if or when they occur, they cannot boil over and affect other infrastructure components, to the degree possible.
  • Examine the History of the System - Understanding system history will assist in understanding what actions are necessary to move the system to a higher level of resiliency in the future.
  • Build for Growth - A given in the modern computer era is that reliance on systems increases over time. As enterprise reliance on system resources grows, the systems must grow. Therefore, adding system resources to existing reliable system architectures requires preplanning and concern for workload distribution and application leveling.
  • Choose Mature Software - It should go without saying that mature software that supports a web-based environment is preferred over untested solutions.
  • Select Reliable and Serviceable Hardware - As with software, hardware components that have demonstrated a high mean time between failures are preferable in a web-based environment.
  • Reuse Configurations - If the enterprise has stable system configurations, reuse or replicate them as much as possible throughout the environment. The advantages of this approach are ease of support, pretested configurations, a high degree of confidence for new rollouts, the possibility of bulk purchasing, availability of spare parts, and less to learn for those responsible for implementing and operating the web-based infrastructure.
  • Exploit External Resources - Take advantage of others that are implementing and operating web-based environments. It is possible to learn from others' experiences.
  • One Problem, One Solution - Understand, identify, and utilize the tools necessary to maintain the infrastructure. The tools should fit the job, so obtain them and use them as they were designed to be used.
  • KISS: Keep It Simple...- Simplicity is the key to planning, developing, implementing and operating a web-based infrastructure. Endeavor to minimize web-based infrastructure points of control and contention, and the introduction of variables.

The Marcus and Stern book [8] is an excellent reference for preparing for and implementing highly available systems.
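The single-points-of-failure rule above lends itself to a simple automated check. The sketch below is one possible approach, not a method taken from the Marcus and Stern book: it models the infrastructure as an undirected graph and flags any component whose removal disconnects users from a service. The topology and component names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from a list of links."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph: Dict[str, Set[str]], start: str, goal: str, removed: str) -> bool:
    """Breadth-first search from start to goal, treating one node as failed."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(graph: Dict[str, Set[str]], source: str, target: str) -> List[str]:
    """Return components whose individual failure cuts source off from target."""
    return sorted(
        node for node in graph
        if node not in (source, target)
        and not reachable(graph, source, target, node)
    )


# Hypothetical topology: two firewalls, one load balancer, two web servers.
edges = [
    ("internet", "fw1"), ("internet", "fw2"),
    ("fw1", "lb"), ("fw2", "lb"),
    ("lb", "web1"), ("lb", "web2"),
    ("web1", "db"), ("web2", "db"),
]

if __name__ == "__main__":
    graph = build_graph(edges)
    print(single_points_of_failure(graph, "internet", "db"))
```

In this hypothetical topology the firewalls and web servers are duplicated, so the check reports only the lone load balancer ("lb") as a single point of failure between the Internet and the database.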

Reengineering the continuity planning process involves not only reinvigorating continuity planning processes but also ensuring the web-based enterprise needs and expectations are identified and met through implementation of continuous availability disciplines.

  • Chapter Summary

The failure of organizations to measure the success of their CP implementations has led to an endless cycle of plan development and decline. The primary reason for this is that a meaningful set of CP measurements has not been adopted to fit the organization's future state goals. Because these measurements are lacking, expectations of both top management and those responsible for CP often go unfulfilled. Statistics gathered in the Contingency Planning & Management/KPMG Continuity Planning Survey support this assertion. Based on this, a radical change in the manner in which organizations undertake CP implementation is necessary. This change should include adopting and utilizing the Business Process Improvement (BPI) approach for CP. This BPI approach has been implemented successfully at many Fortune 1000 companies over the past twenty years. Defining CP as a process, applying the concepts of the CP Value Journey, expanding CP measurements utilizing the CP Balanced Scorecard, and exercising the Organizational Change Management (OCM) concepts will facilitate a radically different approach to CP. Finally, since web-based business processes require 24x7 uptime, implementation of continuous availability disciplines is necessary to ensure that the CP process is as fully developed as it should be.




Table 1

Definitions [6]:

Activities - Activities are things that go on within a process or sub-process. They are usually performed by units of one (one person or one department). An activity is usually documented in an instruction. The instruction should document the tasks that make up the activity.

Benchmarking - Benchmarking is a systematic way to identify, understand, and creatively evolve superior products, services, designs, equipment, processes, and practices to improve the organization's real performance by studying how other organizations are performing the same or similar operations.

Business Process Improvement (BPI) - Business Process Improvement is a methodology that is designed to bring about step-function improvements in administrative and support processes using approaches such as FAST, process benchmarking, process redesign, and process reengineering.

Comparative Analysis - Comparative Analysis is the act of comparing a set of measurements to another set of measurements for similar items.

Enabler - An enabler is a technical or organizational facility/resource that makes it possible to perform a task, activity, or process. Examples of technical enablers are personal computers, copying equipment, decentralized data processing, voice response, etc. Examples of organizational enablers are enhancement, self-management, communications, education, etc.

FAST - Fast Analysis Solution Technique is a breakthrough approach that focuses a group's attention on a single process for a one or two-day meeting to define how the group can improve the process over the next 90 days. Before the end of the meeting, management approves or rejects the proposed improvements.

Future State Solution - A future state solution is a combination of corrective actions and changes that can be applied to the item (process) under study to increase its value to its stakeholders.

Information - Information is data that has been analyzed, shared, and understood.

Major Processes - A major process is a process that usually involves more than one function within the organization structure, and its operation has a significant impact on the way the organization functions. When a major process is too complex to be flowcharted at the activity level, it is often divided into sub-processes.

Organization - An organization is any group, company, corporation, division, department, plant, or sales office.

Process - A process is a logical, related, sequential (connected) set of activities that takes an input from a supplier, adds value to it, and produces an output to a customer.

Sub-process - A sub-process is a portion of a major process that accomplishes a specific objective in support of the major process.

System - A system is an assembly of components (hardware, software, procedures, human functions, and other resources) united by some form of regulated interaction to form an organized whole. It is a group of related processes that may or may not be connected.

Tasks - Tasks are individual elements and/or subsets of an activity. Normally, tasks relate to how an item performs a specific assignment.


References:

  1. Contingency Planning & Management, January/February 2001. (The survey was conducted in the U.S. in October 2000 and consisted of readers and respondents drawn from Contingency Planning & Management magazine's domestic subscription list. Industries represented by respondents include Financial Services, Manufacturing/Industrial, Telecommunications, Education, Utilities, Healthcare, Insurance, Retail/Wholesale, Petroleum/Chemical, Information/Data Processing, Media/Entertainment, and Computer Services/Systems.)

  2. H. James Harrington, Erick K. C. Esseling, Harm Van Nimwegen, Business Process Improvement Workbook, McGraw-Hill, 1997.

  3. Harrington, p. 18.

  4. Harrington, p. 19.

  5. Robert S. Kaplan, David P. Norton, The Balanced Scorecard: Translating Strategy Into Action, HBS Press, 1996.

  6. Harrington, pp. 1-20.

  7. Gartner Group RAS Services, COM-12-1325, 29 September 2000.

  8. Evan Marcus and Hal Stern, Blueprints for High Availability: Designing Resilient Distributed Systems, John Wiley & Sons, 2000.