Issue Archive: January/February 2008
Disaster Recovery: Doing the Basics Better
Authors: Mark Hughes and Robert DiLossi
As many are all too painfully aware, a disaster is an unplanned event, something which cannot be predicted. However, this does not mean that organizations are at the mercy of chance. Organizations need a strong focus on three core elements (people, processes and planning), coupled with decisive leadership, in addition to technology. This is a “back to basics” approach to disaster recovery.
The Heartbeat of an Organization
No matter how advanced an organization’s technology may be, it’s the employees who are the heartbeat of a company, since they recover the environment and resume business processes at the time of disaster. To illustrate this point, an organization that has 100 percent of its data fully replicated and running at a hot site will remain paralyzed if it doesn’t have the right people there who are trained and equipped to take the reins and bring the organization back onto its feet.
Organizations need to ensure the right people are available to execute business recovery processes, based on preplanned guidance and scenarios. Organizations must also plan for the possibility that lower level staff may have to make critical decisions due to a lack of executive team availability. This requires an understanding of who represents the organization’s “brain trust” during the normal course of business, and establishing a “back-up chain of command.” Those representing the brain trust must then prepare their respective “back-ups” in the chain to make decisions in their absence.
A focus on people also requires an organization to comprehensively consider and document a wide range of processes and issues including: Where are the people going? Who’s going, and who determines who is making the trip? How are they going to get there? How long are they going to stay there? If this is a long term recovery, what about their families?
Finally, a focus on people requires clear leadership. Leadership, decisive action and flawless execution are the cornerstones to building a sense among employees that their bosses know what they are doing. Leadership in disaster recovery should not be considered solely an area of IT responsibility. Because disasters and disruptions yield a business impact, they are very much a business — and not just an IT — problem. As such, business executives’ perceptions and behaviors must shift accordingly to uphold and maintain this sense of leadership.
The need for greater business-level involvement and support was brought to life in a recent IDG Research survey which found nearly half of all respondents (primarily CIOs and CSOs) reporting that the business sides of their houses require — and expect without question — an RTO of less than 12 hours. Moreover, nearly one quarter of respondents reported that they expect their recovery windows to decrease even further in the coming year.
In spite of these heightened demands, IT executives and teams see their overall disaster recovery budgets remaining flat at a mere six percent of overall IT budgets. This discrepancy sends a mixed message and can lead to the perception among IT staff, and employees overall, that disaster recovery is neither a top priority nor something to which significant time and attention should be devoted. It is not surprising that the IDG survey respondents gave their organizations an average grade of C+/B- for their disaster recovery plans.
Processes and Documentation
When a disaster strikes there are three major steps to begin the process of managing the incident: mobilizing a central command center; activating a business recovery plan; and identifying exactly how long the organization will operate in a recovery state, and planning accordingly. Following closely behind the imperative of managing people is the need for organizations to carefully document their processes, both in terms of how to recover and how to operate. Organizations also need to practice and refine processes using a variety of scenarios.
For example, if a central command center becomes unavailable due to a natural disaster, where is the default command center? What applications and systems should be brought online first, and in what order? Does this answer vary based on the nature of the business disruption? What level of damage to an organization’s primary site warrants a complete move to a new site? At what point are conditions considered safe or stable enough for employees to report back to the primary site, and what is the proper process for communicating this throughout the organization?
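The question of what to bring online first, and in what order, can be documented as a dependency map and checked mechanically. The following is a minimal sketch, not a prescribed method: the system names and their dependencies are hypothetical, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical systems: each key maps to the set of systems it depends on.
dependencies = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage"},
    "email": {"network"},
    "order_entry": {"database"},
}

def recovery_order(deps):
    """Return a bring-up order in which every system follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = recovery_order(dependencies)
print(order)
```

A different disruption scenario could be modeled by passing a different dependency map, which mirrors the point that the answer may vary with the nature of the disruption.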
Just as new technologies have a tendency to distract focus away from people issues, they may blur an organization’s focus on the need for establishing and documenting processes. The stark reality is that many new technologies introduce complexities that make this need more urgent than ever.
A case in point is virtualization. Through the virtual partitioning of hardware to support multiple operating systems, virtualization can deliver a level of redundancy on a single piece of hardware that can be used for disaster recovery purposes. That is, virtual machines support failover from one partition to another partition in the same server, in the event that the first partition goes down or experiences a disruption. A common mistake is to believe virtual machines do not require the same level of back-up processes and documentation as their non-virtualized counterparts.
But in fact, this couldn’t be further from the truth.
The constantly changing nature of business applications and the IT infrastructures supporting them requires the regular capture and storage of virtualized IT blueprints, also known as images. Images are often set up in-house and undocumented, and the challenge of re-staging an image can be so great that in the event of a disaster, an IT department may have to start from scratch — thereby crippling recovery point and recovery time objectives.
When you consider virtual machines are inherently more critical due to the sheer number of applications they support — and the fact that virtual machines are vulnerable to the same threats of damage and destruction as non-virtualized machines — it becomes clear even greater consideration of back-up processes and documentation for the storage of images is needed. Processes pertaining to the back-up of virtual servers are quite different (and more complex) than more traditional infrastructures and must cover such questions as:
- Where should virtual images be captured and stored, and at what time intervals?
- If our primary infrastructure experiences a disruption, how will we ensure images continue to be captured and stored?
- Should we enlist a third-party who would have power to re-activate our images on a remote, duplicate infrastructure?
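One way to make the capture-interval question concrete is to record when each image was last captured and flag any that have aged past the agreed interval. This is only an illustrative sketch: the image names, timestamps and the 24-hour interval are assumptions chosen for the example, not figures from the survey or from any particular product.

```python
from datetime import datetime, timedelta

def stale_images(last_captured, now, max_age):
    """Return the names of images whose last capture is older than max_age."""
    return sorted(name for name, ts in last_captured.items() if now - ts > max_age)

# Hypothetical capture log: image name -> time of last successful capture.
now = datetime(2008, 1, 15, 12, 0)
last_captured = {
    "payroll-vm": datetime(2008, 1, 15, 9, 0),   # captured 3 hours ago
    "web-vm": datetime(2008, 1, 13, 12, 0),      # captured 48 hours ago
}

overdue = stale_images(last_captured, now, timedelta(hours=24))
print(overdue)  # ['web-vm']
```

A check like this is only useful if it runs somewhere that survives a disruption at the primary site, which is the concern behind the second question in the list.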
Communicating and Practicing
Once processes are established and documented, the next critical step is to effectively communicate processes to employees and thoroughly practice the actual execution of the process parameters, or the business recovery plan.
Communication is one area where many organizations fall very short. According to the IDG Research survey, almost half (44 percent) of respondents indicated their organizations never communicate an overall business continuity plan to employees. Fifty-nine percent stated that their organizations do not articulate the business continuity plan to key external stakeholders.
These responses — indicating a less than sufficient approach — are in stark contrast to an internal and external threat environment that is evolving and growing more sophisticated each day. In fact, over 92 percent of survey respondents reported they have encountered at least one disruptive event over the past year as a result of one of these threats, ranging from natural disasters like earthquakes and floods, to network hackers and disgruntled employees, to more mundane scenarios like network outages or hardware failures. Furthermore, there are potentially disruptive events many organizations do not even consider — like a critical team member getting in a car accident, requiring emergency surgery or being inaccessible on an airplane; or a flu virus outbreak that incapacitates a large group of employees.
Should a potentially disruptive event of any kind occur, the fact that many employees would not be aware of an overall business continuity plan — or their roles within such a plan — is alarming, to say the least.
These responses also seem at odds with heightened expectations for greater corporate transparency and communications on the part of all critical stakeholders, extending beyond employees to partners, customers and shareholders. Stakeholders should expect a comprehensive picture of how their organizations would respond to a business disruption, as a means of holistically understanding their own risk and adopting precautionary, self-protective measures when needed. Here, again, the need for leadership becomes apparent. A commitment to business resiliency that is continually communicated at the highest levels of an organization reinforces a perception that the organization is in good hands and “knows what it is doing.”
Finally, the importance of testing and conducting disaster recovery dry-runs cannot be overstated. IDG Research also uncovered an alarming percentage of respondents whose companies do not test their plans often enough for the plans to be effective during a disruption. A total of 80 percent indicated they test their disaster recovery plans annually or less often.
In the past, annual testing may have been sufficient to accommodate changes in business processes; however, today, organizations are facing ever-increasing, nearly constant change to mission-critical processes and systems, from staff updates to alterations in hardware and software, sometimes daily. As a result, business continuity plans become outdated much more quickly than in the past. An organization that tests its plans once a year, or less often, may be able to recover, but the road to recovery is apt to be longer and riddled with speed bumps that can take their toll on business resumption and corporate reputation.
Dry-runs and tests need to be conducted in as close to a real-world simulation as possible, and must involve anyone and everyone with a role in executing the business recovery plan — including senior executives. We have found people undergoing very real-world simulations are sometimes so panicked they can’t even dial 911, so repetition is also key. A good rule of thumb is — if people aren’t running, your test isn’t effective.
Here’s an example of testing done right. Recently, a large international financial corporation declared a disaster, and because it tested its people, processes and plans quarterly, its incident management process went off without a hitch. The organization placed its employees on stand-by with instructions to possibly report to the backup location; disaster mobilization was initiated, all employees were accounted for and assignments were distributed. This happened at the height of summer vacation season, when many employees were out, yet the communications processes proved highly effective. Designated employees arrived at the organization’s hot site and worked for several days until it was deemed safe to return to their primary location. The business never skipped a beat.
Wrapping Up
Consider the image of Rudy Giuliani flanked by the NYC Fire Commissioner and NYC Police Commissioner after 9/11. They had run tests and leveraged valuable experience to bring a real-world situation under control, and they instilled trust and confidence in the public when we needed it the most. While most potential disruptions will pale in comparison to 9/11, non-government entities must be able to address their critical stakeholders in an equally effective manner in the days and weeks following a disruption. This is the key to maintaining accountability, preserving order and conveying a sense that the situation is under control by smart, equipped people.
By preserving their ability to function, commercial organizations play an equally fundamental role in maintaining societal continuity and upholding national security — so why have they not been taking the appropriate steps? Non-government entities need to follow government’s example and plan for their people to deal with any type of theoretical situation which may come to pass. This means doing the basics better and making sure people, processes and plans are backed up and tested properly.
Only once these basics are mastered can technology yield a positive impact. Quite simply, the best technology in the world is not going to save you if people, processes and planning are not given due attention first.
Mark Hughes is the executive vice president of operations for SunGard Availability Services. He can be reached at Mark.Hughes@sungard.com. Robert DiLossi is the director of crisis management for SunGard Availability Services. He can be reached at Robert.DiLossi@sungard.com.

