Your IT organization probably has a disaster recovery (DR) plan – but have you really thought through what it would take to implement it if you had to?
You may have watched on live TV recently as first United Airlines and then the New York Stock Exchange ground to a halt as a result of computer systems failure. United endured a worldwide ground stop for around 90 minutes caused by a loss of connectivity between systems that needed to connect to each other for the overall business to operate. Annoying and embarrassing for the business and its customers, but recoverable — it wasn’t on live TV the whole time and operations were back to nearly normal within a few hours.
The NYSE wasn’t so lucky. It was down for more than four hours, on live TV the entire time, and only just able to complete the close-of-market process at the end of the day. Its “glitch” seems to have been an error introduced during a software update, which likewise caused several systems that need to connect to lose some of that connection capability.
Fortunately for the market as a whole, all the other exchanges continued to function and the loss of 20% of trading capacity in New York was absorbed easily by the other players. Even the floor traders in New York could make trades elsewhere. So the incident was not a disaster, but highly embarrassing nonetheless.
Both United and the NYSE have detailed plans regarding what to do if their primary computer facilities aren’t available. While the details aren’t public, most such plans follow a similar model: Make sure that the state of critical information is maintained. Recreate that state at an alternate site. Cancel any transactions past the last known good state (called the “recovery point” in DR speak). Restart the systems at the recovery site as of the recovery point. Rerun any cancelled transactions to “catch up” to the point of failure. Verify all is good. Restore services to customers.
The elapsed time to do this (the recovery time) can vary from a few milliseconds to several hours or even a couple of days. How your DR process is actually architected depends on how long a recovery time you can tolerate — the “recovery time objective,” in the same DR speak.
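The recovery sequence above can be sketched in a few lines. This is a toy model, not any vendor's implementation: the transaction log, the `Site` class, and the `fail_over` function are all hypothetical, standing in for whatever replication and replay machinery a real DR setup uses.

```python
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    applied: list = field(default_factory=list)  # transactions applied at this site

def fail_over(txn_log, recovery_point, recovery_site):
    # 1. Cancel anything past the last known good state (the recovery point).
    good, cancelled = txn_log[:recovery_point], txn_log[recovery_point:]
    # 2. Recreate that state at the alternate site.
    recovery_site.applied = list(good)
    # 3. Rerun the cancelled transactions to "catch up" to the point of failure.
    for txn in cancelled:
        recovery_site.applied.append(txn)
    # 4. Verify all is good before restoring service.
    assert recovery_site.applied == txn_log
    return len(cancelled)  # how much work had to be replayed

log = ["t1", "t2", "t3", "t4", "t5"]
backup = Site("dr-site")
replayed = fail_over(log, recovery_point=3, recovery_site=backup)
print(replayed)  # 2 transactions replayed
```

The gap between `recovery_point` and the end of the log is what drives your recovery time: the further behind the replica, the more there is to replay before you can reopen for business.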
In any large-scale modern business, however, it’s seldom that simple.
In addition to moving all the data and software needed to run your business systems, every external connection to your data center has to be “pointed” to the recovery site. In the worst case, some of your customer or supplier systems may also have been impacted (think hurricane or earthquake or major regional power outage), and they will also be moving to their backup sites, which will have to be redirected to connect to your backup site.
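At its core, the repointing exercise is an inventory problem: find every connection that still targets the failed site and redirect it. In practice that happens through DNS changes, routing updates, or carrier reconfiguration; the endpoint map and hostnames below are purely illustrative.

```python
PRIMARY = "dc1.example.com"
RECOVERY = "dr1.example.com"

# A (tiny) inventory of external connections and where they currently point.
connections = {
    "partner-feed": PRIMARY,
    "card-processor": PRIMARY,
    "supplier-edi": "partner-dr.example.com",  # a partner already moved to THEIR backup site
}

def repoint(conns, old, new):
    # Redirect every connection that still targets the failed site.
    moved = 0
    for name, endpoint in conns.items():
        if endpoint == old:
            conns[name] = new
            moved += 1
    return moved

moved = repoint(connections, PRIMARY, RECOVERY)
```

The third entry hints at the harder case from the text: when partners are failing over at the same time, both ends of the connection are moving targets, and no one-sided update like this is enough.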
With possibly thousands of such connections (when I did this for a living, I once had to manage over 80,000 external connections, many hardwired), you will need to work hand in hand with your wide area network vendors to make all this possible in a reasonable time, but they may be impacted too — or overwhelmed by similar requests from other area businesses.
And what happens when you also have to move some or all of your people from where they normally work to another site, which may not have all the facilities they are accustomed to having available?
So activating a disaster recovery plan isn’t an easy decision. Your incident response team will have to judge how long it will take to fix the problems you’re facing and assess whether it’s better to wait than to switch over to the DR site. This should seldom be an IT-only decision; there will be business impacts no matter what you do, so the final call should rest with the business. And then there’s the issue that too many DR plans never address — how do you return to normal operations once the incident that triggered the DR plan is resolved?
Neither United nor the NYSE activated their DR plans. Both elected (correctly in my view) to fix the issue in situ, although for the NYSE it must have been an increasingly tough call. It would have been a lot tougher if there hadn’t been a dozen other exchanges able to handle trades. And moving to a DR stance may not have helped if the problem was connections between systems rather than the systems themselves.
You’ll probably never have to make those kinds of choices on live TV with billions of transactions at risk. But it’s worth asking just how well your DR plan would work in practice and what scenarios would cause you to set it in motion instead of trying for a fix.
I’m sure you test the plans at least once a year. Maybe you even run “for real” occasionally at the DR site, in which case you know you can both get there and return. But if you’ve never done so, best of luck.