Delta’s Outage Sparks Memories of Past IT Disasters
Guest Post By David Feinglass, Solutions Engineer at Zerto
Reading about the recent outage at Delta, and seeing this article in which the CEO states that the vulnerability in their IT infrastructure was “a surprise to them”, reminded me of a similar personal experience. It speaks to the fact that your understanding of how things actually are is only as valid as what you can test and confirm for yourself.
In 2010, I was working as a Solutions Engineer at a medium-sized national data center and cloud services provider with locations in four major US cities. On December 23, 2010, at around 10 PM, we had a huge outage at the Irvine data center. About 100 colocation clients were down, from small half-cabinet customers with three servers to large 50-cabinet clients with thousands of hosts/VMs, and all of them were, as you might imagine, extraordinarily angry. It was all hands on deck just doing damage control while the Ops staff worked to get power back online. The most commonly asked question, during and after the outage, was “We had redundant power, how could we have a power outage?”
Here’s how:
A typical data center will have redundant power feeds – usually called A side (primary) and B side (secondary). But each power feed is a chain of devices: from the smart power strip in the cabinet that servers plug into, to the large PDU on the floor, to the switchgear that transfers power among the utility company’s transformers, the giant UPS systems, and the diesel generators. Any component on either the A side or the B side can fail, potentially bringing down that entire side. That’s why you have two redundant power feeds, right? In general, a data center only sells about half of its available power on either side, so it can always fail over to the other side if needed and stay within the maximum available power for that side.
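To make that arithmetic concrete, here is a minimal sketch of the capacity check behind that rule. The numbers and function name are hypothetical, purely for illustration:

```python
def can_survive_side_failure(side_a_load_kw, side_b_load_kw, side_capacity_kw):
    """Return True if one side can absorb the other's full load after a failure."""
    # If the A side fails, all of its load lands on the B side (and vice versa),
    # so the combined load must still fit within a single side's capacity.
    return side_a_load_kw + side_b_load_kw <= side_capacity_kw

# Properly provisioned: each side sold to roughly 50% of its rated capacity.
print(can_survive_side_failure(500, 450, 1000))  # True: failover stays within limits

# Oversubscribed, with both sides above 80%: a failover overloads the surviving side.
print(can_survive_side_failure(800, 850, 1000))  # False: breakers trip, everything goes dark
```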
Well, guess what… at that data center, the entire power lineup was designed and installed wrong. The data center had oversubscribed the power, and both the A side and the B side were at over 80% utilization. When a PDU on the A side failed, what should have happened was that all those clients switched over to the B side immediately and kept running. But the B side was overloaded and blew a circuit breaker, so now both sides were down. Attempts to restart the B side failed because there was too much power trying to run through it (oversubscribed), and the circuit breaker kept popping. Clients were getting more and more furious by the minute as every attempt to bring power back up ran into problem after problem. It turned out the circuit breaker had a setting for how much load it took to trip, and that was set wrong – too low. So that got fixed. But the feeds were still oversubscribed, so not all clients could actually fail over to their B side; we had to leave some of them off while we worked on the A side, where the original failure had occurred. Eventually, around noon the next day, Christmas Eve no less, power was restored to both sides. However, they were still way oversubscribed, and there wasn’t enough spare power to redistribute the load properly and have true redundancy.
Compounding the problem, many clients had both their primary and secondary power strips fed exclusively off the A side, or exclusively off the B side. Even if the entire power lineup had been working properly, a failover within their cabinet would have left them on the same feed; if that feed went down, they were completely down – no redundancy, even though they had specifically bought redundant power. Another problem: numerous clients had only one power supply in their server, or bought only one feed (A side only) and connected it to both of their power supplies. It was a magnificent chain of errors across the board.
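To illustrate the kind of check that would have caught this, here is a minimal sketch of a cabinet-level power audit. The inventory format, server names, and field names are all hypothetical; the point is simply that mapping each power supply to its actual feed is a trivial exercise:

```python
# Hypothetical inventory: each server lists which feed (A or B) each of its
# power supplies is actually cabled to.
servers = [
    {"name": "web01", "psu_feeds": ["A", "B"]},  # correctly split across both feeds
    {"name": "db01",  "psu_feeds": ["A", "A"]},  # both PSUs on the A side: no real redundancy
    {"name": "app01", "psu_feeds": ["B"]},       # single power supply: no redundancy at all
]

def audit_power_redundancy(servers):
    """Yield servers that would go dark if a single power feed failed."""
    for server in servers:
        distinct_feeds = set(server["psu_feeds"])
        if len(server["psu_feeds"]) < 2 or len(distinct_feeds) < 2:
            yield server["name"], sorted(distinct_feeds)

for name, feeds in audit_power_redundancy(servers):
    print(f"{name}: all power comes from feed(s) {feeds}; one feed failure takes it down")
```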
Clearly, the data center itself had not properly tested failing over from A side to B side power. Obviously, none of the clients had tested it either – all they would have needed to do was pull the plug on a server and see what happened, and none of them had even done that much. Both the clients and the data center assumed everything would work just fine. Much like the CEO of Delta, they found the vulnerability in their expensive co-location offering to be “a surprise to them”.
Message to our clients and prospects: Just because you think you have redundancy, test it anyway.
Just because you are in an expensive data center doesn’t mean they are configuring their expensive power service properly. Just because you have two network connections, test them anyway. Pull that plug and see what actually happens. Just because you have DR, or High Availability, or Fault Tolerance doesn’t mean it will actually work when you need it, unless you fully test it.
Footnote: the data center provider I worked for had purchased this Irvine data center from the previous owners in 2009 for around $20M. During the purchase, as much due diligence as possible was performed, and the sellers’ statement that “we are fully redundant in our power delivery” was accepted after a routine inspection of the infrastructure. This outage happened in 2010. The resulting firestorm cost three employees their jobs (the ones who had promised that everything was redundant), and the ensuing repairs and upgrades to make the power infrastructure truly redundant cost $5M more, required two new FTEs dedicated to the project for two years, and caused planned outages for all the clients originally affected while their power was re-designed and re-configured. It was rebuilding the airplane while it was flying – not fun for anyone involved. My employer inherited a ticking time bomb; it blew up, and it cost them a ton to repair all the damage. In the end, they trusted the sellers that the power was properly configured. It wasn’t. The customers, in turn, trusted both the sellers and my employer that the power was properly configured. It wasn’t. They both learned to test things the hard, painful, costly way. So did Delta.
Feel free to share this with anyone who thinks their DR or HA is “good enough”. Ask them how they test it. All of it? Regularly?