Salesforce Outage: One More Nail in the Coffin of “I can trust my SaaS provider…”
By Keith Taylor, Product Specialist at Zerto
Rarely does an outage rocket to infamy like the Salesforce #NA14 incident. But when your service is relied upon by millions of users around the globe, there’s just no way to escape scrutiny when one of your data centers goes down, let alone when it stays offline for almost 20 hours and you lose nearly 4 hours of your customers’ data!
I’m sure by now we all know the background to the story. But just in case you’ve had your head in the clouds (pun absolutely intended), we’ll recap the issues that led to the generation of countless articles, memes and even t-shirts.
On May 9th at around 5:47pm (PDT), a power failure affected a portion of Salesforce’s Washington data center and the #NA14 instance became inaccessible. The team established that the cause was the failure of a circuit breaker (despite it having passed load tests in March 2016).
The decision was made to perform a site switch and fail over to another data center in Chicago. This was done and the “all clear” was given at 7:39pm the same day. Not long after this, however, engineers began to notice performance degradation, which swiftly escalated to full-scale service disruption. It was later discovered that this was due to a firmware bug in the storage array, which caused a significant increase in write times and in turn led the database to begin timing out. Eventually a single write did not complete, which resulted in file discrepancies and a database cluster failure.
This was where the real problems began. The file discrepancies responsible for the database failure had been replicated back to the Washington data center, while the backup in Chicago had not completed in time and so could not be used as a restore point either. After multiple attempts to restore the service in Chicago failed, Salesforce teams decided to restore from the Washington backup instead. Unfortunately, this had not been updated after the site switch to Chicago, so it resulted in the loss of all customer data entered between 2:53am and 6:29am (PDT) on May 10th.
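To put a number on that loss, here is a minimal sketch that works out the length of the data-loss window from the two times quoted above. The function name and the year (2016, inferred from the March 2016 load test mentioned earlier) are assumptions made for illustration; this is not taken from Salesforce’s own reporting.

```python
from datetime import datetime, timedelta

# Times quoted in the article (PDT). The year 2016 is an assumption.
LOSS_WINDOW_START = datetime(2016, 5, 10, 2, 53)
LOSS_WINDOW_END = datetime(2016, 5, 10, 6, 29)


def data_loss_window(start: datetime, end: datetime) -> timedelta:
    """Return how long the span of unrecoverable writes lasted."""
    if end < start:
        raise ValueError("end of window is earlier than its start")
    return end - start


window = data_loss_window(LOSS_WINDOW_START, LOSS_WINDOW_END)
print(f"Data lost over a window of {window}")
```

That comes to 3 hours and 36 minutes, which is where the “nearly 4 hours” figure comes from. In disaster recovery terms, this is the recovery point the customer actually got, whatever the target may have been.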
There is no Cloud
The #NA14 incident brings front and center the concerns many businesses have about leveraging the benefits of the cloud and SaaS models. What we seem to forget all too easily is that the cloud is, in its simplest form, just someone else’s computer.
© https://www.chriswatterston.com/
When we make use of cloud services, we are essentially relinquishing control of our own platforms, services, data and security, and handing responsibility for all these things to someone who doesn’t necessarily have to deal with the fallout when things go TITSUP (Total Inability To Support Usual Performance).
Being reliant upon a third party for the provision of mission-critical services can be a bit of a gamble. There are a number of risks to consider when we look to utilize SaaS platforms, but by putting in place proper contracts and controls, policies and procedures, we should be able to manage them.
Be sure that your provider’s operational resiliency and reliability measures are not only adequate, but also align with your own high standards. Responsibility for security measures, backup and/or disaster recovery processes, hardware quality and the application of software updates – all these things need to be assessed for potential risks and guaranteed in some manner.
Many organizations, particularly those in Healthcare, Finance and Insurance or the Public Sector, will have legal requirements to ensure that their service providers have suitably strong control measures in place to protect and ensure the availability of their customers’ data. There may also be geographical restrictions in some countries which control where the data can be held or transferred.
The viability of the vendor itself as an operational business entity also needs to be considered: financial risks, political problems and even the potential for criminal proceedings to be brought (however unlikely) should all be taken into account. Should your SaaS provider fail entirely in any one of these regards, they may allow you to get your data back, depending on your contractual agreements. But even then, the business will still suffer the downtime of transferring that data to an alternative infrastructure.
These are just some of the potential risks you may face with any SaaS provider. But the point here is simply to get you thinking – sometimes when we don’t look after something directly, we forget to take into full consideration the robustness of the underlying operations. Sometimes – and this is arguably worse – we assume just because of the size or scale of the business that we’re safe.
Lessons Learned
During the financial crisis of 2008 onwards, the term ‘too big to fail’ became synonymous with the banking industry. A number of financial institutions demonstrated all too well that this simply isn’t true. In fact, the bigger you are, the harder you fall. We, as both businesses and consumers, placed an inordinate amount of trust in these institutions to behave and operate for the benefit of their customers, rather than for their own profitability.
The question now becomes – are we making sure that we’ve learned our lessons and are asking enough of our SaaS providers too?