CrowdStrike Outage: A Stark Reminder about IT Updates
If you are reading this, you probably know what happened, but in case you don’t, let me put it simply: CrowdStrike issued a software update that caused certain Windows machines to crash with a blue screen of death (BSOD) and fail to boot. The result was a major outage across many sectors, including transportation, as CrowdStrike customers scrambled to roll back their systems or implement a workaround to restore them to working order.
Where Does the Responsibility Lie?
Instead of trying to second-guess what happened and cast blame on top of an already tough situation for CrowdStrike and their customers, I’d rather use this moment as a reminder that we all have a responsibility to prepare for events like this, because they will eventually happen. In a perfect world, we would never allow these types of mistakes. We know, though, that IT is far from perfect, and we all have a responsibility not just to try to prevent these occurrences, but to mitigate the damage from them.
Testing Updates is a Two-Way Street
The software developers who issue updates are not the only ones responsible for testing software; the IT pros who update systems are also responsible for testing those updates before rolling them out to production. The software developer cannot predict the unique and specific combination of software elements you use in your IT environment. Nor could they test all the possible system configurations even if they wanted to. Of course, in the case of this CrowdStrike outage, this was not much of a factor: the update affected a basic Windows configuration, which is why it was able to take down so many systems.
Whether for workstations or servers, testing every software update is a key step in avoiding disruptions. There is always the potential for a conflict between two pieces of software running on the same system. Anyone rolling out updates to their own systems should test them first in an isolated environment that is as close to the production environment as possible. This is not always easy, though, because systems and workstations within an organization often do not share the same software loadout.
In virtualized environments, both on-premises and in the cloud, it is possible to recreate production environments quickly and accurately. Using backups or recovery data replicated in real time, workloads can be quickly recreated (or cloned) in virtual environments with the current production configurations and profiles while isolated for testing. Testing against such isolated replica environments provides greater validation and assurance. Not everything can be replicated exactly in a test environment, such as precise performance loads, but an isolated replica gets as close as possible.
Such an isolated environment not only makes it possible to test, but to do so without impacting production environments. Because networking, storage, and compute resources are isolated, tests cause no disruption to production. Using this method, testing can be done at any time, whether during primary business hours or after hours, with neither users nor customers being impacted. There really should be no excuse for failing to test software updates prior to rollout within an organization.
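To make the idea concrete, here is a minimal sketch of a "test before rollout" gate. It assumes you can express each validation you run against the isolated clone as a simple pass/fail check. The names `run_checks` and `approve_for_production` are hypothetical and used only for illustration; they are not part of any vendor's tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run each check against the isolated test clone (hypothetical helper).

    A check that raises an exception is recorded as a failure, so one crash
    does not hide the results of the remaining checks.
    """
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append(CheckResult(name, passed))
    return results


def approve_for_production(results: List[CheckResult]) -> bool:
    """Approve the update only if at least one check ran and all passed."""
    return bool(results) and all(result.passed for result in results)
```

In practice the checks would be things like "the cloned machine boots after the update" or "the line-of-business application starts." The point of the design is that an empty or partially failed test run never counts as approval.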
The Ability to Roll Back
No testing system is 100% effective. Inevitably, some unforeseen problem will arise from a software update or configuration change, and a disruption will occur. The question is how to get back to the point before the disruption. Of course, this depends on what caused the disruption: it might be fixed by simply toggling a setting back, but it may not be so simple. It may require restoring a system through blunt-force disaster recovery. Such an event would qualify as a disaster on some scale, with perhaps hundreds or thousands of users or customers unable to access important systems.
The CrowdStrike outage directly affected thousands of workstations, and each one needed to be remediated individually, which is unfortunate and time-consuming. Thousands of users and customers can similarly be affected by a single server disruption, but in that case only that specific server needs to be remediated. Because every incident may differ, you should be prepared to roll back anything from a single configuration setting to one or more production workloads restored from recovery data. Some rare disruptions from updates could be so severe that they require a full disaster declaration and a failover to a DR site.
Unfortunately, rolling back a workload to a previous point in time can be painful, especially if that rollback loses hours’ or days’ worth of data. Luckily, with continuous data protection technologies, recovery points can be available within seconds of when an update is made. When using traditional backup and snapshot technologies, it is advisable to take a backup or snapshot immediately before making system updates so that a more recent recovery point is available.
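The difference a pre-update recovery point makes can be shown with some simple arithmetic. The sketch below assumes you have a list of recovery point timestamps and the time the disruption began. The function names are hypothetical and the timestamps in the usage note are illustrative, not taken from any real incident.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def latest_recovery_point(
    points: Iterable[datetime], disruption: datetime
) -> Optional[datetime]:
    """Return the most recent recovery point taken before the disruption."""
    candidates = [point for point in points if point < disruption]
    return max(candidates) if candidates else None


def data_loss_window(
    points: Iterable[datetime], disruption: datetime
) -> Optional[timedelta]:
    """Return how much data a rollback would lose, or None if no point exists."""
    point = latest_recovery_point(points, disruption)
    return None if point is None else disruption - point
```

For example, with only a nightly backup at 01:00 and an update that breaks the system at 14:00, `data_loss_window` returns 13 hours. Adding a snapshot taken at 13:55, just before the update, shrinks that window to 5 minutes.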
What About the Responsibility of the Software Vendor?
Of course, the software vendor has a responsibility to not deliver an update that is going to disrupt their customers’ systems. We are all imperfect humans and mistakes happen, and people should be held accountable for those mistakes. As I stated earlier, though, a software vendor cannot test for every type of environment. There are plenty of situations beyond this CrowdStrike outage where updates cause unforeseen problems specific to unique IT environments with combinations of settings and software that the vendor could not be aware of. This is why testing is important on both ends of the delivery process so that you can know specifically how the software update will affect your environment.
If you are not already, utilize a full range of management and recovery tools that assist you in testing quickly and easily and in rolling back if the rollout does not go as planned. Use recovery solutions that offer non-disruptive testing environments and roll back to seconds before a disruption occurs. Make testing a priority for all software updates prior to rolling out to production. We cannot prevent 100% of outages, but collectively we can prevent most and mitigate the impact of those that occur.
Learn how you can easily create high-fidelity replicas/clones of your production environment with Zerto. Also, check out our solution’s non-disruptive testing and point-in-time recovery options.
And, if you are ready, try Zerto for yourself in one of our on-demand, hands-on labs, or get in touch to discuss your needs further.