Widespread IT outages on July 19, caused by a buggy update from cybersecurity company CrowdStrike, created flight delays, news broadcaster disruptions and blue screens of death on PCs of all types of companies. Banks dealt with their own share of issues. While the effects were "scattered," according to a spokesperson for the Financial Services Information Sharing and Analysis Center,
The aftermath has been painful for CrowdStrike: Lawyers have begun courting litigants for a class action against the company, Delta passengers have filed a lawsuit against the airline over its meltdown and the airline has publicly traded barbs with CrowdStrike over who's responsible.
Meanwhile, CrowdStrike has publicly released increasingly detailed accounts of what caused the Channel File 291 fiasco — named for the specific file that included a misconfiguration that caused millions of Windows systems to crash. Last week, the company released its so-called "
Among the findings was new information about the logic error that caused the crashes. According to the analysis, the buggy update to Channel File 291 was the first one to include 21 parameters, or settings that configure how the threat detection software works. While the Windows software was meant to handle 21 parameters, it was erroneously programmed to only handle 20 — seemingly because use of the 21st parameter was so rare.
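To make the mismatch concrete, here is a minimal Python sketch of the general class of bug: a rule that refers to a 21st value when only 20 are supplied. This is not CrowdStrike's code. The names `validate_rule` and `load_update`, and the idea of representing a rule as a list of input positions, are hypothetical and chosen only to show how a count check before use turns a crash into a rejected update.

```python
EXPECTED_FIELDS = 21  # parameters defined by the update (figure from the article)
SUPPLIED_INPUTS = 20  # parameters the software was programmed to handle


class ParameterCountError(ValueError):
    """Raised when a rule references an input the caller never supplies."""


def validate_rule(field_indexes, supplied_count):
    """Reject a rule up front if any referenced position is out of range."""
    for index in field_indexes:
        if not 0 <= index < supplied_count:
            raise ParameterCountError(
                f"rule references input {index + 1}, "
                f"but only {supplied_count} are supplied"
            )


def load_update(field_indexes, supplied_count=SUPPLIED_INPUTS):
    """Return True if the update is safe to apply, False if it is skipped."""
    try:
        validate_rule(field_indexes, supplied_count)
    except ParameterCountError:
        return False  # skip the bad update rather than fail at run time
    return True
```

In this sketch, `load_update(range(SUPPLIED_INPUTS))` is accepted, while `load_update(range(EXPECTED_FIELDS))` is rejected because its last position has no matching input.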
"While this scenario with Channel File 291 is now incapable of recurring, it also informs process improvements and mitigation steps that CrowdStrike is deploying to ensure further enhanced resilience," reads CrowdStrike's analysis.
Experts have issued their own reports on the matter. Among them is the consulting group Forrester, whose report offers lessons from the incident beyond what caused it, including an examination of what companies need to do to ensure they can recover from IT disasters.
Here are two of the top findings from the retrospectives from CrowdStrike and Forrester.
Carefully consider data protection versus recovery
One of the most important exacerbating factors of the July 19 incident was that, when IT operatives went to restore their crashed Windows computers, they were faced with a prompt for their BitLocker recovery key and a disturbing realization that they were not sure where those keys were, or that the system that held the keys had also crashed.
As Forrester analysts
BitLocker is a Microsoft Windows feature that provides encryption for the computer to mitigate the threats of data theft from lost, stolen or inappropriately decommissioned devices. As with many forms of encryption, BitLocker adds protection but creates a new problem around what to do with the key, much like figuring out what to do with a newly created password.
In the case of the July 19 outages, CrowdStrike
In other words, BitLocker might not serve the protective function that users might hope it would.
Beyond that, the July 19 incident also demonstrated that the feature can severely hamper recovery efforts. Even without this extra barrier, recovery is not always as easy as it seems because it can be a highly manual effort, according to Forrester's report.
"Organizations learned with ransomware events that simply 'restoring backups' sounded great but in practice took far more time than expected," the report reads. "This event taught them the same lessons and required hands-on keyboard intervention in many cases."
Forrester's report did not offer a prescription for whether to implement BitLocker, opting instead to advise companies to weigh the data protection advantages of the feature against its limitations and negative impact on recovery times. Automated recovery and fail-over (spinning up backup systems when the main systems fail) can help improve resilience against events such as the CrowdStrike incident.
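As a rough illustration of the fail-over idea mentioned above, the following Python sketch picks the first healthy system from a priority list. The function name `pick_active`, the system names and the health check are all hypothetical. A real deployment would rely on its platform's own health probes and orchestration rather than a loop like this.

```python
def pick_active(systems, is_healthy):
    """Return the first healthy system in priority order, or None if all are down.

    `systems` is an ordered list of names (main system first, backups after).
    `is_healthy` is any callable that reports whether a named system is up.
    """
    for name in systems:
        if is_healthy(name):
            return name
    return None


# Hypothetical status table standing in for real health probes.
status = {"primary": False, "backup": True}
active = pick_active(["primary", "backup"], lambda name: status.get(name, False))
```

With the primary marked as down in the hypothetical status table, `active` is `"backup"`.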
Exercise control over automated updates
One of the key changes CrowdStrike has made in the wake of July 19 is that customers can now set when and where so-called "Rapid Response Content" updates get deployed. In other words, customers can now set a time delay on how quickly their systems are updated.
Rapid Response Content lets CrowdStrike push new threat intelligence to customers' computers quickly, so that those computers can respond to novel threats. It is one of the fundamental differentiating features for CrowdStrike, enabling the company to shut down new cyber threats across all its customers within hours.
The feature, however, also enabled the company to accidentally crash many of its customers' computers.
Ideally, a firm would be able to trust that every automatic update it receives is sound, enabling it to protect itself against novel threats without delay. But, because problems sometimes arise in automatic updates, adding a bit of delay can give the firm time to see how the update is affecting others.
CrowdStrike customers now have the ability to add just such a delay to the company's updates. The company said in its root cause analysis that one change it has implemented is enabling customers to choose where and how quickly their systems are protected from novel cyber threats — and the threat of a buggy update.
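The trade-off behind such a delay can be sketched in a few lines of Python. This is an illustration of the general idea only, not CrowdStrike's configuration interface. The function `should_apply`, its parameters and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def should_apply(released_at, now, hold_period, reported_bad=False):
    """Apply an update only once it has aged past the hold period
    and no problems have been reported for it in the meantime."""
    if reported_bad:
        return False
    return now - released_at >= hold_period


# Hypothetical example: hold every update for six hours after release.
released = datetime(2024, 7, 19, 4, 0)
hold = timedelta(hours=6)

too_early = should_apply(released, datetime(2024, 7, 19, 5, 0), hold)
aged_out = should_apply(released, datetime(2024, 7, 19, 10, 0), hold)
flagged = should_apply(released, datetime(2024, 7, 19, 10, 0), hold, reported_bad=True)
```

Here `too_early` is `False`, `aged_out` is `True` and `flagged` is `False`. A longer hold period gives more time to observe problems elsewhere, but it also leaves systems without the new threat intelligence for longer.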
Even so, Forrester analysts implore CrowdStrike users — and all users of software that receives automatic updates — to tread carefully when limiting these updates.
"Auto-updates have been a thing for years," the Forrester report reads. "They have saved the industry significant administrative time and therefore money."
The analysts acknowledged that an incident like the one on July 19 might recur, but they noted that the probability of any given automatic update triggering such an event is low.
"You need to control for it, but don't overdo it," the report advises.