On July 18, the general public had no idea what “CrowdStrike” was. That all changed when a small programming error in a mandatory, automatically distributed update to the company’s “Falcon” program crippled much of the computing infrastructure of the Western world, grounding flights in at least three states, knocking out 911 response centers, and causing major disruptions to businesses ranging from fast-food restaurants to Formula One teams. Such simultaneous national outages were completely unthinkable as recently as a decade ago, but are likely to occur more frequently in the future due to a confluence of individually innocuous but collectively deadly changes in our global technology infrastructure.
The first of these problems, and the immediate cause of CrowdStrike’s outage, has to do with how software is written in 2024. Historically, computer programs have been the work of small, dedicated teams who knew the product inside and out. Often, one person was responsible for the majority of the work. This was true of both the popular 1982 home video game River Raid, created by Activision employee Carol Shaw, and the powerful UNIX operating system, first created by AT&T’s Ken Thompson. Software created this way tended to be effective, efficient, and largely bug-free, which was important in an era without the possibility of remote software updates. It was also very hard to predict when it would be completed. The average computer programming project before the Internet was like a later Steely Dan record, with a few shadowy figures running the show, no accountability to management, and little incentive to follow anything but their own whims along the way.
Not anymore. Most software today is developed and released in two-week “sprints” by anonymous, interchangeable teams of hired, low-skilled programmers, most of whom are sourced overseas from the lowest bidder. Each programmer is given a small portion of the overall task and rarely has a deep understanding of, or even a desire to understand, how their contribution fits into the overall program. If there is a conflict between the work of two neighboring programmers, the conflict is automatically resolved by the tools they are using to do their work, but not always correctly.
From left, a blue Windows error message appears on a screen at a bus stop on July 22, 2024; a vehicle waits at the US-Mexico border on July 19, 2024; and passengers wait in a long line at an airline counter in the Philippines on July 19, 2024. (Justin Sullivan/Getty, Christian Torres/Anadolu/Getty, Ezra Acayan/Getty)
This creates a culture in which offshore and H-1B programmers are viewed as disposable commodities while onshore managers are invaluable assets, promoting themselves as masters of “agile” or “scrum” methodologies that anonymize and dehumanize the people doing the actual work. The result is that offshore code farms are almost irresistible to America’s tech leaders, even when the promised cost savings don’t materialize and the resulting products are substandard. These days, that’s almost always the case, from “this new phone is slower than my old phone, even though it’s more powerful” to “this plane seems to be falling out of the sky more often than we’d like.”
Of course, no amount of incompetent software can do any harm if it is not installed on your computer or if you have the opportunity to evaluate it on a test system before installing it. In the past, most major systems were operated by experienced personnel who had the final say on what went on “their” computers. It was common to test software patches and updates on a few systems before releasing them to the whole company. This did not happen with CrowdStrike updates, because the Falcon program, which is supposed to protect computers from criminal hacking and external attacks, has privileges that override those of system administrators. The Falcon program could install its own updates from CrowdStrike at any time, without the computer owner’s consent. And it was installed almost everywhere at the same time. And so the domino effect began.
This “absolute authority” is a non-negotiable part of using CrowdStrike software. Because clients are not allowed to exercise their own control or care over the process, they are at the complete mercy of a company willing to remotely install unproven and highly harmful updates onto their servers, apparently without prior notice. However, most of their clients would have been safe from this combination of carelessness (on CrowdStrike’s part) and helplessness (on the client’s part) if CrowdStrike had only followed the most basic safety policies during the update process.
People wait to check in at Rome Fiumicino International Airport on July 19, 2024. (Riccardo De Luca/Anadolu/Getty)
It took the company just hours to understand the problem and provide a fix. Had CrowdStrike rolled out this non-critical update in stages, as is traditional across the industry, it could have fixed the issue before a significant percentage of its customers were affected. Instead, it sent it to everyone at the same time. There is little explanation for this other than the ubiquitous “we know what’s best for you” tech company arrogance that’s so prevalent in things like the lack of a “back” button on the iPhone and the common belief that discontinued annual license programs like McAfee Antivirus have an inherent right to “pop up” on your screen forever like an electronic Barbary pirate and demand additional fees.
But the combination of carelessness, absolute power over consumers, and astonishing corporate narcissism described above is nothing new to American consumers. That’s why it took more than a dozen fatal accidents for General Motors’ “X-car” to be recalled for a rear brake repair. But this famous problem only affected a tiny fraction of the cars sold in showrooms at the time. Buyers of Ford Fairmonts needn’t have worried. (The same was true for Chevrolet Vega buyers when the Ford Pinto was recalled over fuel-system fires.) The auto industry is inherently competitive. There are numerous manufacturers eager to offer your next car.
The same was true of computing for a long time. At the turn of the century, there were as many as a dozen vendors of server operating systems, and multiple providers of almost every type of software you could imagine. Over the past two decades, the diversity of this environment has diminished at the rate of a Brazilian rainforest. Today, the majority of servers run either Microsoft Windows, which was affected by the outages, or one of a few flavors of Linux, which was not affected.
An IT outage message at Gatwick Airport in London, July 19, 2024. (Jack Taylor/Getty)
We are now dangerously close to a “monoculture” in many aspects of technology. The vast majority of cloud servers are run by Amazon, so when there is an outage like the one that occurred in the Amazon Web Services “US East” region on December 7, 2021, the impact is immediate and widespread. The combination of Windows Server and CrowdStrike Falcon is commonly used by over half of the Fortune 500 companies, so when CrowdStrike sneezes, the entire business world catches a cold.
The greatest irony is that these single points of failure are often the direct result of policies intended to make services more stable and available. A tenet of modern computing, “site reliability engineering,” calls for as many identical servers and software builds as possible; this is supposed to make them easier to maintain and sustain. In reality, it results in the entire infrastructure relying on a single piece of software that often has the disproportionate power to bring it all down.
How did the monoculture come about? You may remember the old phrase, “Nobody got fired for buying IBM.” The tech industry, famously anti-competitive, has extended this mindset to nearly every level of software and computing through a series of technology alliances and deliberate incompatibilities. CrowdStrike is a partner with Amazon Web Services, a partner with Dell, and a partner with Netskope. Buying one product in the stack encourages you to buy the others, too. So most tech leaders just do what’s easiest.
Often this means abandoning common sense altogether. For example, the Okta platform puts all of a company’s authentication in the hands of a third party, while CyberArk happily stores all of its passwords. Today, it’s completely normal for Fortune 500 companies to put all passwords, permissions, and authentication in the hands of a third party, while at the same time using complex policies and procedures to limit the power of their own technical support and system administration staff. When these third-party authentication and password providers are compromised, they are often unwilling or hesitant to disclose the problem to the customers they are supposed to protect, as was the case for both Okta and password “vault” LastPass in 2022. What did most LastPass customers do when they were betrayed? Most of them migrated en masse to Keeper or 1Password. It’s like handing your wallet to a stranger on the subway, watching that person run off with it, and concluding that your only mistake was handing your wallet to the wrong person.
Given all of this, the only surprising thing about CrowdStrike’s outage is that it took so long for something like it to happen on this scale. It will almost certainly happen again at another monocultural “choke point” and will continue to happen until companies learn the right lessons from it. Doing software development in the US with your own employees will solve a lot of the problems. Refusing to work with software providers that insist on control of the system will take care of most of the remaining problems. A little focus on diversity never hurts. In this case, we’re talking about “diversity in computing infrastructure.”
Sure, CrowdStrike is to blame for this outage. But saying so is like saying the Challenger disaster was an O-ring problem: it doesn’t convey the flaws in the systems that caused it. In this case, the lesson should be clear to all of America’s technology leaders. Most of them won’t learn it or take even the slightest steps to prevent the next outage. After all, this one is fixed. It’s in the past. There’s just one small problem: history is almost certain to repeat itself again and again.
Jack Balse was born in Brooklyn, New York and currently lives in Ohio. He is a professional-amateur race car driver, a former columnist for Road and Track and Hagerty magazines, and author of the Avoidable Contact Forever newsletter.