Let’s play the blame game
Blame it on Cain
Don't blame it on me
Oh, oh, oh, it's nobody's fault
But we need somebody to burn
~Elvis Costello, Blame It on Cain
You might have heard that large swaths of the internet went down on Monday. The culprit was a defect in the DNS automation system that manages DynamoDB endpoints in Amazon's US-EAST-1 region. Because DNS translates web addresses into server locations, the failure prevented software from finding the addresses of the AWS database servers that thousands of apps rely on. If DNS can't do its job…well… as we learned, sites go crashing down. It wasn't quite as bad as CrowdStrike's faulty update last year that caused worldwide IT chaos, but it was pretty terrible nonetheless.
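For those who want to see what that looks like from an application's point of view, here's a minimal sketch. The hostname is the real regional DynamoDB endpoint, but everything else is just illustration: the first step of any request is resolving that name, and during the outage that step simply failed.

```python
import socket

# The regional DynamoDB endpoint that applications in US-EAST-1 talk to.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # Step one of any HTTPS request: turn the hostname into IP addresses.
    addresses = socket.getaddrinfo(ENDPOINT, 443)
    print(f"Resolved {ENDPOINT} to {len(addresses)} address(es)")
except socket.gaierror as err:
    # Roughly what apps saw during the outage: the name wouldn't resolve,
    # so no connection to the database could be made at all.
    print(f"DNS lookup failed for {ENDPOINT}: {err}")
```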
Bad enough that I reckon the AWS SREs (Site Reliability Engineers) charged with keeping the system up were playing the blame game, at least in their minds, even as they scrambled to resolve the problem. Andy Jassy and the rest of the C-Suite must have been seething. An outage like this is embarrassing. An outage that lasted almost 15 brutal hours is devastating.
It affected airlines (and might have been the cause of my extensive travel delays on Monday, which resulted in me canceling my trip), along with sites like Snapchat, Reddit, Zoom and Disney+, as well as banks and a whole host of other services we probably never even heard about, including a slew of startups. Through a quick informal survey of boldstart's portfolio companies, I learned that everyone on AWS was affected to at least some extent.
So who's to blame here? When systems fail, there is usually a post-mortem, and AWS has already explained the root cause and its prevention plan. Instead of pointing fingers, learning from the incident and trying to prevent similar ones is probably the best way forward for both customers and Amazon.
Stuff’s gonna happen
As any IT pro can tell you, when you're running a complex system, stuff happens, and probably more often than you think. The difference is that these days, instead of taking down your company's email server, a failure can take down a good percentage of the internet. It shows just how interconnected today's systems are, and how dependent they have become on centralized cloud vendors.
Back in 2017, AWS had another outage at US-EAST-1. That one involved S3 storage, and like this week's incident, it caused a number of sites to go down. The difference was that it was resolved in four hours instead of fifteen, a much faster outcome, even though in the moment it felt like a big deal.

In a TechCrunch article I wrote at the time, Ben Kepes, a cloud computing analyst and commentator, explained that it was part and parcel of buying a complex set of services. These types of incidents inevitably happen, and what he said 8 years ago is just as applicable today.
“If anything, the outage showed just how many third parties rely on AWS for their infrastructure. The reality, as unpalatable as it sounds, is that failures happen from time to time and organizations need to plan for that failure,” he said.
Preventing future incidents instead of placing blame
AWS took a number of steps to prevent a recurrence, which you can read about in the company blog post if you want to get deep into the technical weeds, but suffice it to say, they are building protections into the DNS automation system to keep this from happening again.
People who deal with disasters for a living don't find playing the blame game particularly useful. They'd rather discover the causes and take actions that can, to the extent possible, prevent lightning from striking twice. Nora Jones, founder of Jeli (an incident management startup in the boldstart portfolio that was acquired by PagerDuty in 2023), shares this philosophy. In a 2022 TechCrunch interview, Jones said Jeli's approach isn't about assigning blame or firing people, which she sees as counterproductive: “Instead of asking ‘Who did this?’ our platform asks, ‘How was this possible?’ Because if one person could make that mistake, another likely will too,” Jones explained.

But there's another side to this story: what customers can realistically do to blunt the impact of an outage like this. Some sites stayed up during the incident. What did they do differently? Adrian Cockcroft, an IT veteran and former AWS employee, says the difference was that those companies had a working failover plan. That means they had set up a way for their services to fail over to another Amazon region as soon as their primary one ran into trouble.
“It’s possible to keep running when a cloud region goes down. As far as I can tell Netflix and Capital One stayed up through the outage(s) today,” Cockcroft wrote in a LinkedIn post.
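To make that a little more concrete, here's a minimal sketch of one way a team might approach it, assuming the data is already replicated to a second region (say, with DynamoDB global tables). The table name and regions are placeholders for illustration, not a description of how Netflix or Capital One actually do it.

```python
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ConnectTimeoutError, EndpointConnectionError

# Hypothetical setup: the same table is replicated to a second region,
# for example via DynamoDB global tables.
PRIMARY_REGION = "us-east-1"
FAILOVER_REGION = "us-west-2"
TABLE_NAME = "orders"  # placeholder name for illustration

def _client(region):
    # Keep timeouts and retries short so a regional failure surfaces quickly.
    return boto3.client(
        "dynamodb",
        region_name=region,
        config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
    )

def get_item_with_failover(key):
    """Try the primary region first, then fall back to the secondary."""
    for region in (PRIMARY_REGION, FAILOVER_REGION):
        try:
            resp = _client(region).get_item(TableName=TABLE_NAME, Key=key)
            return resp.get("Item")
        except (EndpointConnectionError, ConnectTimeoutError, ClientError):
            continue  # this region unreachable or erroring; try the next one
    raise RuntimeError("All configured regions failed")

# Usage: get_item_with_failover({"order_id": {"S": "12345"}})
```

The snippet is the easy part, of course. The real work is everything around it: replicating the data, testing the failover path regularly, and making sure the rest of the stack can follow when a region goes dark.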
One thing we can be sure of is that no matter how well we plan, there will always be a unique set of circumstances, or even just plain human error, that can bring systems down. When the system that fails happens to run a good portion of the internet, it makes it all the worse. But we can look for somebody to burn, or we can figure out what happened and try to prevent it from happening again. The latter seems like a more reasonable course of action to me.
~Ron
Featured photo by Giulia May on Unsplash