A Time to Remember
I want you to think back to a time when you found yourself in an emergency situation at work.
Maybe it was diagnosing and trying to recover from a site outage.
Maybe it was when you were confronting the uncertain possibility of critical data loss.
Maybe it was when you and your team were responding to a targeted and malicious attack.
Maybe it was a time when, mere milliseconds after you triggered some action (maybe just hitting “enter” after a command), you realized that you had made a terrible mistake and inadvertently kicked off destruction that could not be undone.
Maybe it was a shocking discovery that something bad (silent data corruption, for example) had been happening for a long time, and no one knew.
Maybe it was a time when silence descended upon your team as they tried to understand what was happening to the site, and to the business. The time when you didn’t even know what was going on, let alone have a hypothesis for how to fix it.
Think back to the time when you had to actively hold back the fears of what the news headlines or management or your board of directors were going to say about it when it was over, because you had a job to do and worrying about those things wouldn’t bring the site back up.
Think back to a time when, after you’d resolved an outage and the dust had settled, your adrenaline turned its focus to amplifying the fear that you and your team would have no idea when it might happen again, because you were still uncertain how it happened in the first place.
I’ve been working in web operations for over 15 years, and I can describe in excruciating detail examples of many of those situations. Many of my peers can tell stories like those, and often do. I’m willing to bet that you too, dear reader, will find these to be familiar feelings.
Those moments are real.
The cortisol coursing through your body during those times was real.
The effect of time pressure is real.
The problems that show up when you have to make critical decisions based on incredibly incomplete information are real.
The issues that show up when having to communicate effectively across multiple team members, sometimes separated by time (as people enter and exit the response to an outage) as well as distance (connected through chat or audio/video conferencing) are all real.
The issues that arise when coordinating who is going to do what, when they’re going to do it, and confirming that whatever they did went well enough for someone else to do their part next are all real.
And they all are real regardless of the outcomes of the scenarios.
Comparisons
Those moments do happen in other domains. Other domains like healthcare, where nurses work in neonatal intensive care units.
Like infantry in battle.
Like ground control in a mission control organization.
Like a regional railway control center.
Like a trauma surgeon in an operating room.
Like an air traffic controller.
Like a pilot, just flying.
Like a wildland firefighting hotshot crew.
Like a ship crew.
Like a software engineer working in a high-frequency trading company.
All of those domains (and many others) have these in common:
- They need to make decisions and take action under time pressure and with incomplete information, when the results have just as much potential to make things worse as they do to make things better.
- They have to communicate a lot of information and coordinate actions between teams and team members in the shortest time possible, while also not missing critical details.
- They all work in areas where small changes can bring about large results whose potential for surprising everyone is quite high.
- They all work in organizations whose cultural, social, hierarchical, and decision-making norms are influenced by past successes and failures, many of which manifest in these high-tempo scenarios.
But: do the people in those domains experience those moments differently?
In other words: does a nurse or air traffic controller’s experience in those real moments differ from ours, because lives are at stake?
Do they experience more stress? Different stress?
Do they navigate alerts, alarms, and computers in more prudent or better ways than we do?
Do they have more problems with communications and coordinating response amongst multiple team members?
Are they measurably more careful in their work because of the stakes?
Are all of their decisions perfectly clear, or do they have to navigate ambiguity sometimes, just like we do?
Because there are lives to protect, is their decision-making in high-tempo scenarios different? Better?
My assertion is that high-tempo/high-consequence scenarios in the domain of Internet engineering and operations do indeed have similarities with those other domains, and that understanding all of those dynamics, pitfalls, learning opportunities, etc. is critical for the future.
All of the future.
Do these scenarios yield the same results, organizationally, in those domains as they do in web engineering and operations? Likely not. But I will add that unless we attempt to understand those similarities and differences, we’re not going to know what to learn from, and what to discard.
Hrm. Really?
Because how can we compare something like the experience of the Site Reliability Engineering team at Google.com to something like that of the air traffic control crew landing airplanes at Heathrow?
I have two responses to this question.
The first is that we’re assuming that the potential severity of the consequences influences the way people (and teams of people) think, act, and behave under those conditions. Research on how people behave under uncertain conditions and in escalating scenarios does indeed yield generalizable findings across many domains.
The second is that in trivializing the comparison to loss of life versus non-loss of life, we can underestimate the nth-order effects that the Internet can have on geopolitical, economic, and other areas that are further away from servers and network cables. We would be too reductionist in our thinking. The Internet is not just about photos of cats. It bolsters elections in emerging democracies, revolutions, and a whole host of other things that prove to be life-critical.
A View From Not Too Far Away
At the Velocity Conference in 2012, Dr. Richard Cook (an anesthesiologist and one of the most forward-thinking men I know in these areas), was interviewed after his keynote by Mac Slocum, from O’Reilly.
Mac, hoping to contrast Cook’s normal audience with Velocity’s, asked whether he saw crossover from the “safety-critical” domains to web operations:
Cook: “Anytime you find a world in which you have high consequences, high tempo, time pressure, and lots of complexity, semantic complexity, underlying deep complexity, and people are called upon to manage that you’re going to have these kinds of issues arise. And the general model that we have is one for systems, not for specific instances of systems. So I kind of expected that it would work…”
Mac: “…obviously failure in the health care world is different than failure in the [web operations] world. What is the right way to address failure, the appropriate way to address failure? Because obviously you shouldn’t have people in this space who are assigning the same level of importance to failure as you would?”
Cook: “You really think so?”
Mac: “Well, if a computer goes down, that’s one thing.”
Cook: “If you lose $300 to $400 million dollars, don’t you think that would buy a lot of vaccines?”
Mac: “[laughs] well, that’s true.”
Cook: “Look, the fact that it appears to be dramatic because we’re in the operating room or the intensive care unit doesn’t change the importance of what people are doing. That’s a consequence of being close to and seeing things in a dramatic fashion. But what’s happening here? This is the lifeblood of commerce. This is the core of the economic engine that we’re now experiencing. You think that’s not important?”
Mac: “So it’s ok then, to assign deep importance to this work?”
Cook: “Yeah, I think the big question will be whether or not we are actually able to conclude that healthcare’s importance measures up to the importance of web ops, not the other way around.”
Richard further mentioned in his keynote last year at New York’s Velocity that:
“…web applications have a tendency to become business critical applications, and business-critical applications have a tendency to become safety-critical systems.”
And yes, software bugs have killed people.
When I began my studies at Lund University, I was joined by practitioners in many of those domains: air traffic control, aviation, wildland fire, child welfare services, mining, oil and gas industry, submarine safety, and maritime accident investigation.
I will admit that at the first learning lab of my course, I mentioned that I felt like a bit of an outsider (or at least a cheater, getting away with failures that don’t kill people), and one of my classmates responded:
“John, why do you think that understanding these scenarios and potentially improving upon them has anything to do with body count? Do you think that our organizations are influenced more by body count than commercial and economic influences? Complex failures don’t care about how many dollars or bodies you will lose — they are equal opportunists.”
I now understand this.
So don’t be fooled into thinking that those human moments at the beginning of this post are any different in other domains, or that our responsibility to understand complex system failures is less important in web engineering and operations than it is elsewhere.
Comments
Con: I’ve reached similar conclusions about the similarity between protecting information systems infrastructure and the work of other high-reliability organizations. I’d be happy to send you a briefing for the CIO of the U.S. Department of the Navy or a list of references if you’d find that useful.
John: Hey Con: Thanks for the comment! Yes, I’d love to see that! My email address is my last name at Google mail’s domain name. 🙂