
Resilience Engineering Part II: Lenses

(this is part 2 of a series: here is part 1)

One of the challenges of building and operating complex systems is that it’s difficult to talk about one facet or component of them without bleeding the conversation into other related concerns. That’s the funky thing about complex systems and systems thinking: components come together to behave in different (sometimes surprising) ways that they never would on their own, in isolation. Everything always connects to everything else, so it’s always tempting to see connections and want to include them in discussion. I suspect James Urquhart feels the same.

So one helpful bit (I’ve found) is Erik Hollnagel’s Four Cornerstones of Resilience. I’ve been using it as a lens with which to discuss and organize my thoughts on…well, everything. I think I’ve included them in every talk, presentation, or maybe even every conversation I’ve had with anyone for the last year and a half. I’m planning on annoying every audience for the foreseeable future by including them, because I can’t seem to shake how helpful they are as a lens for viewing resilience as a concept.

Four Cornerstones of Resilience

The greatest part about the four cornerstones is that they act as a simplification device for discussion. And simplifications don’t come easily when talking about complex systems. The other bit that I like is that they make it straightforward to see how the activities and challenges in each cornerstone relate to the others.

For example: learning is traditionally punctuated by Post-Mortems, and what (hopefully) comes out of PMs? Remediation items. Tasks and facts that can aid:

I’ll leave it as an exercise to imagine how anticipation can then affect monitoring, response, and learning. The point here is that each of the four cornerstones can affect the others in all sorts of ways, and those relationships ought to be explored.

I do think it’s helpful when looking at these pieces to understand that they can exist on a large time window as well as a small one. You might be tempted to view them in the context of infrastructure attributes in outage scenarios; this would be a mistake, because it narrows the perspective to what you can immediately influence. Instead, I think there’s a lot of value in looking at the cornerstones as a lens on the larger picture.

Of course going into each one of these in detail isn’t going to happen in a single epic too-long blog post, but I thought I’d mention a couple of things that I currently think of when I’ve got this perspective in mind.

Anticipation

This is knowing what to expect, and dealing with the potential for fundamental surprises in systems. This involves what Westrum and Adamski called “Requisite Imagination”, the ability to explore the possibilities of failure (and success!) in a given system. The process of imagining scenarios in the future is a worthwhile one, and I certainly think a fun one. The skill of anticipation is one area where engineers can illustrate just how creative they are. Cheers to the engineers who can envision multiple futures and sort them based on likelihood. Drinks all around to those engineers who can take that further and explain the rationale for their sorted likelihood ratings. Whereas monitoring deals in the now, anticipation deals in the future.
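To make that a bit more concrete, here’s a minimal sketch of what sorting imagined futures by likelihood, with the rationale attached, might look like. The scenarios, the numbers, and the rationale strings are entirely hypothetical and only there to illustrate the exercise.

```python
from dataclasses import dataclass

@dataclass
class FailureScenario:
    description: str
    likelihood: float  # subjective estimate, 0.0 to 1.0
    rationale: str     # why we believe that estimate

# Hypothetical scenarios, invented for illustration only.
scenarios = [
    FailureScenario("primary database loses its replica", 0.10,
                    "replication has lagged twice this quarter"),
    FailureScenario("CDN misconfiguration serves stale assets", 0.30,
                    "config pushes are manual and unreviewed"),
    FailureScenario("payment provider API becomes unreachable", 0.05,
                    "provider has a strong uptime record"),
]

# Envision multiple futures, then sort them by likelihood, keeping the
# rationale attached so the ratings can be explained and debated.
for s in sorted(scenarios, key=lambda s: s.likelihood, reverse=True):
    print(f"{s.likelihood:.2f}  {s.description}  ({s.rationale})")
```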

At Etsy we have a couple of tools that help with anticipation:

But anticipation isn’t just about thinking along the lines of “what could possibly go wrong?” (although that is always a decent start). It’s also about the organization, and how a team behaves when interacting with the machines. Recognizing when your adaptive capacity is failing is key to anticipation. David Woods has collected some patterns of anticipation worth exploring, many of which relate to a system’s adaptive capacity:

I’m fascinated by the skill of anticipation, frankly. I spoke at Velocity Europe in Berlin last year on the topic.

Monitoring

This is knowing what to look for, and dealing with the critical in systems. Not just the mechanics of servers and networks and applications, but monitoring in the organizational sense. Anomaly detection and metrics collection and alerting are obviously part of this, and should be familiar to anyone expecting their web application to be operable.
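As a toy illustration of the mechanical side of this, here’s a minimal sketch of a simple statistical check on a metric. The metric, the sample values, and the z-score threshold are all made up for the example; a real setup would lean on whatever metrics collection and alerting stack you already run.

```python
import statistics

def looks_anomalous(samples, latest, z_threshold=3.0):
    """Flag `latest` if it sits more than z_threshold standard deviations
    away from the recent history in `samples`."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    if stdev == 0:
        return False
    return abs(latest - mean) / stdev > z_threshold

# Hypothetical request-latency samples (ms) and a suspicious new reading.
history = [102, 98, 105, 99, 101, 97, 103, 100]
latest = 260

if looks_anomalous(history, latest):
    print("ALERT: latency anomaly detected; page the on-call engineer")
```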

But in addition to this, we’re talking as well about meta-metrics on the operations and activities of both infrastructure and staff.

Things like:

Response

This is knowing what to do, and dealing with the actual in systems. Whether or not you’ve anticipated a perturbation or disturbance, as long as you can detect it, then you have something to respond to. How do you respond? Page the on-call engineer? Are you the on-call engineer? Response is fundamental to working in web operations, and differential diagnosis is just as applicable to troubleshooting complex systems as it is to diagnosing patients in medicine, where the approach originated.
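To make that analogy a bit more concrete, here’s a rough sketch of a differential-diagnosis style response loop. The candidate causes and check functions are invented purely for illustration; in practice each check would mean digging through your own dashboards, logs, and deploy history.

```python
# Hypothetical checks; each stands in for a real investigation step.
def recent_deploy_suspect():  return True
def database_saturated():     return False
def upstream_provider_down(): return False

# Rank candidate causes (roughly by likelihood and cost to check), then
# test and eliminate them one at a time, just like a differential diagnosis.
hypotheses = [
    ("bad deploy in the last hour", recent_deploy_suspect),
    ("database connection pool exhausted", database_saturated),
    ("third-party provider outage", upstream_provider_down),
]

for description, check in hypotheses:
    if check():
        print(f"Likely cause: {description}. Respond accordingly.")
        break
    print(f"Ruled out: {description}")
else:
    print("No hypothesis confirmed; widen the differential.")
```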

Pitfalls in responding to surprising behaviors in complex systems have exotic and novel characteristics. They are the things that make Post-Mortem meetings dramatic; they often include stories of surprising turns of attention, focus, and results that make troubleshooting more of a mystery than anything. In his 1980 article “On The Difficulties People Have In Dealing With Complexity”, Dietrich Dörner gave some characteristics of response in escalating scenarios. These might sound familiar to anyone who has experienced team troubleshooting during an outage (the sketch after the list illustrates the second one):

…[people] tend to neglect how processes develop over time (awareness of rates) versus assessing how things are in the moment.
…[people] have difficulty in dealing with exponential developments (hard to imagine how fast things can change, or accelerate)
…[people] tend to think in causal series (A, therefore B), as opposed to causal nets (A, therefore B and C, therefore D and E, etc.)
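That second point is the one that seems easiest to underestimate. Here’s a tiny, entirely invented simulation of how quickly an innocent-looking feedback loop (failed requests getting retried, roughly doubling load each round) can outrun intuition:

```python
# Toy model, with made-up numbers: every failed request gets retried by
# multiple clients, so offered load roughly doubles each minute.
load = 100          # requests per second to start (hypothetical)
capacity = 10_000   # what the service can actually absorb (hypothetical)

for minute in range(10):
    print(f"minute {minute}: {load} req/s")
    if load > capacity:
        print("capacity exceeded: the 'sudden' outage was building for minutes")
        break
    load *= 2       # exponential development, hard to intuit until it's too late
```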

I was lucky enough to talk a bit more in detail about Resilient Response In Complex Systems at QCon in London this past year.

Learning

This is knowing what has happened, and dealing with the factual in systems. Everyone wants to believe that their team or group or company has the ability to learn, right? A hallmark of good engineering is empirical observation that results in future behavior changes. Like I mentioned above, this is the place where Post-Mortems usually come into play. At this point I think our field ought to be familiar with Post-Mortem meetings and their general structure and goal: to glean as much information as possible about an incident, an outage, a surprising result, a mistake, etc., and to spread those observations far and wide within the organization in order to prevent them from happening again.

I’m obviously a huge fan of Post-Mortems and what they can do to improve an organization’s behavior and performance. But a lesser-known tool for learning is the “Near-Miss”: the opportunities we see in normal, everyday work. An engineer performs an action, and realizes later that it was wrong or somehow produced a surprising result. When those happen, we can hold them up high, for all to see and learn from. Did they cause damage? No; that’s why they “missed.”

One of the godfathers of cognitive engineering, James Reason, said that “near-miss” events are excellent learning opportunities for organizations, because they:

  1. Can act like safety “vaccines” for an organization, because they are just a little bit of failure that doesn’t really hurt.
  2. Happen much more often than actual systemic failures, so they provide a lot more data on latent failures.
  3. Are a powerful reminder of hazards, thereby keeping the “constant sense of unease” that is needed to provide resilience in a system.

I’ll add that encouraging engineers to share the details of their near-misses has a positive side effect on the culture of the organization. At Etsy, you will see (from time to time) an email to the whole team from an engineer that takes this form:

Dear Everybody,

This morning I went to do X, so I did Y. Boy was that a bad idea! Not only did it not do what I thought it was going to, but also it almost brought the site down because of Z, which was a surprise to me. So whatever you do, don’t make the same mistake I did. In the meantime, I’m going to see what I can do to prevent that from happening.

Love,

Joe Engineer

For one, it provides the confirmation that anyone, at any time, no matter their seniority level, can make a mistake or act on faulty assumptions. The other benefit is that it sends the message that admitting to making a mistake is acceptable and encouraged, and that people should feel safe in admitting to these sometimes embarrassing events.

This last point is so powerful that it’s hard to overemphasize. It’s related to encouraging a Just Culture, something that I wrote about recently over at Code As Craft, Etsy’s Engineering blog.

The last bit I wanted to mention about learning is purposefully not incident-related. One of the basic tenets of Resilience Engineering is that safety is not the absence of incidents and failures; it’s the presence of actions, behaviors, and culture (all along the lines of the four cornerstones above) that causes an organization to be safe. Learning only from failures means that the surface area to learn from is not all that large. To be clear, most organizations see successes much, much more often than they do failures.

One such focus might be changes. If, across 100 production deploys, you had 9 change-related outages, which should you learn from? Should you be satisfied to look at those nine, have post-mortems, and then move forward, safe in the idea that you’ve rid yourself of the badness? Or should you also look at the 91 deploys that went fine, and gather some hypotheses about why they ended up ok? You can learn from 9 events, or 91. The argument here is that you’ll be safer by learning from both.
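Here’s a trivial sketch of that idea with a made-up deploy log: partition deploys into the ones that caused trouble and the ones that didn’t, then put samples of both into the review queue and ask the same question of each, namely what conditions produced this outcome.

```python
import random

# Hypothetical deploy log: (deploy_id, caused_outage); roughly 9 of 100 go badly.
deploys = [(i, i % 11 == 0) for i in range(1, 101)]

failures  = [d for d, bad in deploys if bad]
successes = [d for d, bad in deploys if not bad]

print(f"{len(failures)} deploys caused outages: hold post-mortems for these")
print(f"{len(successes)} deploys went fine: also worth studying")

# Review every failure, plus a sample of the successes, asking the same
# question of both sets: what conditions produced this outcome?
review_queue = failures + random.sample(successes, k=10)
print("this week's review queue:", sorted(review_queue))
```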

So in addition to learning from why things go wrong, we ought to learn just as much from why things go right. Why didn’t your site go down today? Why aren’t you having an outage right now? This is likely due to a huge number of influences and reasons, all worth exploring.

….

Ok, so I lied. I didn’t expect this to be such a long post. I do find the four cornerstones to be a good lens with which to think and speak about resilience and complex systems. It’s as much a part of my vocabulary now as OODA and CAP are. 🙂
