This is yet another post where I point to a paper from another field of engineering on a topic that I think maps directly onto the web engineering world. This time, I’ll at least give a summary. 🙂
Every time someone on-call gets an alert, they should always be thinking along these lines:
- Does fixing this really require me to wake up or pause the movie I’m watching?
- Can this really not wait until the morning, during office hours?
If the answer to both is yes, then excellent: the machines alerted a human to something that only a human could diagnose or fix. There was nothing that the software could have done to rectify the situation. Paging a human was justified.
But for those situations where the answer is “no,” you might (or should, anyway) think about bolstering your system’s “fault tolerance” or “fault protection.” But how many folks grok the full details of what that means? Does it mean self-healing? Does it mean isolating errors or unexpected behaviors that fall outside the bounds of normal operating circumstances? Or does it mean both, and if so, how should we approach building this tolerance and protection? The Wikipedia definitions for “fault tolerant systems” and “fault tolerant design” are a very good start on educating yourself on the concepts, but they’re fairly general in scope.
The fact is, designing web systems to be truly fault-tolerant and protective is hard. These are questions that can’t be answered solely within infrastructural bounds. Fault tolerance isn’t selective in its tiering; it has to be thought of from layer 1 of the network all the way to the browser.
Now, not every web startup is lucky enough to hire someone from NASA’s Jet Propulsion Lab who has written software for space vehicles, but we managed to convince Greg Horvath to leave JPL and join Etsy. He pointed me to an excellent paper by Robert D. Rasmussen called “GN&C Fault Protection Fundamentals.” Thankfully, it’s a lot less about Guidance, Navigation, and Control and more about fault tolerance and protection strategies, concerns, and implementations.
Some of those concerns, from the paper:
- Do not separate fault protection from normal operation of the same functions.
- Strive for function preservation, not just fault protection.
- Test systems, not fault protection; test behavior, not reflexes.
- Cleanly establish a delineation of mainline control functions from transcendent issues.
- Solve problems locally, if possible; explicitly manage broader impacts, if not.
- Respond to the situation as it is, not as it is hoped to be.
- Distinguish fault diagnosis from fault response initiation.
- Follow the path of least regret.
- Take the analysis of all contingencies to their logical conclusion.
- Never underestimate the value of operational flexibility.
- Allow for all reasonable possibilities – even the implausible ones.
The last idea there points to having the “requisite imagination” to explore, as fully as possible, the question “What could possibly go wrong?” That is really just another manifestation of one of the four cornerstones of Resilience Engineering: “Anticipation”. But that’s a topic for another post.
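To make a couple of the concerns in the list a bit more concrete for web folks, here is a rough Python sketch of how I might read “Distinguish fault diagnosis from fault response initiation” and “Solve problems locally, if possible; explicitly manage broader impacts, if not.” To be clear, this is my own illustration and not code from Rasmussen’s paper. Every name in it (`Observation`, `Diagnosis`, `diagnose`, `respond`) and the 5% error-rate threshold are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Diagnosis(Enum):
    # Hypothetical categories, chosen only for this illustration.
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # a local fix is plausible
    FAILED = "failed"      # needs a broader response


@dataclass
class Observation:
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    responding: bool   # whether the service answers at all


def diagnose(obs: Observation) -> Diagnosis:
    """Diagnosis only: classify what is observed and take no action."""
    if not obs.responding:
        return Diagnosis.FAILED
    if obs.error_rate > 0.05:  # illustrative threshold, not a recommendation
        return Diagnosis.DEGRADED
    return Diagnosis.HEALTHY


def respond(
    diagnosis: Diagnosis,
    local_fix: Callable[[], bool],
    page_human: Callable[[str], None],
) -> str:
    """Response initiation: decide what to do about a diagnosis."""
    if diagnosis is Diagnosis.HEALTHY:
        return "none"
    # Try to solve the problem locally first, but only when that is plausible.
    if diagnosis is Diagnosis.DEGRADED and local_fix():
        return "fixed_locally"
    # Otherwise manage the broader impact explicitly by escalating to a person.
    page_human(f"diagnosis={diagnosis.value}; local fix unavailable or failed")
    return "escalated"
```

Keeping `diagnose` free of side effects means you can test it against recorded observations, and you can change the response policy (say, retrying a restart before paging) without touching how faults get classified. A call might look like `respond(diagnose(obs), restart_worker, send_page)`, where `restart_worker` and `send_page` are whatever hypothetical hooks your own tooling provides.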
Here’s Rasmussen’s paper. Please go and read it. If you don’t, you’re totally missing out and not keeping up!