An Open Letter To Monitoring/Metrics/Alerting Companies

allspaw

10 years ago

I’d like to open up a dialogue with companies who are selling X-As-A-Service products that are focused on assisting operations and development teams in tracking the health and performance of their software systems.

Note: It’s likely my suggestions below are already understood and embraced by many companies. I know a number of them that pay attention to all the areas I would want them to, and/or make sure they’re not making claims about their product that aren’t genuine.

Anomaly detection is important. It can’t be overlooked. We as a discipline need to pay attention to it, and continually get better at it.

But for the companies whose value-add selling point(s) are:

“our product will tell you when things are going wrong” and/or
“our product will automatically fix things when it finds something is wrong”

the implication is these things will somehow relieve the engineer from thinking or doing anything about those activities, so they can focus on more ‘important’ things. “Well-designed automation will keep people from having to do tedious work”, the cartoon-like salesman says.

Please stop doing this. It’s a lie in the form of marketing material and it’s a huge boondoggle that distracts us from what we should be working on, which is to augment and assist people in solving problems.

Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.

My suggestion is to first acknowledge this (that detecting anomalies perfectly, at the right time, is not possible) when you talk to potential customers. Want my business? Say this up front, so we can then move on to talking about how your software will assist my team of expert humans who will always be smarter than your code.

In other words, your monitoring software should take the Tony Stark approach, not the WOPR/HAL9000 approach.

These are things I’d like to know about how you thought about your product:

Tell me about how you used qualitative research in developing your product.
Tell me about how you observed actual engineers in their natural habitat, in the real world, as they detected and responded to anomalies that arose.
Show me your findings from when you had actual UX/UI professionals consider carefully how the interfaces of your product should be designed.
Demonstrate to me that the people designing your product have actually been on-call and have experience with the scenario where they needed to understand what the hell was going on and had no idea where to start looking, all under time and consequence pressure.
Show me that the people who are building your product take as a first design principle that outages and other “untoward” events are handled not by a lone engineer, but more often than not by a team of engineers, all with their different expertise and focus of attention. Successful response depends not just on anomaly detection, but on how the team shares the observations they are making amongst each other in order to come up with actions to take.

Stop thinking you’re trying to solve a troubleshooting problem; you’re not.

The world you’re trying to sell to is in the business of dynamic fault management. This means that quite often you can’t just take a component out of service and investigate what’s wrong with it. It means diagnosis involves testing hypotheses that could actually make things a lot worse than they already are. This means that phases of responding to issues have overlapping concerns all at the same time. Things like:

I don’t know what is going on.
I have a guess about what is going on, but I’m not sure, and I don’t know how to confirm it.
Because of what Sue and Alice said, and what I see, I think what is going on is X.
Since we think X is happening, I think we should do Y.
Is there a chance that Y will make things worse?
If we don’t know what’s happening with N, can we do M so things don’t get worse, or we can buy time to figure out what to do about N?
Do we think this thing (that we have no clue about) is changing for the better or the worse?
etc.

Instead of telling me about how your software will solve problems, show me you’re trying to build a product that is going to join my team as an awesome team member, because I’m going to think about using/buying your service in the same way I think about hiring.

Sincerely,

John Allspaw