I really don’t think the importance of context can be overstated when it comes to troubleshooting or evaluating the health of an infrastructure. When starting to troubleshoot a complex problem, web ops 101 “best practices” usually start with asking at least these questions:
- When did this problem start?
- What changes, if any (software, hardware, usage, environmental, etc.), were made just prior to the start of the problem?
The context surrounding these problem events is pretty damn critical to figuring out what the hell is going on.
Most monitoring systems are based around the idea that you want to know if a particular metric is above (or sometimes below) a certain threshold, and have ‘warning’ or ‘critical’ values that represent what is going bad or already bad. When these alarms go off, knowing how and when they got there is really important to your troubleshooting approach. This context is paramount in figuring out where to spend your time and focus.
For example: an alarm goes off because a monitor has detected that some metric has reached a critical state. Something that goes critical instantly can be quite different from something that edges into critical after sitting in a warning state for some time.
Check it out:
For this discussion, the actual metric here isn’t that important. It could be CPU on a webserver, it could be latency on a cache hit or miss on memcached/squid/varnish/etc, or it could be network bandwidth on a rack switch. The values you set for warning and critical are normally informed by how long the system can tolerate sitting in a warning state under ‘normal’ failure modes, and by whether that leaves enough wall-clock time for recovery actions to take place before things reach critical.
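To make that concrete, here’s a minimal sketch of a threshold check that remembers how long a metric sat in warning before it tripped critical — the context that tells you whether an alarm edged in or jumped in. The WARNING/CRITICAL values and the `(timestamp, value)` sample format are made up for illustration; this isn’t any particular monitoring tool’s logic.

```python
# Hypothetical thresholds -- in practice these come from how long the
# system can tolerate sitting in "warning" before recovery must happen.
WARNING = 75.0   # e.g. percent CPU
CRITICAL = 90.0

def classify(value):
    """Map a raw metric sample to a state name."""
    if value >= CRITICAL:
        return "critical"
    if value >= WARNING:
        return "warning"
    return "ok"

def watch(samples):
    """Walk (timestamp, value) samples and report not just the state,
    but how long we sat in 'warning' before going critical."""
    warning_since = None
    for ts, value in samples:
        state = classify(value)
        if state == "warning" and warning_since is None:
            warning_since = ts
        elif state == "critical":
            dwell = (ts - warning_since) if warning_since is not None else 0
            print(f"{ts}: CRITICAL ({value}) after {dwell}s in warning")
            warning_since = None
        elif state == "ok":
            warning_since = None
```

A metric that reports “CRITICAL after 0s in warning” and one that reports “CRITICAL after 900s in warning” are the two scenarios above, and you’d chase them differently.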
Most people would approach these two scenarios quite differently, because of the context that time lends to the issue.
In the book, I give an example of how valuable this context is in troubleshooting interconnected systems. When metrics from different clusters or systems are laid right next to each other, significant changes in usage can be put into the right context. Cascading failures can be pretty hard to track down to begin with. Tracking them down without the big picture of the system is impossible. That graph you’re using for troubleshooting: is it showing you a cause, or a symptom?
Because context is so important, I’m a huge fan of overlaying higher-level application statistics with lower-level systems ones. This guy has a great example of it over on the Web Ops Visualization group pool:
He’s not just measuring the webserver CPU; he’s also measuring the ratio of requests per second to total CPU. This is context that can be hugely valuable. If any of the underlying resources change (faster CPUs, more caching on the back-end, application optimizations, etc.) he’ll be able to tell quickly how much benefit he gains (or loses), because he’s tracking this ratio.
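A sketch of the idea, with made-up collector inputs: the derived metric is just requests per second divided by CPU, so a change in either the hardware or the application shows up as a shift in the ratio rather than being buried in the raw CPU graph.

```python
def requests_per_cpu(req_count, interval_secs, cpu_percent):
    """Derive the ratio of request throughput to the CPU burned serving it.
    req_count, interval_secs and cpu_percent are assumed to come from
    whatever collectors you already run (e.g. webserver logs + sar)."""
    rps = req_count / interval_secs
    # Guard against idle intervals so we don't divide by zero.
    return rps / cpu_percent if cpu_percent else 0.0

# Example: 12,000 requests over a 60s window at 40% CPU
# -> 200 req/s, or 5 requests served per point of CPU.
print(requests_per_cpu(12000, 60, 40.0))
```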
At the Velocity Summit, Theo mentioned that since OmniTI started throwing metrics for all their clients into reconnoiter, they almost always plot their business metrics on top of their system metrics, because why the hell not? Even if there’s no immediate correlation, it gives their system statistics the context needed for the bigger picture, which is:
How is my infrastructure actually enabling my business?
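In that spirit, here’s a minimal sketch of what that overlay might look like. The metric names and numbers are invented (checkouts per minute standing in for whatever your business metric is), and matplotlib is just standing in for whatever graphing system you actually use.

```python
import matplotlib.pyplot as plt

# Hypothetical series on a shared time axis: system CPU alongside a
# business metric. The point isn't the numbers, it's seeing both on
# the same timeline.
minutes = list(range(10))
cpu_percent = [35, 38, 40, 55, 70, 72, 68, 50, 42, 39]
checkouts = [120, 125, 130, 180, 260, 270, 250, 170, 140, 128]

fig, ax_sys = plt.subplots()
ax_biz = ax_sys.twinx()  # second y-axis for the business metric

ax_sys.plot(minutes, cpu_percent, label="webserver CPU %")
ax_biz.plot(minutes, checkouts, linestyle="--", label="checkouts/min")

ax_sys.set_xlabel("minutes")
ax_sys.set_ylabel("CPU %")
ax_biz.set_ylabel("checkouts/min")
fig.legend(loc="upper left")
plt.show()
```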
I’ll say that gathering metrics is pretty key to running a tight ship, but seeing them in context is invaluable.