Nagios alerts on the iPhone – deleting boatloads

Protip: if you’re getting Nagios alerts on an iPhone, and you have your contact set as: xxx-xxx-xxxx@txt.att.net, you’ll get messages from a ‘sender’ that looks like: “1 (410) 000-173”. This is not someone in Maryland; it’s a special address so that AT&T can route a reply back to the sender if need be. The side...

The new book: Web Operations

At the Velocity Conference last year, I was talking to Mike Loukides from O’Reilly about the topics being presented and how it was so great to see such successful veterans of the field come out from behind the curtain and share their experiences. Mike said that there was interest in doing a book on the...

How Complex Systems Fail: A WebOps Perspective

I guess I’m late on getting to this, but How Complex Systems Fail by Richard Cook is excellent. Let me start with this: I don’t think I can overstate how right-on this paper is, with respect to the challenges, solutions, observations, and concerns involved with operating a medium to large web infrastructure. I found this...

Meanwhile: More Meta-Metrics

Like all sane web organizations, we gather metrics about our infrastructure and applications. As many metrics as we can, as often as we can. These metrics, given the right context, help us figure out all sorts of things about our application, infrastructure, processes, and business. Things such as… What: …did we do before (historical trending,...

Some Things We Did Today

Moving one of our eight photoserving farms from hardware Layer7 URL hash balancing (expensive, has limits) to L4 DSR balancing with CARP (cheap and simple) and figuring out how to juggle 18,000 requests/second while we do it. Built yet some more automated query analysis reporting (with some yummy MySQLProxy). Added yet another aggregated graph of...