I thought it might be worth digging in a bit deeper on something that I mentioned in the Advanced Postmortem Fu talk I gave at last year’s Velocity conference. For complex socio-technical systems (web engineering and operations) there is a myth that deserves to be busted, and that is the assumption that for outages and...
While searching around for something else, I came across this note I sent in late 2009 to the executive leadership of Yahoo’s Engineering organization. This was when I was leaving Flickr to work at Etsy. My intent in sending it was to be open to the rest of Yahoo about how things worked at...
In yet another post where I point to a paper written from the perspective of another field of engineering about a topic that I think is inherently mappable to the web engineering world, I’ll at least give a summary. 🙂 Every time someone on-call gets an alert, they should always be thinking along these lines:...
Ben Rockwood said something last December about the re-emergence of the Systems Engineer and I agree with him, 100%. To add to that, I’d like to quote the excellent NASA Systems Engineering handbook’s introduction. The emphasis is mine: Systems engineering is a methodical, disciplined approach for the design, realization, technical management, operations, and retirement of...
This little ramble of thoughts is related to my talk at Velocity coming up, but I know I’ll never get to this part at the conference, so I figured I’d post about it here. Building resilience from a systems point of view means (amongst other things) understanding how your organization deals with failure and unexpected...
I’ve been drafting this post for a really long time. Like most posts, it’s largely for me to get some thoughts down. It’s also very related to the topic I’ll be talking about at Velocity later this year. When I gave a keynote talk at the Surge Conference last year, I talked about how our...
Etsy’s Chef Repo, 2010 from jspaw on Vimeo. Delicious InfoViz courtesy of Gource....
UPDATE, 10/17/2017: This post hasn’t aged well, and needs some patching. The title should be “TTR is more important than TBF (for most types of F)” Why? Because taking the statistical mean of TTR or TBF makes absolutely no sense, whatsoever. Incidents and events simply are not comparable in that way, and even if they were, the time...
Last month I had the honor of speaking at the Surge Conference in Baltimore, put together by OmniTI. It was a most excellent conference, and the expertise levels were ridiculously high. I count myself lucky to be considered in the same league as the rest of the presenters. I did give a Keynote talk, and I...
Protip: if you’re getting Nagios alerts on an iPhone, and you have your contact set as: xxx-xxx-xxxx@txt.att.net, you’ll get messages from a ‘sender’ that looks like: “1 (410) 000-173”. This is not someone in Maryland, it’s a special address so that AT&T can route a reply back to the sender if need be. The side...