I really don’t think the importance of context can be overstated when it comes to troubleshooting or evaluating the health of an infrastructure. When starting to troubleshoot a complex problem, web ops 101 “best practices” usually start with asking at least these questions: When did this problem start? What changes, if any, (software,...
This is a ramble continued from before, which means it’s mostly a blog post for me, but maybe others might find it interesting. The last time I made an analogy between back-end web architectures and mechanical structures, I blathered on about what are basically structural limitations of individual components in a physical device, and how...
That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides (Operational Efficiency Hacks, Web 2.0 Expo 2009) are up on SlideShare, and here are the PDF slides. UPDATE: Gil Raphaelli has posted his Python...
People have wondered why I chose not to include much material in my book on the mathematical topics related to capacity planning, like queueing theory. There are already many other excellent books that dig into the math behind Little’s Law, M/M/1 queues, and Poisson arrival processes. These concepts do indeed detail...
Moving one of our eight photoserving farms from hardware Layer7 URL hash balancing (expensive, has limits) to L4 DSR balancing with CARP (cheap and simple), and figuring out how to juggle 18,000 requests/second while we do it. Built yet more automated query analysis reporting (with some yummy MySQLProxy). Added yet another aggregated graph of...
Looks like I’m gonna talk about even more nerdy things at the Web2.0 Expo in April. You don’t have to wait for a recession to tighten up your operations. Squeezing more oomph out of your servers (or instances!) is always a good thing, and streamlining how you handle site issues is too. We’ll talk...
(This is Part 1. Part 2 is here.) I don’t blog much, and when I do, the posts are pretty short and to the point. This post is different: feel free to put it into the “ramble” category. I’m really just posting it here for myself as a thought exercise. Some years ago, while drawing a network...
Like lots of operations people, we’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time. We’ve blogged about some of how and why we do it. One thing we’re in the habit of is screenshotting these graphs when things go wrong, right,...
The CFP for next year’s Velocity Conference is up now, so all you ops and performance ninjas submit your ideas for talks. I’m lucky enough to be on the program committee this year, and I think the conference is a huge opportunity to spread the ops love on all kinds of topics. There’s a list...
Gil Raphaelli, one of the guys on our Flickr Ops team, put together a Code Swarm animation for the configuration/deployment management tool we use at Flickr to manage our infrastructure. Myles Grant did this for our bug reporting system as well. Check it out: Our automated config management system is called Gemstone, but conceptually you...