Like lots of operations people, we’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time. We’ve blogged about some of how and why we do it.
One thing we’re in the habit of is screenshotting these graphs when things go wrong, right, or indifferent, and adding them to a group on Flickr. I’ve decided to make a public group for these sorts of screenshots, for anyone to contribute to:
Before posting anything here, think about whether you want everyone in the world to see what you’ve got. I’ve made a quick FAQ on the group’s page, but I’ll repeat it here:
Q: What is this?
A: This group is for sharing visualizations of web operations metrics. For the most part, this means graphs of systems and application metrics, from software like ganglia, cacti, hyperic, etc.
Q: Who gets to see this?
A: This is a semi-public group, so don’t post anything you don’t want others to see.
For now, only members can post and view. Ideally, I think it’d be great to share some of these things publicly.
Q: What’s interesting to post here?
A: Spikes, dips, patterns. Things with colors. Shiny things. Donuts. Ponies.
Q: My company will fire me if I show our metrics!
A: Don’t be dense: don’t post your pageviews, revenue, or other super-secret stuff that you think would be sensitive. Your mileage may vary.
So: you’ve got something to brag about? How many requests per second can your awesome new solid-state-disk database do? You got spikes? Post them!
The CFP for next year’s Velocity Conference is up now, so all you ops and performance ninjas submit your ideas for talks.
I’m lucky enough to be on the program committee this year, and I think the conference is a huge opportunity to spread the ops love on all kinds of topics. There’s a list on the O’Reilly page to get you thinking about what might make for a good submission:
- How to tie web performance and operations to the bottom line
- Real-world incident management – getting “tight like a pit crew”
- Making websites as fast and reliable as desktop apps
- Networking, DNS, and load balancing
- Profiling’s not just on the backend: JavaScript, CSS, and the network
- Managing web services – flaming disasters you survived and lessons learned
- The intersection between performance and design
- Wicked cool (and actionable) metrics
- Ads, ads, ads – the performance killer?
- Troubleshooting in production
- How to scale and be fast on the social web
- Capacity planning and load testing
- Establishing performance and operations best practices within your organization
- Configuration management best (and worst) tools and practices
- Monitoring and instrumentation: Open Source, as a service, commercially supported solutions
- Using multiple CDNs to improve customer experience and reduce cost
Think for a minute: Do you have a bunch of sweet ops hacks that you’re really proud of? Do you and your dev teams collaborate on making things easy to manage? Do you face unique challenges that other ops folks can learn from?
Gil Raphaelli, one of the guys on our Flickr Ops team, put together a Code Swarm animation for the configuration/deployment management tool we use at Flickr to manage our infrastructure. Myles Grant did this for our bug reporting system as well. Check it out:
Our automated config management system is called Gemstone, but conceptually you can think of it as a pretty extensible SystemImager/Puppet/cfengine-style system. In the animation, the dots are changes made by the ops person shown. The legend is:
- transforms: this is what cluster should have what packages, files, actionable scripts, etc.
- raw: these are actual files, like apache/memcached/squid configs, which get munged depending on what cluster they might be in
- conf: this is what boxes/clusters are subsets or supersets of which clusters
- code: ops-written tools/utilities
- Misc: stuff that doesn’t fit into the above.
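To make the "conf" and "transforms" ideas in the legend a bit more concrete, here is a rough Python sketch of how cluster subsets and supersets might combine with per-cluster package and file lists. This is not Gemstone's actual format or code: the cluster names, the `CLUSTER_PARENTS` and `TRANSFORMS` tables, and the `lineage` and `resolve` helpers are all made up for illustration.

```python
# Hypothetical sketch only; none of these names come from Gemstone itself.

# "conf"-style data: which clusters each cluster is a subset of.
CLUSTER_PARENTS = {
    "all": [],
    "web": ["all"],
    "web-photos": ["web"],
    "db": ["all"],
}

# "transforms"-style data: what each cluster should have installed.
TRANSFORMS = {
    "all": {"packages": {"ntp"}, "files": {"/etc/resolv.conf"}},
    "web": {"packages": {"apache"}, "files": {"/etc/httpd/httpd.conf"}},
    "web-photos": {"packages": {"squid"}, "files": set()},
    "db": {"packages": {"mysql"}, "files": {"/etc/my.cnf"}},
}


def lineage(cluster):
    """Return the cluster plus all of its supersets, most general first."""
    seen = []

    def visit(name):
        # Walk up to the broadest superset before recording this cluster.
        for parent in CLUSTER_PARENTS[name]:
            visit(parent)
        if name not in seen:
            seen.append(name)

    visit(cluster)
    return seen


def resolve(cluster):
    """Union the packages and files of a cluster and every superset."""
    packages, files = set(), set()
    for name in lineage(cluster):
        transform = TRANSFORMS.get(name, {})
        packages |= transform.get("packages", set())
        files |= transform.get("files", set())
    return {"packages": packages, "files": files}


if __name__ == "__main__":
    print(lineage("web-photos"))
    print(sorted(resolve("web-photos")["packages"]))
```

In this toy version a box in `web-photos` inherits everything defined for `web` and `all`, which is one plausible reading of "subsets or supersets"; a real system would also need to handle overrides and per-cluster munging of the raw files.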
Also, it looks like the GM devs are working on getting OpenMP (parallelism) put into GM processing, which will be a huge boon for multicore boxes. Yay!
It’s hard to describe how tiring it is to hear someone quote Donald Knuth (or Tony Hoare) in the wrong context. I’m not the only one annoyed by this. In “Structured Programming with go to Statements”, Knuth says:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
After having read Knuth’s paper containing this quote, I can agree that it’s certainly a brilliant piece of advice in the context of programming. What is irritating to me is the blanket application of this pearl of wisdom to anything that has to do with computers, especially systems performance, web operations and architecture decisions.
For the record: I firmly believe in these principles:
Done >= perfect.
Don’t waste time building elaborate simulations for what the future might bring to your capacity.
Performance tuning is better left outside the capacity planning process.