That was a pretty good time. Saw lots of good and wicked smaht people, and I got a lot of great questions after my talk. The slides are up on slideshare, and here are the PDF slides.
Operational Efficiency Hacks Web20 Expo2009
UPDATE: Gil Raphaelli has posted the Python bindings he wrote for libyahoo2, which we use in our Ops IM Bot.
There was something that I left out of my slides, mostly because I didn’t want to distract from the main topic, which was optimization and efficiencies.
While I used our image processing capacity at Flickr as an example of how compilers and hardware can have some significant influence on how fast or efficient you can run, I had wondered what the Magical Cloud™ would do with these differences.
So I took the tests I ran on our own machines and ran them on Small, Medium, Large, Extra Large, and Extra Large(High) instances of EC2, to see. The results were a bit surprising to me, but I’m sure not surprising to anyone who uses EC2 with any significant amount of CPU demand.
For the testing, I have a script that does some super simple image resizing with GraphicsMagick. It splits a DSLR photo into 6 different sizes, much in the same way that we do at Flickr for the real world. It does that resizing on about 7 different files, and I timed them all. This is with the most recent version of GraphicsMagick, 1.3.5, with the awesome OpenMP bits in it.
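For anyone who wants to run something similar, here is a minimal sketch of that kind of timing script. It is not the script I used: the six sizes, the file naming, and the `resize_commands` and `time_resizes` helpers are made up for illustration, and it assumes the `gm` binary is on your PATH. Setting `OMP_NUM_THREADS` is the usual way to cap OpenMP threads if you want to compare thread counts.

```python
import os
import subprocess
import time

# Hypothetical target sizes (bounding box in pixels); not Flickr's real size list.
SIZES = [75, 100, 240, 500, 1024, 1600]


def resize_commands(src, out_dir, sizes=SIZES):
    """Build one `gm convert ... -resize WxH` command per target size."""
    base = os.path.splitext(os.path.basename(src))[0]
    return [
        ["gm", "convert", src, "-resize", f"{s}x{s}",
         os.path.join(out_dir, f"{base}_{s}.jpg")]
        for s in sizes
    ]


def time_resizes(sources, out_dir, threads=None, run=subprocess.run):
    """Resize every source into every size; return total wall-clock seconds.

    `run` is injectable so the loop can be exercised without GraphicsMagick.
    """
    env = dict(os.environ)
    if threads is not None:
        # Standard OpenMP environment variable limiting the thread count.
        env["OMP_NUM_THREADS"] = str(threads)
    start = time.perf_counter()
    for src in sources:
        for cmd in resize_commands(src, out_dir):
            run(cmd, check=True, env=env)
    return time.perf_counter() - start
```

Calling `time_resizes(list_of_photos, "/tmp/out", threads=4)` on each machine or instance type gives one number per run that can be compared across hosts.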
Here is the slide of the tests run on different (increasingly faster) dedicated machines:
and here is the slide that I didn’t include, of the EC2 timings of the same test:
Now I’m not suggesting that the two graphs should look similar, or that EC2 should be faster. I’m well aware of the shift in perspective when deploying capacity within the cloud versus within your own data center. So I’m not surprised that the fastest test results are on the order of 2x slower on EC2. Application logic and feature design (synchronous versus asynchronous image processing, for example) can take care of these differences, and the slowdown could be a welcome trade-off against having to run your own machines.
What I am surprised about is the variation (or lack thereof) in all but the small instances. After I took a closer look at vmstat and top, I realized that the small instances consistently saw about 50-60% of their CPU stolen, the mediums almost always saw zero stolen, and the Large and Extra Large instances saw up to 35% of their CPU stolen during the jobs.
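If you want to watch steal time yourself without eyeballing the `st` column in vmstat or top, a rough sketch like the one below reads the same counters from `/proc/stat` on Linux. The function names are mine, and it assumes a kernel that reports the steal field (the eighth value on the aggregate `cpu` line).

```python
import time


def parse_cpu_line(stat_text):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(v) for v in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def steal_percent(before_text, after_text):
    """Percentage of CPU time stolen between two /proc/stat snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [b - a for a, b in zip(after, before)]
    deltas = [-d for d in deltas]  # after minus before
    if len(deltas) < 8:
        return 0.0  # kernel does not expose a steal counter
    # Fields: user nice system idle iowait irq softirq steal.
    # guest/guest_nice are already counted inside user/nice, so stop at 8.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and report steal percent."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return steal_percent(before, after)
```

Running `sample_steal()` in a loop while the resize job is going gives a per-interval steal figure that should roughly track what vmstat shows.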
So, interesting.
If you’re expecting a comment on the tech content of your talk, it was great, as to be expected. But more importantly, the last slide in your presentation is AWESOME.
I don’t see anything in Amazon’s description of EC2 which says that it provides exclusive access to the CPUs. In fact, the same system could be running a number of VMs hosting different operating systems for different users. Perhaps the CPUs also provide processing for Amazon’s web site. Amazon suggests that EC2 uses spare computing resources. If the access is not exclusive then it can be expected that CPU caches will be reloaded and there will be more overhead. Optimum OpenMP performance depends on optimum use of CPU cache and minimal latency.
Reports I have read elsewhere (e.g. USENIX Login) found similar performance characteristics from Amazon’s EC2.
I don’t know if GCC/GOMP always uses all of the cores available, or if the Linux OS always offers all of the cores. Perhaps there is an upper limit to the default. Some operating systems require administrative action before a user has access to more cores.
You mentioned some self-healing scripts in your talk @ Web 2 expo. Do you think you might be able to post some of those (mysql if possible)?
Great talk btw
Rehan: I’m planning on it, for sure…just need to make sure that The Mothership™ Yahoo is ok with me posting them. 🙂