24th April 2008

Some Comments on EC2 and Ops

posted in Technology

The following is adapted from an email I sent to the Seattle Tech Startups mailing list. The conversation started when someone asked about people’s experience with EC2. Eventually someone pointed out that it’s important to count your own time when comparing the cost of EC2 against building out your own data center, since many people don’t take into account how much effort goes into building out and running one.

However, I don’t buy the argument that running on EC2 somehow magically uses a lot less of your time than running your own data center. I’ve seen this argument many times, but frankly the time it takes to stick a couple of servers in a data center and maintain the hardware can be very low. And the time it takes to maintain EC2, write custom code to manage your instances, deal with the quirks of the environment, and build extra code to handle its failure modes can easily overwhelm the time it takes to stick a couple of boxes in a rack somewhere.

Eventually these environments might get more automatic. One of the nice things about the Google App Engine stuff is that, as far as I can tell, it’s a heck of a lot more self-managing than EC2. But for right now, either way you are going to spend a bunch of time on ops, and the hardware aspect of it will be the low-order bits…

Just as an example: I stuck a server in a data center in December 2006. I haven’t physically touched it since. It took me a few hours to set up, and since then I’ve been lucky and it’s been trouble-free. Sure, I have to keep the OS updated (er, all 3 of them, since I’m running VMs on it), but the same is true for your EC2 images. Since I installed it, my server has given me the equivalent of $4000 worth of EC2 time, and it cost me about $3000 ($1600 + ~$1200 in hosting fees). Because it’s running a really simple configuration and I have direct control over it, it’s saved me far more than that in development time.
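To make the arithmetic in that example explicit, here is a minimal sketch in Python. The numbers are the ones quoted above; the variable and function names are made up for illustration, and reading the $1600 as the one-time cost of the server is an assumption.

```python
# Figures quoted in the post; names are illustrative only.
FIRST_COST = 1600       # assumed to be the one-time server purchase
HOSTING_FEES = 1200     # approximate hosting fees to date ("~$1200")
EC2_EQUIVALENT = 4000   # what the same usage would have cost on EC2


def owned_total(first_cost, hosting_fees):
    """Total spent so far on the self-hosted server."""
    return first_cost + hosting_fees


def savings(ec2_equivalent, owned):
    """Difference between the EC2-equivalent bill and the owned cost."""
    return ec2_equivalent - owned


total = owned_total(FIRST_COST, HOSTING_FEES)
print(total)                            # 2800
print(savings(EC2_EQUIVALENT, total))   # 1200
print(total / EC2_EQUIVALENT)           # 0.7
```

The exact sum is $2800, which rounds to the "about $3000" figure, so the box cost roughly 70% of the equivalent EC2 bill before counting any development time.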

Sure, you say, but I’m going to be huge and will need to manage 20+ servers. However, at that point you are going to need someone to deal with ops. 20 servers in EC2 aren’t going to manage themselves any more than 20 will in a data center. Now granted, in EC2 you can scale up from 1 instance to 60 (the equivalent of the 20 servers) in an instant by just kicking off those instances, right? But is your software really designed so you can just duplicate your machine image 60 times and have it work right? If I need more machines in my data center, I can call up Dell and have them installed in a week. Sure, it’s not overnight, but unless you hit the lottery and get an unexpected extreme traffic spike, it will usually work out. Even with 20 machines (this is based on some recent experience at DeepRockDrive), the hardware part of the setup is likely only a couple of hours to unpack all the deliveries from Dell, plug them into some good power, run some network cables, and flip on all the switches. Set things up right, and from that point you just create a single OS image and copy that image to all of the machines. In other words, once the physical hardware is plugged in, the software part works pretty much the same as EC2 does. (Disclaimer: there are lots of more complicated and time-consuming ways to do this, so I’m sure many folks have examples of times it took weeks. But modern OS imaging, if you know what you are doing, is very cool and works great.)

As part of a recent project I did some cost calculations. Factoring only raw CPU into the equation, EC2 can be cost effective if your daily traffic pattern exceeds an 8-to-1 ratio. In other words, if you expect that you will need ~8 instances for 3 hours a day and can get by with just 1 instance for the other 21 hours, it can be cost effective. Those are some pretty extreme peaks. Keep in mind that if you actually need 2 instances for those other 21 hours, the peak would have to be 16 to be justified. Of course, that is assuming that scaling is just a matter of CPU. Depending on how you build your site, your scale-limiting factor might be something like your MySQL database, in which case the largest EC2 instance is still not as big as what you can buy with commodity hardware (even in a 1U form factor), so you just have an upper limit with EC2. The biggest EC2 instance, the “extra large”, provides 8 EC2 “compute units”, while an 8-core Xeon system you can configure from Dell for less than $3000 should be able to provide the equivalent of 24 “compute units” in a single machine.
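The peak-versus-baseline reasoning can be sketched in a few lines of Python. This is not the original calculation, whose price inputs are not given here. It is a simplified model that assumes capacity comes in interchangeable units, that owned hardware has to be sized for the peak and paid for around the clock, and that bandwidth is ignored. All function names are hypothetical.

```python
def rented_hours(peak, baseline, peak_hours=3, day_hours=24):
    """Instance-hours billed per day when you pay only for what is running."""
    return peak * peak_hours + baseline * (day_hours - peak_hours)


def owned_hours(peak, day_hours=24):
    """Machine-hours per day when you own enough hardware for the peak."""
    return peak * day_hours


def break_even_price_ratio(peak, baseline, peak_hours=3):
    """How many times more a rented unit-hour may cost than an owned
    unit-hour before renting stops being cheaper (simplified model)."""
    return owned_hours(peak) / rented_hours(peak, baseline, peak_hours)


print(rented_hours(8, 1))                        # 45
print(owned_hours(8))                            # 192
print(round(break_even_price_ratio(8, 1), 2))    # 4.27
print(round(break_even_price_ratio(16, 2), 2))   # 4.27
print(round(break_even_price_ratio(8, 2), 2))    # 2.91
```

Under this model, a 16-to-2 pattern breaks even at exactly the same price ratio as 8-to-1, which is why doubling the baseline doubles the peak you need to justify EC2. Holding the peak at 8 while the baseline rises to 2 drops the break-even ratio from about 4.3 to about 2.9, so the rented hours have to be much cheaper relative to owned hardware for EC2 to win.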

One really key disclaimer here: my calculations did not factor in bandwidth. The best part of EC2, as far as I can tell, is that the bandwidth charges are really low compared to what you are going to see in your own data center, so if bandwidth is going to be a major cost factor, that does tip things in favor of EC2. They even announced some price cuts coming next month, so that angle gets even better for the EC2 route. I have to admit that, with the progress of technology, I’m really surprised that bandwidth prices don’t seem to have dropped much at the big co-location centers.

In any case, back to the main point: ops in EC2 are not free. The ops tasks are different than with a conventional data center, but it’s not at all clear that one or the other requires less time on ops-type tasks. These types of services have great potential, but right now they are still in their infancy. Building a “conventional” site is something with very well-established practices around it, both for site architecture and for ops. Granted, especially in this area there are some EC2 experts, but I’d really think twice about tackling it unless you have one of those experts on your team. When you look at your new business, do you want to distinguish yourself with your new service for your customers, or by innovating in how you host it?
