Large Data Sets: A Case Against the Public Cloud
Data sets are getting larger and more ubiquitous by the day. Scientists can collect petabytes of data from large experiments and run complex analytics on them using vast amounts of resources, and these data sets are being generated on an ongoing basis. Recently Amazon announced their HPC offering, with beefy machines and the promise of low latency and high bandwidth between VMs. The problem is that moving and storing data in the cloud is very expensive. If the data were stored in S3, you're looking at 10 cents per gigabyte (starting November 1st) to move the data in, plus a storage cost of 5.5 cents per gigabyte a month (at the cheapest). If you're dealing with just a terabyte of data, that comes to over 150 dollars for the first month: roughly 100 to move the data in and 55 a month to keep it there. Want to start talking about petabyte data sets? Multiply that by a thousand. As I have already mentioned, these data sets will only grow larger, and with the data alone starting to cost that much, it might be a wise decision to roll your own cloud with something like Eucalyptus, or go cloud-less altogether.
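To make the arithmetic concrete, here is a rough sketch in Python that uses only the two rates quoted above. The function name `s3_cost` and the constants are my own, a terabyte is taken as 1,000 gigabytes, and both rates are applied flat across the whole data set. Real pricing is tiered and includes other charges, so treat this as a back-of-the-envelope estimate and not a billing calculator.

```python
# Rates as quoted in this post; assumed flat across all volumes (a simplification).
TRANSFER_IN_PER_GB = 0.10     # dollars per gigabyte, paid once to move data in
STORAGE_PER_GB_MONTH = 0.055  # dollars per gigabyte per month, cheapest rate quoted


def s3_cost(gigabytes, months=1):
    """Return (transfer_in, storage, total) in dollars for a flat-rate estimate."""
    transfer_in = gigabytes * TRANSFER_IN_PER_GB
    storage = gigabytes * STORAGE_PER_GB_MONTH * months
    return transfer_in, storage, transfer_in + storage


if __name__ == "__main__":
    # Decimal units assumed: 1 TB = 1,000 GB and 1 PB = 1,000,000 GB.
    for label, gb in [("1 TB", 1_000), ("1 PB", 1_000_000)]:
        transfer_in, storage, total = s3_cost(gb)
        print(f"{label}: ${transfer_in:,.0f} to move in, "
              f"${storage:,.0f} per month to store, "
              f"${total:,.0f} for the first month")
```

Under these assumptions a terabyte costs about 155 dollars for the first month and a petabyte about 155,000. Only the storage part recurs, so passing a larger `months` value shows how the bill grows for data that just sits there.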
While I was interning at Lawrence Livermore National Labs this past summer, I kept wondering what they could possibly do with a private cloud infrastructure. They aim to squeeze every iota of computation out of those machines, so adding a layer of virtualization had better bring some great benefits. The application area seems narrow enough that there really is not much need to be dynamic about the images one runs. One possible benefit would be letting users specify whichever OS and application set they want to boot, but even this can be done simply with diskless booting over NFS.