Large Data Sets: A Case Against the Public Cloud

Data sets are getting larger and more ubiquitous by the day. Scientists can collect petabytes of data from large experiments and run complex analytics on them using vast amounts of resources, and these data sets are being generated on an ongoing basis. Amazon recently announced its HPC offering, with beefy machines and the promise of low latency and high bandwidth between VMs. The problem is that moving and storing data in the cloud is very expensive. If the data were stored in S3, you're looking at 10 cents per gigabyte (starting November 1st) to move the data in, along with a storage cost of 5.5 cents per gigabyte (at the cheapest) per month. For just a terabyte of data, that is over 150 dollars in the first month as a starting point. Want to start talking about petabyte data sets? Multiply that by a thousand. As I have already mentioned, these data sets will only grow larger, and with the data alone costing that much it might be wise to roll your own cloud with something like Eucalyptus, or go cloud-less altogether.
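To make the arithmetic above concrete, here is a minimal sketch using only the two prices quoted in this post. The function names are mine, and treating a terabyte as 1000 gigabytes is an assumption; with 1024 the totals come out slightly higher.

```python
# Prices quoted in the post, in US dollars per gigabyte.
TRANSFER_IN_PER_GB = 0.10      # one-time cost to move the data in
STORAGE_PER_GB_MONTH = 0.055   # cheapest storage tier, per month

GB_PER_TB = 1000               # assumption: decimal units
GB_PER_PB = 1000 * GB_PER_TB


def first_month_cost(gigabytes):
    """Transfer-in plus the first month of storage, in dollars."""
    return gigabytes * (TRANSFER_IN_PER_GB + STORAGE_PER_GB_MONTH)


def monthly_storage_cost(gigabytes):
    """Recurring storage cost for every month after the first, in dollars."""
    return gigabytes * STORAGE_PER_GB_MONTH


for label, size_gb in (("1 TB", GB_PER_TB), ("1 PB", GB_PER_PB)):
    print(label,
          "first month:", round(first_month_cost(size_gb), 2),
          "each later month:", round(monthly_storage_cost(size_gb), 2))
```

Note that the transfer fee is paid once, so it is the storage line that keeps growing as the data set does.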

While I was interning at Lawrence Livermore National Laboratory this past summer, I kept wondering what they could possibly do with a private cloud infrastructure. They aim to squeeze every iota of computation out of those machines, so adding a layer of virtualization had better bring some great benefits. The application area seems small enough that there really was not much need to be dynamic in the images one runs. One possible benefit was to allow users to specify whichever OS and application set they wanted to boot, but even this can be done simply with diskless booting over NFS.

Hadoop Cluster Stalls on Startup

Our diskless-boot Hadoop cluster had a very weird problem during startup. After running the start-all script, all the logs would stop at:

2010-09-20 15:25:26,061 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50030
2010-09-20 15:25:26,061 INFO org.mortbay.log: jetty-6.1.14

And nothing would come up for some time. After a while, if we were lucky, we would see some nodes come up. Also, if the filesystem was touched on a particular node, that node would all of a sudden come online. After some online searching, my co-worker Brian Batinich found that the problem was the following:

Line 504 of ${HADOOP_HOME}/src/hdfs/org/apache/hadoop/hdfs/server/datanode/DataNode.java makes a call to a SecureRandom object to get a random number to use as part of the storage ID string for the DataNode. The random number is used with the IP address of the node and the current time to 'guarantee' a unique string for the DataNode. According to the Java API, the SecureRandom object "must" produce non-deterministic output. The hang is a result of the SecureRandom object waiting for the kernel's entropy pool to fill up with strong enough random data. There isn't a time limit for the object, so it will just wait forever if it has to.

The jdk1.5 version that is bundled with RHEL is unaffected by the SecureRandom hang, while jdk1.6 requires the random number generator as called by the Hadoop code. I've tried the test with updates 12, 20, and 21 of jdk1.6 and they all give the same result.

The fix was to run the following line to produce the random numbers needed:
/sbin/rngd -r /dev/urandom -o /dev/random -f -t 1 &

 

Once this was running on all the machines, the NameNode, JobTracker, and friends came up within a couple of seconds.
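If you suspect the same stall, one quick check is how much entropy the kernel thinks it has. The sketch below reads the Linux file /proc/sys/kernel/random/entropy_avail. The function names and the 200-bit threshold are my own assumptions for illustration, not anything Hadoop or the JDK defines.

```python
from pathlib import Path

# Linux exposes the kernel's entropy estimate (in bits) through this file.
ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"


def read_entropy_avail(path=ENTROPY_AVAIL):
    """Return the entropy estimate stored in the given file as an integer."""
    return int(Path(path).read_text().strip())


def looks_starved(bits, threshold=200):
    """Rough heuristic: True when the estimate is below the threshold.

    The default threshold is an arbitrary, hypothetical value.
    """
    return bits < threshold
```

Calling looks_starved(read_entropy_avail()) on a freshly booted diskless node before and after starting rngd should show whether the pool is what the daemons are waiting on.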

Getting the Most Out of Your Hardware for your MapReduce Job

When trying to squeeze the most out of your hardware, you want to hit one of three bottlenecks: CPU, disk, or memory. I'm currently configuring a cluster with Fusion-IO drives that can give up to 800MB/s for reading and writing. I want to run my jobs and either hit 200k reads and writes per second or max out my CPUs. I get this information from running "iostat -x 5", which gives statistics every five seconds.
I had been under-utilizing the Fusion drives, getting around 30k writes per second, with CPU utilization under 50% (each node has 16 cores). I went ahead and increased the number of mappers per machine incrementally, up to 24 mappers. The problem I then saw was a slew of errors and warnings coming out of Hadoop:

“Unable to create a new native thread”
“Could not obtain block”
“Task process exit with non-zero status”
 

This is because I've run out of memory (each machine has 24 GB of RAM) and my system does not have swap. TaskTrackers slowly start getting blacklisted and the job completely fails. What sucks is that the best configuration is very job dependent. Some jobs are CPU bound while others are IO bound. In my case all jobs are memory bound.

With 20 or so mappers per node, iostat now shows over 60k read and write requests per second on the drives and a CPU utilization of 75 percent.
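Since memory was the limit here, a back-of-the-envelope budget helps pick a starting point before tuning by hand. This is only a rough sketch: the 24 GB figure is from this post, but the heap sizes, the memory reserved for the OS and Hadoop daemons, and the per-JVM overhead are all assumed values that you would need to measure on your own cluster.

```python
def max_task_slots(ram_mb, child_heap_mb, reserved_mb=4096, jvm_overhead_mb=128):
    """Estimate how many child JVMs fit in physical RAM with no swap.

    reserved_mb (OS, DataNode, TaskTracker) and jvm_overhead_mb (memory a
    JVM uses beyond its heap) are hypothetical defaults, not measurements.
    """
    usable_mb = ram_mb - reserved_mb
    if usable_mb <= 0:
        return 0
    return usable_mb // (child_heap_mb + jvm_overhead_mb)


ram_mb = 24 * 1024  # each node in this post has 24 GB of RAM
for heap_mb in (512, 1024):
    print(heap_mb, "MB heap ->", max_task_slots(ram_mb, heap_mb), "slots")
```

Whatever number comes out covers mappers and reducers together, and watching iostat and memory use is still needed to confirm it.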

Child JVM Heap Memory
I was also running into some issues with child JVMs not having enough heap space when running the wordcount benchmark. This was solved by setting
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512M</value>
  </property>
in the mapred-site.xml configuration file. 256MB looks to be enough for the sort example, though.

Trying to build Hive

I had a problem building Hive using ant 1.6.5. The problem seemed similar to the one I hit when trying to build Hadoop with an old version of ant, so I pointed the build at my local 1.8.1 version of ant.

Using the old version:
ant package
Buildfile: build.xml

BUILD FAILED
/home/user/hive/hive/build.xml:52: Class org.apache.tools.ant.taskdefs.ConditionTask doesn't support the nested "matches" element.
Total time: 0 seconds

Using the new version:
${ANT_HOME}/bin/ant package

A new problem occurred, with the error output being:
Buildfile: /home/user/hive/build.xml


install-hadoopcore-internal:

build_shims:
     [echo] Compiling shims against hadoop 0.17.2.1 (/home/user/hive/build/hadoopcore/hadoop-0.17.2.1)
    [javac] /home/user/hive/shims/build.xml:48: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 5 source files to /home/user/hive/build/shims/classes
    [javac] /home/user/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:105: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] /home/user/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:118: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] /home/user/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:131: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] 3 errors

BUILD FAILED
/home/user/hive/build.xml:162: The following error occurred while executing this line:

/home/user/hive/build.xml:105: The following error occurred while executing this line:
/home/user/hive/shims/build.xml:57: The following error occurred while executing this line:
/home/user/hive/shims/build.xml:48: Compile failed; see the compiler error output for details.

Total time: 8 seconds

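It looked to be a problem with the version of Java I was using (an older javac such as Java 5 typically rejects @Override on a method that implements an interface method, while Java 6 accepts it). I set my JAVA_HOME path to point to a newer version of Java: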
export JAVA_HOME="/usr/global/jdk/jdk1.6.0_20/"

Then I reran my previous command:
${ANT_HOME}/bin/ant package
BUILD SUCCESSFUL
Total time: 2 minutes 51 seconds

Now the Hive build was a great success =)

AppScale 1.4 just released

We've finally got our newest version of AppScale out, thanks to the hard work of the AppScale team. Check out the new and improved AppScale at http://code.google.com/p/appscale. Here are some cool new features:
  1. Install AppScale using apt-get
  2. Advanced placement of components can be specified in a YAML file
  3. Transaction support for all the backends
  4. Java GAE has more supported APIs
  5. HTTPS support
  6. and much more...

You Should Blog Too

How many times have you run into a problem that other people had already run into, solved, and then blogged about? Now how many times have you not found the solution online, solved it yourself, but not shared your trials and tribulations with the community? Blogging about your technical feats, no matter how small, may help someone later on. Furthermore, you may run into the same problem later on but lack documentation on how you previously solved it. Even if others have blogged about the solution you've come up with, blog about it anyway. Giving multiple viewpoints helps others comprehend the material. I know it helps me, and it gives me a feeling of joy knowing that I've contributed back to the community.

Patching and building Hadoop


Patching Hadoop
The problem I was running into was a NumberFormatException. The exception was being thrown when a TaskTracker tried to parse the output of "df -k" and the entry for the used percentage came back as '-'. The amount used was also coming back as a negative number. This is clearly some issue with either df or the OS, but I had to circumvent it by patching Hadoop rather than reformatting the drive or what have you.
Here is a link to someone with the same issue:

The code change was in ${HADOOP_DIR}/src/core/org/apache/hadoop/fs/DF.java
I made the following changes:
  ...
  // df reported a negative "used" value on this machine, so flip the sign
  if (this.used < 0) {
    this.used = this.used * -1;
  }
  this.available = Long.parseLong(tokens.nextToken()) * 1024;
  try {
    this.percentUsed = Integer.parseInt(tokens.nextToken());
  }
  catch (NumberFormatException nfe) {
    // df printed '-' instead of a number for the used percentage
    this.percentUsed = 
  }
  ...
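The same defensive parsing can be tried outside Hadoop. The sketch below is my own illustration in Python, not the Hadoop code: parse_df_line and its dictionary keys are hypothetical names, it assumes the usual six whitespace-separated columns of a "df -k" data line, and it reports the percentage as None when df prints '-' rather than guessing a fallback value.

```python
def parse_df_line(line):
    """Parse one data line of `df -k` output, tolerating the odd values above."""
    fields = line.split()
    # Assumes six columns and a mount point without spaces.
    filesystem, _capacity_kb, used_kb, available_kb, percent, mount = fields[:6]

    # df reports kilobytes; a negative "used" value has its sign flipped.
    used_bytes = abs(int(used_kb)) * 1024
    available_bytes = int(available_kb) * 1024

    try:
        percent_used = int(percent.rstrip("%"))
    except ValueError:
        # df printed '-' (or something else non-numeric) for the percentage.
        percent_used = None

    return {
        "filesystem": filesystem,
        "used": used_bytes,
        "available": available_bytes,
        "percent_used": percent_used,
        "mount": mount,
    }


print(parse_df_line("/dev/sdb1 103212320 -5120 98000000 - /data"))
```

Returning None keeps the bad reading visible to the caller instead of hiding it behind a made-up number.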

Building Hadoop

The machine I was running on had an old version of ant, and the build failed:

BUILD FAILED
/usr/global/hadoop/hadoop-0.20.2-dit/build.xml:1624: Class org.apache.tools.ant.taskdefs.ConditionTask doesn't support the nested "typefound" element.
I went and downloaded a new version from the ant website.

I untarred the tarball, then set ANT_HOME to the top-level directory where it was unpacked. In the top-level Hadoop folder I ran “${ANT_HOME}/bin/ant jar”
The build was successful. I backed up the old core jar file and replaced it with the newly built core jar from the build directory, renamed to match the old file name. The new jar comes out with a different name that carries a newer version number and “dev”.
