How To Use a Patched Hadoop with HBase 0.89

The new HBase 0.89 dev release uses Maven for its build process. Previously, when patching Hadoop, copying the patched jar into the lib directory was enough. With Maven you must modify the pom.xml so that it uses the local Hadoop core jar file.

The correct way to do it is to follow 

Running
mvn -DskipTests install
with the pom.xml pointing to my Hadoop jar, as directed by the above link, did not work. Maven printed:
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Failed to resolve artifact.

Missing:
----------
1) org.apache.hadoop:hadoop-core:jar:0.20.2

  Try downloading the file manually from the project website.

  Then, install it using the command: 
      mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-core -Dversion=0.20.2 -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there: 
      mvn deploy:deploy-file -DgroupId=org.apache.hadoop -DartifactId=hadoop-core -Dversion=0.20.2 -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency: 
   1) org.apache.hbase:hbase:jar:0.89.20100924
   2) org.apache.hadoop:hadoop-core:jar:0.20.2

----------
1 required artifact is missing.

for artifact: 
  org.apache.hbase:hbase:jar:0.89.20100924

from the specified remote repositories:

[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3 seconds
[INFO] Finished at: Thu Nov 18 10:45:00 UTC 2010
[INFO] Final Memory: 34M/205M
[INFO] ------------------------------------------------------------------------

I tried the first recommended command (mvn install:install-file), but that failed as well:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Error building POM (may not be this project's POM).

Project ID: com.agilejava.docbkx:docbkx-maven-plugin
POM Location: Artifact [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.10]
Validation Messages:

    [0]  'dependencies.dependency.version' is missing for com.agilejava.docbkx:docbkx-maven-base:jar

Reason: Failed to validate POM for project com.agilejava.docbkx:docbkx-maven-plugin at Artifact [com.agilejava.docbkx:docbkx-maven-plugin:pom:2.0.10]

Instead, I kept the original pom.xml file and overwrote the jar file in my local Maven repository with the patched one:
cp ${APPSCALE_HOME}/AppDB/hadoop/hadoop-${HADOOP_VER}/hadoop-${HADOOP_VER}-core.jar ~/.m2/repository/org/apache/hadoop/hadoop-core/0.20.3-append-r964955-1240/hadoop-core-0.20.3-append-r964955-1240.jar

That's a hack, but it works. Now HBase picks up my version of the hadoop jar. I'll try doing it the right way some other time.

There was also an incompatibility with the newer version of HBase: each column family name came back with a ":" appended. Stripping off that last character was the final step in upgrading AppScale to the newest HBase version.
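
The stripping step can be sketched in Java as follows. This is only an illustration: the class and method names are hypothetical, and it is not the actual AppScale code.

```java
/** Hypothetical helper for illustration; not taken from AppScale or HBase. */
final class ColumnFamilyNames {
    private ColumnFamilyNames() {}

    /** Returns the name without a single trailing ':'; other names pass through unchanged. */
    static String stripTrailingColon(String family) {
        if (family != null && family.endsWith(":")) {
            return family.substring(0, family.length() - 1);
        }
        return family;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingColon("info:"));
        System.out.println(stripTrailingColon("info"));
    }
}
```

Checking for the trailing character before removing it keeps the helper safe to use against HBase versions that do not append the ":".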

Hadoop Cluster Stalls on Startup

Our diskless-boot Hadoop cluster had a very weird problem during startup. After running the start-all script, all the logs would stop at:

2010-09-20 15:25:26,061 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50030
2010-09-20 15:25:26,061 INFO org.mortbay.log: jetty-6.1.14

Nothing more would come up for some time. After a while, if we were lucky, some nodes would come up. Also, if the filesystem was touched on a particular node, that node would all of a sudden come online. After some online searching, my co-worker Brian Batinich found that the problem was the following:

Line 504 of ${HADOOP_HOME}/src/hdfs/org/apache/hadoop/hdfs/server/datanode/DataNode.java makes a call to a SecureRandom object to get a random number to use as part of the storage ID string for the DataNode. The random number is used with the IP address of the node and the current time to 'guarantee' a unique string for the DataNode. According to the Java API, the SecureRandom object "must" produce non-deterministic output. The hang is a result of the SecureRandom object waiting for the kernel's entropy pool to fill up with strong enough random data. There isn't a time limit for the object, so it will just wait forever if it has to.

The jdk1.5 version that is bundled with RHEL appears to be unaffected by the SecureRandom hang, while jdk1.6 requires the random number generator as called by the Hadoop code. I've tried the test with updates 12, 20, and 21 of jdk1.6 and they all give the same result.
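
To make the description above concrete, here is a minimal sketch of how such a storage ID could be put together. It is not the DataNode source: the class name, method name, "DS-" prefix, and field order are assumptions for illustration. Only the ingredients (a SecureRandom number, the node's IP address, and the current time) come from the explanation above.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

/** Illustrative sketch only; names and ID layout are assumptions, not Hadoop's code. */
final class StorageIdSketch {
    private StorageIdSketch() {}

    static String newStorageId(String ip, long nowMillis) {
        int rand;
        try {
            // On some JDK and kernel combinations this call can block until the
            // kernel entropy pool has enough data, which matches the stall described above.
            rand = SecureRandom.getInstance("SHA1PRNG").nextInt(Integer.MAX_VALUE);
        } catch (NoSuchAlgorithmException e) {
            // Assumption for this sketch: fall back to the platform default generator.
            rand = new SecureRandom().nextInt(Integer.MAX_VALUE);
        }
        return "DS-" + rand + "-" + ip + "-" + nowMillis;
    }

    public static void main(String[] args) {
        System.out.println(newStorageId("10.0.0.1", System.currentTimeMillis()));
    }
}
```

On a freshly booted diskless node with little disk or keyboard activity, the entropy pool fills slowly, which would explain why touching the filesystem made a node come online.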

The fix was to run the following command on every node, which feeds the kernel's entropy pool (/dev/random) from /dev/urandom:
/sbin/rngd -r /dev/urandom -o /dev/random -f -t 1 &

Once this was running on all the machines, the NameNode, JobTracker, and friends came up within a couple of seconds.

Getting the Most Out of Your Hardware for your MapReduce Job

When trying to squeeze the most out of your hardware, you want to hit one of three bottlenecks: CPU, disk, or memory. I'm currently configuring a cluster with Fusion-IO drives that can deliver up to 800 MB/s for reading and writing. I want my jobs to either reach 200k reads and writes per second or max out my CPUs. I get this information from running "iostat -x 5", which prints statistics every five seconds.
So far I've been under-utilizing the Fusion drives, getting around 30k writes per second, while my CPU utilization stays below 50% (each node has 16 cores). I increased the number of mappers per machine incrementally, up to 24 mappers. The problem is that Hadoop then produces a slew of errors and warnings:

“Unable to create a new native thread”
“Could not obtain block”
“Task process exit with non-zero status”

This is because I've run out of memory (each machine has 24 GB of RAM) and my system does not have swap. TaskTrackers slowly start getting blacklisted and the job fails completely. What sucks is that the best configuration is very job dependent: some jobs are CPU bound while others are IO bound. In my case all jobs are memory bound.

With 20 or so mappers per node, iostat now shows over 60k read and write requests on the drives and a CPU utilization of 75 percent.
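
Since memory is the limiting resource here, a rough upper bound on the number of task slots per node can be estimated before launching a job. The sketch below is only back-of-the-envelope arithmetic. The helper name is hypothetical, and the 4 GB reserved for the OS and Hadoop daemons and the 1 GB per task are made-up figures, not measurements from this cluster.

```java
/** Rough sizing helper for illustration; all inputs are hypothetical. */
final class SlotSizing {
    private SlotSizing() {}

    /** How many tasks of perTaskMb fit into the RAM left after reservedMb is set aside. */
    static int maxTaskSlots(long totalRamMb, long reservedMb, long perTaskMb) {
        if (perTaskMb <= 0) {
            throw new IllegalArgumentException("perTaskMb must be positive");
        }
        long usableMb = Math.max(0, totalRamMb - reservedMb);
        return (int) (usableMb / perTaskMb);
    }

    public static void main(String[] args) {
        // 24 GB node, 4 GB reserved (assumed), 1 GB per task including JVM overhead (assumed)
        System.out.println(maxTaskSlots(24 * 1024, 4 * 1024, 1024)); // prints 20
    }
}
```

The per-task figure should cover the whole process footprint (heap plus thread stacks and other JVM overhead), not just the -Xmx value, because the node has no swap to absorb the difference.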

Child JVM Heap Memory
I was also running into issues with child JVMs not having enough heap space when running the wordcount benchmark. This was solved by setting
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512M</value>
  </property>
in the mapred-site.xml configuration file. For the sort example, 256 MB looks to be enough.

Trying to build Hive

I had a problem building Hive with ant 1.6.5. The problem looked similar to the one I hit when building Hadoop with an old version of ant, so I pointed the build at my local ant 1.8.1 instead.

Using the old version:
ant package
Buildfile: build.xml

BUILD FAILED
/home/user/hive/hive/build.xml:52: Class org.apache.tools.ant.taskdefs.ConditionTask doesn't support the nested "matches" element.
Total time: 0 seconds

Using the new version:
${ANT_HOME}/bin/ant package

This produced a new error:
Buildfile: /home/user/hive/build.xml


install-hadoopcore-internal:

build_shims:
     [echo] Compiling shims against hadoop 0.17.2.1 (/home/user/hive/build/hadoopcore/hadoop-0.17.2.1)
    [javac] /home/user/hive/shims/build.xml:48: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
    [javac] Compiling 5 source files to /home/user/hive/build/shims/classes
    [javac] /home/user/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:105: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] /home/user/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:118: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] /home/user/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:131: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] 3 errors

BUILD FAILED
/home/user/hive/build.xml:162: The following error occurred while executing this line:

/home/user/hive/build.xml:105: The following error occurred while executing this line:
/home/user/hive/shims/build.xml:57: The following error occurred while executing this line:
/home/user/hive/shims/build.xml:48: Compile failed; see the compiler error output for details.

Total time: 8 seconds

It looked to be a problem with the version of Java I was using, so I set JAVA_HOME to point to a newer JDK:
export JAVA_HOME="/usr/global/jdk/jdk1.6.0_20/"

Then I reran my previous command:
${ANT_HOME}/bin/ant package
BUILD SUCCESSFUL
Total time: 2 minutes 51 seconds

Now the Hive build was a great success =)
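
One likely explanation for the "@Override" errors, offered here as an assumption rather than something verified against the Hive source: a Java 5 compiler rejects @Override on a method that implements an interface method, while Java 6 and later accept it. A minimal illustration with hypothetical names:

```java
/** Hypothetical interface standing in for a shim API. */
interface Shim {
    String version();
}

/**
 * Compiles with JDK 6 and later. A JDK 5 javac reports
 * "method does not override a method from its superclass" here.
 */
final class ExampleShim implements Shim {
    @Override // annotating an interface implementation is only accepted from Java 6 on
    public String version() {
        return "example";
    }

    public static void main(String[] args) {
        System.out.println(new ExampleShim().version());
    }
}
```

If that is what happened, the three reported lines in Hadoop17Shims.java would be methods implementing an interface, and switching JAVA_HOME to a 1.6 JDK resolves all of them at once.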

Patching and building Hadoop


Patching Hadoop
The problem I was running into was a NumberFormatException. It was thrown when a TaskTracker tried to parse the output of “df -k” and the used-percentage entry came back as ‘-’. The amount used was also reported as a negative number. This is clearly an issue with either df or the OS, but I worked around it by patching Hadoop rather than reformatting the drive or what have you.
Here is a link to someone with the same issue:

The code change was in ${HADOOP_DIR}/src/core/org/apache/hadoop/fs/DF.java
I made the following changes:
  ...
  if (this.used < 0) {
    this.used = this.used * -1;
  }
  this.available = Long.parseLong(tokens.nextToken()) * 1024;
  try {
    this.percentUsed = Integer.parseInt(tokens.nextToken());
  } catch (NumberFormatException nfe) {
    this.percentUsed = 
  }
  ...
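
For readers who want to try the idea outside of Hadoop, here is a self-contained sketch of a tolerant parser for one data line of “df -k” output. It is not the patched DF.java: the class name is hypothetical, and the fallback that derives the percentage from the used and available columns is my own assumption, not necessarily what the patch above used.

```java
import java.util.StringTokenizer;

/** Hypothetical stand-alone parser for one data line of `df -k`; not Hadoop's DF class. */
final class DfEntry {
    final long capacity;   // bytes
    final long used;       // bytes
    final long available;  // bytes
    final int percentUsed;

    private DfEntry(long capacity, long used, long available, int percentUsed) {
        this.capacity = capacity;
        this.used = used;
        this.available = available;
        this.percentUsed = percentUsed;
    }

    static DfEntry parse(String line) {
        // '%' is treated as a delimiter so that "42%" tokenizes as "42"
        StringTokenizer tokens = new StringTokenizer(line, " \t\n\r\f%");
        tokens.nextToken(); // filesystem name, not needed here
        long capacity = Long.parseLong(tokens.nextToken()) * 1024;
        // A negative "used" value is flipped to positive, as in the patch above
        long used = Math.abs(Long.parseLong(tokens.nextToken()) * 1024);
        long available = Long.parseLong(tokens.nextToken()) * 1024;
        int percentUsed;
        try {
            percentUsed = Integer.parseInt(tokens.nextToken());
        } catch (NumberFormatException nfe) {
            // Assumption: when df prints '-', derive the percentage from the other columns
            long total = used + available;
            percentUsed = total > 0 ? (int) Math.round(100.0 * used / total) : 0;
        }
        return new DfEntry(capacity, used, available, percentUsed);
    }

    public static void main(String[] args) {
        DfEntry e = parse("/dev/sda1 1000 -250 750 - /");
        System.out.println(e.used + " bytes used, " + e.percentUsed + "% used");
    }
}
```

Catching only NumberFormatException keeps genuinely malformed lines (for example, ones with missing columns) failing loudly instead of being silently accepted.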

Building Hadoop

The machine I was building on had an old version of ant, and the build failed:

BUILD FAILED
/usr/global/hadoop/hadoop-0.20.2-dit/build.xml:1624: Class org.apache.tools.ant.taskdefs.ConditionTask doesn't support the nested "typefound" element.
I downloaded a newer version from the ant website.

I untarred the tarball and set ANT_HOME to the top-level directory it was extracted into. In the top-level Hadoop folder I ran “${ANT_HOME}/bin/ant jar”.
The build was successful. I backed up the old core jar file and replaced it with the newly built core jar from the build directory, renamed to the old file's name. (The newly built jar has a different name, with a newer version number and “dev” in it.)
