Building a Hadoop cluster

I've recently had to build a Hadoop cluster for a class in information retrieval. My final project involved building a Hadoop cluster.

Here are some of my notes on configuring the nodes in the cluster.

These links on configuring a single node cluster and multi node cluster were the most helpful.

I downloaded the latest Hadoop distribution then moved it into /hadoop. I had problems with this latest distribution (v.21) so I used v.20 instead.

Here are the configuration files I changed:

core-site.xml:

  
    fs.default.name
    hdfs://master:9000
  
  
    hadoop.tmp.dir
    /hadoop/tmp
    A base for other temporary directories.

hadoop-env.sh:

# Variables required by Mahout
export HADOOP_HOME=/hadoop
export HADOOP_CONF_DIR=/hadoop/conf
export MAHOUT_HOME=/Users/rpark/mahout
PATH=/hadoop/bin:/Users/rpark/mahout/bin:$PATH

# The java implementation to use.  Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home

hdfs-site.xml:

  
    dfs.replication
    3

mapred-site.xml:

  
    mapred.job.tracker
    master:9001

masters:

master

slaves:

master
slave1
slave2
slave3
slave4

Be sure to enable password-less ssh between master and slaves. Use this command to create an SSH key with an empty password:

ssh-keygen -t rsa -P ""

Enable password-less ssh login for the master to itself:

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Then copy id_rsa.pub to each slave and do the same with each slave's authorized_keys file.

I ran into a few errors along the way. Here is an error that gave me a lot of trouble in the datanode logs:

2011-05-08 01:04:30,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1804860059826635300_1001 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_1804860059826635300_1001 is valid, and cannot be written to.

The solution was to use hostnames every time I referenced a host, either itself or a remote host. I set a host's own name in /etc/hostname and the others in /etc/hosts. I used these hostnames in /hadoop/conf/masters, slaves, and the various conf files.

Every so often I ran into this error in the datanode logs:

... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094
        at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281)
        at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121)
        at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230)
        at org.apache.hadoop.dfs.DataNode.(DataNode.java:199)
        at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146)
        at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167)
        at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)

I fixed this by deleting tmp/dfs/data on the datanodes where I saw the error. Unfortunately, I had to reformat the HDFS volume after I did this.

I had to raise the ulimit for open files. On Ubuntu nodes I edited /etc/security/limits.conf:

rpark  soft nofile  8192
rpark  hard nofile  8192

For OS X nodes I just edited ~/.profile:

ulimit -n 8192

I ran into this error when copying data into HDFS:

could only be replicated to 0 nodes, instead of 1

The solution was simply to wait for the datanode to start up. I usually saw the error when I immediately copied data into HDFS after starting the cluster.

Port 50070 on the namenode gave me a Web UI to tell me how many nodes were in the cluster. This was very useful.

Thoughts on Product Management

Search This Blog

Building a Hadoop cluster

Comments

Post a Comment

Popular posts from this blog

Connecting to SQL Server from OS X perl

The #1 Mistake Made by Product People at All Levels