I've recently had to build a Hadoop cluster for a class in information retrieval. My final project involved building a Hadoop cluster.
Here are some of my notes on configuring the nodes in the cluster.
These links on configuring a single node cluster and multi node cluster were the most helpful.
I downloaded the latest Hadoop distribution then moved it into /hadoop. I had problems with this latest distribution (v.21) so I used v.20 instead.
Here are the configuration files I changed:
core-site.xml:
hadoop-env.sh:
hdfs-site.xml:
mapred-site.xml:
masters:
slaves:
Be sure to enable password-less ssh between master and slaves. Use this command to create an SSH key with an empty password:
Enable password-less ssh login for the master to itself:
Then copy id_rsa.pub to each slave and do the same with each slave's authorized_keys file.
I ran into a few errors along the way. Here is an error that gave me a lot of trouble in the datanode logs:
The solution was to use hostnames every time I referenced a host, either itself or a remote host. I set a host's own name in /etc/hostname and the others in /etc/hosts. I used these hostnames in /hadoop/conf/masters, slaves, and the various conf files.
Every so often I ran into this error in the datanode logs:
I fixed this by deleting tmp/dfs/data on the datanodes where I saw the error. Unfortunately, I had to reformat the HDFS volume after I did this.
I had to raise the ulimit for open files. On Ubuntu nodes I edited /etc/security/limits.conf:
For OS X nodes I just edited ~/.profile:
I ran into this error when copying data into HDFS:
The solution was simply to wait for the datanode to start up. I usually saw the error when I immediately copied data into HDFS after starting the cluster.
Port 50070 on the namenode gave me a Web UI to tell me how many nodes were in the cluster. This was very useful.
Here are some of my notes on configuring the nodes in the cluster.
These links on configuring a single node cluster and multi node cluster were the most helpful.
I downloaded the latest Hadoop distribution then moved it into /hadoop. I had problems with this latest distribution (v.21) so I used v.20 instead.
Here are the configuration files I changed:
core-site.xml:
fs.default.name hdfs://master:9000 hadoop.tmp.dir /hadoop/tmp A base for other temporary directories.
hadoop-env.sh:
# Variables required by Mahout export HADOOP_HOME=/hadoop export HADOOP_CONF_DIR=/hadoop/conf export MAHOUT_HOME=/Users/rpark/mahout PATH=/hadoop/bin:/Users/rpark/mahout/bin:$PATH # The java implementation to use. Required. export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
hdfs-site.xml:
dfs.replication 3
mapred-site.xml:
mapred.job.tracker master:9001
masters:
master
slaves:
master slave1 slave2 slave3 slave4
Be sure to enable password-less ssh between master and slaves. Use this command to create an SSH key with an empty password:
ssh-keygen -t rsa -P ""
Enable password-less ssh login for the master to itself:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Then copy id_rsa.pub to each slave and do the same with each slave's authorized_keys file.
I ran into a few errors along the way. Here is an error that gave me a lot of trouble in the datanode logs:
2011-05-08 01:04:30,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1804860059826635300_1001 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_1804860059826635300_1001 is valid, and cannot be written to.
The solution was to use hostnames every time I referenced a host, either itself or a remote host. I set a host's own name in /etc/hostname and the others in /etc/hosts. I used these hostnames in /hadoop/conf/masters, slaves, and the various conf files.
Every so often I ran into this error in the datanode logs:
... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094 at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281) at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121) at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230) at org.apache.hadoop.dfs.DataNode.(DataNode.java:199) at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202) at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146) at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167) at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)
I fixed this by deleting tmp/dfs/data on the datanodes where I saw the error. Unfortunately, I had to reformat the HDFS volume after I did this.
I had to raise the ulimit for open files. On Ubuntu nodes I edited /etc/security/limits.conf:
rpark soft nofile 8192 rpark hard nofile 8192
For OS X nodes I just edited ~/.profile:
ulimit -n 8192
I ran into this error when copying data into HDFS:
could only be replicated to 0 nodes, instead of 1
The solution was simply to wait for the datanode to start up. I usually saw the error when I immediately copied data into HDFS after starting the cluster.
Port 50070 on the namenode gave me a Web UI to tell me how many nodes were in the cluster. This was very useful.
https://mail.google.com/mail/u/0/#starred/FMfcgzGpHHTQTrcKQdDVTmzRJRJjJmQd
ReplyDelete