Monday, May 16, 2011

Building a Hadoop cluster

My final project for a class in information retrieval involved building a Hadoop cluster.

Here are some of my notes on configuring the nodes in the cluster.

The guides I found on configuring a single-node cluster and a multi-node cluster were the most helpful.

I downloaded the latest Hadoop distribution (v0.21) and moved it into /hadoop. I had problems with v0.21, so I used v0.20 instead.

Here are the configuration files I changed:

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>

hadoop-env.sh:
# Variables required by Mahout
export HADOOP_HOME=/hadoop
export HADOOP_CONF_DIR=/hadoop/conf
export MAHOUT_HOME=/Users/rpark/mahout
PATH=/hadoop/bin:/Users/rpark/mahout/bin:$PATH

# The java implementation to use.  Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>

masters:
master

slaves:
master
slave1
slave2
slave3
slave4
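With the config files and node lists in place, the cluster is brought up from the master. A sketch of the usual v0.20 sequence (paths assume the /hadoop install directory used above; the namenode is formatted only once, on first setup):

```shell
# One-time only: format the HDFS namespace (wipes any existing HDFS data)
/hadoop/bin/hadoop namenode -format

# Start HDFS: namenode on master, a datanode on each host in conf/slaves
/hadoop/bin/start-dfs.sh

# Start MapReduce: jobtracker on master, tasktrackers on the slaves
/hadoop/bin/start-mapred.sh
```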

Be sure to enable password-less ssh between master and slaves. Use this command to create an SSH key with an empty password:
ssh-keygen -t rsa -P ""

Enable password-less ssh login for the master to itself:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Then copy id_rsa.pub to each slave and do the same with each slave's authorized_keys file.
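One way to do that copy in a loop, assuming the same user account exists on every slave and password ssh still works at this point:

```shell
# Append the master's public key to each slave's authorized_keys
for host in slave1 slave2 slave3 slave4; do
  cat $HOME/.ssh/id_rsa.pub | \
    ssh "$host" 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
done
```

Afterwards, "ssh slave1" from the master should log in without prompting for a password.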

I ran into a few errors along the way. Here is an error that gave me a lot of trouble in the datanode logs:
2011-05-08 01:04:30,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1804860059826635300_1001 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_1804860059826635300_1001 is valid, and cannot be written to.

The solution was to use hostnames every time I referenced a host, either itself or a remote host. I set a host's own name in /etc/hostname and the others in /etc/hosts. I used these hostnames in /hadoop/conf/masters, slaves, and the various conf files.

Every so often I ran into this error in the datanode logs:
... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094
        at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281)
        at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121)
        at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230)
        at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:199)
        at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146)
        at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167)
        at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)

I fixed this by deleting tmp/dfs/data on the datanodes where I saw the error. Unfortunately, I had to reformat the HDFS volume after I did this.
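A sketch of that fix, assuming hadoop.tmp.dir is /hadoop/tmp as configured above (the dfs/data path follows whatever your config actually sets, and reformatting destroys everything stored in HDFS):

```shell
# On each affected datanode: stop the daemons, clear the stale data directory
/hadoop/bin/stop-all.sh
rm -rf /hadoop/tmp/dfs/data

# On the master: reformat HDFS, then restart the cluster
/hadoop/bin/hadoop namenode -format
/hadoop/bin/start-all.sh
```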

I had to raise the ulimit for open files. On Ubuntu nodes I edited /etc/security/limits.conf:
rpark  soft nofile  8192
rpark  hard nofile  8192

For OS X nodes I just edited ~/.profile:
ulimit -n 8192

I ran into this error when copying data into HDFS:
could only be replicated to 0 nodes, instead of 1

The solution was simply to wait for the datanode to start up. I usually saw the error when I immediately copied data into HDFS after starting the cluster.
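To check whether the datanodes have actually registered before copying data in, the dfsadmin report lists the live nodes:

```shell
# Prints cluster capacity plus one entry per live datanode;
# "Datanodes available: 0" means wait before copying data in
/hadoop/bin/hadoop dfsadmin -report
```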

The namenode's web UI on port 50070 showed how many nodes were in the cluster, which was very useful.
