Skip to main content

Building a Hadoop cluster

I've recently had to build a Hadoop cluster for a class in information retrieval. My final project involved building a Hadoop cluster.

Here are some of my notes on configuring the nodes in the cluster.

These links on configuring a single node cluster and multi node cluster were the most helpful.

I downloaded the latest Hadoop distribution then moved it into /hadoop. I had problems with this latest distribution (v.21) so I used v.20 instead.

Here are the configuration files I changed:

core-site.xml:
  
    fs.default.name
    hdfs://master:9000
  
  
    hadoop.tmp.dir
    /hadoop/tmp
    A base for other temporary directories.
  

hadoop-env.sh:
# Variables required by Mahout
export HADOOP_HOME=/hadoop
export HADOOP_CONF_DIR=/hadoop/conf
export MAHOUT_HOME=/Users/rpark/mahout
PATH=/hadoop/bin:/Users/rpark/mahout/bin:$PATH

# The java implementation to use.  Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home

hdfs-site.xml:
  
    dfs.replication
    3
  

mapred-site.xml:
  
    mapred.job.tracker
    master:9001
  

masters:
master

slaves:
master
slave1
slave2
slave3
slave4

Be sure to enable password-less ssh between master and slaves. Use this command to create an SSH key with an empty password:
ssh-keygen -t rsa -P ""

Enable password-less ssh login for the master to itself:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Then copy id_rsa.pub to each slave and do the same with each slave's authorized_keys file.

I ran into a few errors along the way. Here is an error that gave me a lot of trouble in the datanode logs:
2011-05-08 01:04:30,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1804860059826635300_1001 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_1804860059826635300_1001 is valid, and cannot be written to.

The solution was to use hostnames every time I referenced a host, either itself or a remote host. I set a host's own name in /etc/hostname and the others in /etc/hosts. I used these hostnames in /hadoop/conf/masters, slaves, and the various conf files.

Every so often I ran into this error in the datanode logs:
... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094
        at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281)
        at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121)
        at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230)
        at org.apache.hadoop.dfs.DataNode.(DataNode.java:199)
        at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146)
        at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167)
        at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)

I fixed this by deleting tmp/dfs/data on the datanodes where I saw the error. Unfortunately, I had to reformat the HDFS volume after I did this.

I had to raise the ulimit for open files. On Ubuntu nodes I edited /etc/security/limits.conf:
rpark  soft nofile  8192
rpark  hard nofile  8192

For OS X nodes I just edited ~/.profile:
ulimit -n 8192

I ran into this error when copying data into HDFS:
could only be replicated to 0 nodes, instead of 1

The solution was simply to wait for the datanode to start up. I usually saw the error when I immediately copied data into HDFS after starting the cluster.

Port 50070 on the namenode gave me a Web UI to tell me how many nodes were in the cluster. This was very useful.

Comments

  1. https://mail.google.com/mail/u/0/#starred/FMfcgzGpHHTQTrcKQdDVTmzRJRJjJmQd

    ReplyDelete

Post a Comment

Popular posts from this blog

Working with VMware vShield REST API in perl

Here is an overview of how to use perl code to work with VMware's vShield API. vShield App and Edge are two security products offered by VMware. vShield Edge has a broad range of functionality such as firewall, VPN, load balancing, NAT, and DHCP. vShield App is a NIC-level firewall for virtual machines. We'll focus today on how to use the API to programatically make firewall rule changes. Here are some of the things you can do with the API: List the current firewall ruleset Add new rules Get a list of past firewall revisions Revert back to a previous ruleset revision vShield API documentation is available here . Before we get into the API itself, let's look at what the firewall ruleset looks like. It's formatted as XML: 1.1.1.1/32 10.1.1.1/32 datacenter-2 ANY 1023 High 1 ANY < Application type="UNICAST">LDAP over SSL 636 TCP ALLOW deny 1020 Low 3 ANY IMAP 143 TCP < Action>ALLOW false Here are so

Connecting to SQL Server from OS X perl

I've been spending my coding time in the offhours working on Perl instead of Ruby. My coding time in general has been very limited, which is part of the reason for the length of time between updates. :) My latest project is to pull data out of a Microsoft SQL Server database for analysis. I'm using perl for various reasons: I need a crossplatform environment, and I need certain libraries that only work on perl. Some of the target users for my code run on Windows. I know that Ruby runs on Windows but it's not the platform of choice for Ruby developers. The vast majority seem to develop either on OS X or Linux. So Ruby on Windows isn't at the maturity that ActiveState perl is on Windows. In fact, I don't even run native perl anymore on my MacBook Pro. I've switched over to ActiveState perl because I don't need to compile anything every time I want to install new CPAN libraries. And because it's ActiveState, I'm that much more confident it will w