Monday, May 16, 2011

Building a Hadoop cluster

I've recently had to build a Hadoop cluster for a class in information retrieval. My final project involved building a Hadoop cluster.

Here are some of my notes on configuring the nodes in the cluster.

These links on configuring a single node cluster and multi node cluster were the most helpful.

I downloaded the latest Hadoop distribution then moved it into /hadoop. I had problems with this latest distribution (v.21) so I used v.20 instead.

Here are the configuration files I changed:

    A base for other temporary directories.
# Variables required by Mahout
export HADOOP_HOME=/hadoop
export HADOOP_CONF_DIR=/hadoop/conf
export MAHOUT_HOME=/Users/rpark/mahout

# The java implementation to use.  Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home





Be sure to enable password-less ssh between master and slaves. Use this command to create an SSH key with an empty password:
ssh-keygen -t rsa -P ""

Enable password-less ssh login for the master to itself:
cat $HOME/.ssh/ >> $HOME/.ssh/authorized_keys

Then copy to each slave and do the same with each slave's authorized_keys file.

I ran into a few errors along the way. Here is an error that gave me a lot of trouble in the datanode logs:
2011-05-08 01:04:30,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1804860059826635300_1001 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_1804860059826635300_1001 is valid, and cannot be written to.

The solution was to use hostnames every time I referenced a host, either itself or a remote host. I set a host's own name in /etc/hostname and the others in /etc/hosts. I used these hostnames in /hadoop/conf/masters, slaves, and the various conf files.

Every so often I ran into this error in the datanode logs:
... ERROR org.apache.hadoop.dfs.DataNode: Incompatible namespaceIDs in /app/hadoop/tmp/dfs/data: namenode namespaceID = 308967713; datanode namespaceID = 113030094
        at org.apache.hadoop.dfs.DataStorage.doTransition(
        at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(
        at org.apache.hadoop.dfs.DataNode.startDataNode(
        at org.apache.hadoop.dfs.DataNode.(
        at org.apache.hadoop.dfs.DataNode.makeInstance(
        at org.apache.hadoop.dfs.DataNode.createDataNode(
        at org.apache.hadoop.dfs.DataNode.main(

I fixed this by deleting tmp/dfs/data on the datanodes where I saw the error. Unfortunately, I had to reformat the HDFS volume after I did this.

I had to raise the ulimit for open files. On Ubuntu nodes I edited /etc/security/limits.conf:
rpark  soft nofile  8192
rpark  hard nofile  8192

For OS X nodes I just edited ~/.profile:
ulimit -n 8192

I ran into this error when copying data into HDFS:
could only be replicated to 0 nodes, instead of 1

The solution was simply to wait for the datanode to start up. I usually saw the error when I immediately copied data into HDFS after starting the cluster.

Port 50070 on the namenode gave me a Web UI to tell me how many nodes were in the cluster. This was very useful.

Working with VMware vShield REST API in perl

Here is an overview of how to use perl code to work with VMware's vShield API.

vShield App and Edge are two security products offered by VMware. vShield Edge has a broad range of functionality such as firewall, VPN, load balancing, NAT, and DHCP. vShield App is a NIC-level firewall for virtual machines.

We'll focus today on how to use the API to programatically make firewall rule changes. Here are some of the things you can do with the API:
  • List the current firewall ruleset
  • Add new rules
  • Get a list of past firewall revisions
  • Revert back to a previous ruleset revision
vShield API documentation is available here.

Before we get into the API itself, let's look at what the firewall ruleset looks like. It's formatted as XML:


Application type="UNICAST">LDAP over SSL



Here are some notes about the XML configuration:
  • The API works mainly with container objects. A container can range from a datacenter or cluster all the way down to a port group or IP address.
  • Every container object must be listed in the <containerassociation> section.
  • Container objects have instance IDs. The instance ID is also referred to as the managed object ID (MOID)
  • Every firewall rule has its own ID as well as precedence and position fields.

If you want to edit a firewall ruleset, you must specify which ruleset you want. Every object has its own ruleset. So you could edit a ruleset at the datacenter level, cluster level, etc. all the way down to the port group level.

For simplicity let's work with the ruleset at the datacenter level because this will cover all VMs in that datacenter.

The first thing to do is get the object ID for the datacenter. If you don't already know this then you must look it up. There are two places you can find it:
  1. Use the Managed Object Browser in vCenter Server, located at https://<vcenter IP>/mob, e.g.,
  2. Query your vCenter Server with the vSphere SDK

Either way you must have access to vCenter Server.

Here is some perl code to query vCenter. The code assumes that $dc_name is set to the name of your datacenter. You can find this name in the vSphere client.
use VMware::RunTime;

$ENV{'VI_SERVER'} = $vc_ip;
$ENV{'VI_USERNAME'} = $vc_user;
$ENV{'VI_PASSWORD'} = $vc_pass;

# read/validate options and connect to the server

$view = Vim::find_entity_views(view_type => 'Datacenter');
foreach $datacenter (@$view) {
 if (lc($datacenter->{name}) eq lc($dc_name)) {
  return $datacenter->{mo_ref}->{value};

return "Not_found";

Note that this code requires you include the .pm files from the perl SDK. You can find these in lib/VMware/share/VMware/ in the perl SDK tarball.

Once you have the object ID for your datacenter, you can use it to create the vShield URL that you will need to access the datacenter's firewall ruleset:
$url = "https://" . $vsm_ip . "/api/1.0/zones/" . $moid . "/firewall/rules";
Note that this URL is to access the ruleset in vShield App. If you want to access the ruleset in vShield Edge instead, simply change the "zones" in the URL to "network". So the resulting vShield Edge URL looks like this:
$url = "https://" . $vsm_ip . "/api/1.0/network/" . $moid . "/firewall/rules";
Now that you have the URL, use it to get the ruleset with a simple HTTP GET using Basic Authentication:
$ua = LWP::UserAgent->new;
$request = HTTP::Request->new(GET =>$url);
$request->authorization_basic($vsm_user, $vsm_pass);
$response = $ua->request($request);

$response now contains the XML ruleset. Copy it to a variable such as $ruleset and use your favorite XML library to work directly with each rule. I found that using XML::LibXML provides the best routines for both parsing and editing the XML.

This code iterates through each rule, loading the source address and protocol into variables.
my $parser = new XML::LibXML;
my $tree = $parser->parse_string($ruleset);
my $root = $tree->documentElement();

foreach my $rule_ref ($root->findnodes('RuleSet/Rule')) {
 $rule_src = $rule_ref->findvalue('Source/@ref');
 $rule_prot = $rule_ref->findvalue('Protocol');

Note that the source address is accessible as an attribute named "ref" in the source tag. XPath syntax uses '@' to access XML attributes.

The vShield API has certain restrictions when it comes to adding firewall rules. You can't just add a rule to the existing ruleset. Every time you update the ruleset with new rules, you replace all of the old rules.

The proper way to add a rule is to load the existing rules into memory as an XML tree, add the new rules to the tree, then post the updated tree back as the new ruleset.

This sample code illustrates how to add a new rule. Note that only a few of the fields are included here but every field in the rule is required, with the exception of Notes. You will get an error if you leave out a required field.
my $rule_ref = XML::LibXML::Element->new("Rule");

my $id_el = XML::LibXML::Element->new("ID");

my $src_el = XML::LibXML::Element->new("Source");
$src_el->setAttribute("ref", $src_ip);
$src_el->setAttribute("exclude", "false");


my $rule_root = $root->findnodes('RuleSet')->get_node(1);

Here are some notes:
  • To add a new rule, specify an ID of 0. When vShield adds the new rule to the ruleset, it will automatically generate a new ID.
  • The Position field is required but you can set it to any value. I set it to a default of 50. vShield Manager rewrites this field every time you move rules around in the vShield GUI.
After you add the rule to the XML tree, you must also add a new container object for the IP addresses referenced by the rule:
my $contain_root = $root->findnodes('ContainerAssociation')->get_node(1);

my $contain_el = XML::LibXML::Element->new("Container");
$contain_el->setAttribute("id", $ip_addr);
my $ip_addr_el = XML::LibXML::Element->new("IPAddress");

When you're done updating the XML tree, post the complete ruleset:
$ua = LWP::UserAgent->new;
$request = HTTP::Request->new(POST=>$self->{url});
$request->authorization_basic($self->{vsm_user}, $self->{vsm_pass});
$response = $ua->request($request);