
Using chef to build out a Hadoop cluster

After not posting for a while, I have about 3-4 posts that I'd like to get out there. The first is about using chef to build a Hadoop cluster.

Chef is a configuration management tool that allows one to automate the process of provisioning servers. I had to create a Hadoop cluster of 4-5 servers and I wanted to use this opportunity to automate the process with chef.

I had to perform the same series of steps on each of these Linux nodes:
  • Install ruby and chef
  • Install Sun Java
  • Install VMware Tools
  • Install NTP
  • Add its hostname to a shared /etc/hosts file
  • Configure passwordless ssh login

Installing Chef and Ruby

I followed the steps in this link.

The first step is to sign up for a Hosted Chef account on the Opscode site. An account is free for five nodes or fewer. Perform the following steps:
  1. Create a new organization
  2. Select "Generate knife config" to download knife.rb
  3. Select "Regenerate validation key" to download (validator).pem
  4. Click on your account and click "get private key" to download (private key).pem
  5. Install Ruby and Chef on your first host. Once the first host is set up, you can quickly bootstrap the others.

    Install Ruby:
    sudo apt-get update
    sudo apt-get install ruby ruby-dev libopenssl-ruby rdoc ri irb build-essential wget ssl-cert git-core
    Install rubygems (the commands below assume rubygems-1.8.10.tgz has already been downloaded to /tmp):
    cd /tmp
    tar zxf rubygems-1.8.10.tgz
    cd rubygems-1.8.10
    sudo ruby setup.rb --no-format-executable
    Install chef:
    sudo gem install chef
    cd ~
    git clone
    mkdir -p ~/chef-repo/.chef
    cp (private key).pem ~/chef-repo/.chef
    cp (validator).pem ~/chef-repo/.chef
    cp knife.rb ~/chef-repo/.chef
    Connect to Hosted Chef and configure workstation as a client:
    cd ~/chef-repo
    knife configure client ./client-config
    sudo mkdir /etc/chef
    sudo cp -r ~/chef-repo/client-config/* /etc/chef
    sudo chef-client
    Once the client is installed on the first host, you can bootstrap the clients on the other hosts with the command below, as described here. This assumes you have created a user called hadoop, who is the main Hadoop user.
    knife bootstrap (node IP) -x hadoop -P (password) --sudo
    Repeat this for all of your other chef nodes.
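Repeating the bootstrap by hand gets tedious beyond a couple of nodes. Here is a minimal sketch in plain Ruby that builds the same `knife bootstrap` command for a list of nodes and prints the commands for review instead of running them. The IP addresses and the password are made-up placeholders, and the helper names are my own, not part of knife.

```ruby
require "shellwords"

# Hypothetical node addresses; replace with the IPs of your own nodes.
NODE_IPS = %w[192.168.1.11 192.168.1.12 192.168.1.13].freeze

# Build the knife bootstrap command for one node, using the same flags as above.
# shelljoin escapes each argument so a password with spaces or quotes stays intact.
def bootstrap_command(ip, user:, password:)
  ["knife", "bootstrap", ip, "-x", user, "-P", password, "--sudo"].shelljoin
end

# Build one command per node.
def bootstrap_commands(ips, user: "hadoop", password: "changeme")
  ips.map { |ip| bootstrap_command(ip, user: user, password: password) }
end

# Dry run: print the commands so they can be checked before pasting into a shell.
puts bootstrap_commands(NODE_IPS)
```

Printing rather than executing keeps the sketch safe to try; once the output looks right, the commands can be run from ~/chef-repo.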

Installing some Chef recipes

    Now that chef is installed on all the nodes, it's time to run some chef recipes. A recipe is a set of configuration instructions. In my case, I want to install some packages. I started with VMware Tools, Sun Java, and NTP.
    Start by creating a new cookbook:
    knife cookbook create MYCOOKBOOK
    Then download some existing cookbooks from the Chef Repository.
    knife cookbook site install vmtools
    knife cookbook site install java
    knife cookbook site install ntp
    Add these recipes to each node's run list:
    knife node run_list add NODE_NAME "recipe[java::sun]"
    knife node run_list add NODE_NAME "recipe[vmtools]"
    knife node run_list add NODE_NAME "recipe[ntp]"
    You'll then need to run "sudo chef-client" on each node to execute the run list and install these packages.
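A run-list entry names either a whole cookbook, as in recipe[ntp], or one recipe inside it using a double colon, as in recipe[java::sun]. As a rough sketch, the plain Ruby below checks entries against that shape and then generates the `knife node run_list add` command for every node and recipe. The regular expression is my own approximation, not Chef's parser, and the node names are hypothetical.

```ruby
# Approximate shape of a recipe entry: recipe[cookbook] or recipe[cookbook::recipe].
# Illustrative only; Chef's own parser accepts more (for example version pins).
RECIPE_ENTRY = /\Arecipe\[[\w-]+(?:::[\w-]+)?\]\z/

def valid_recipe_entry?(entry)
  RECIPE_ENTRY.match?(entry)
end

# Build one knife command per (node, entry) pair, rejecting malformed entries first.
def run_list_commands(nodes, entries)
  bad = entries.reject { |entry| valid_recipe_entry?(entry) }
  raise ArgumentError, "malformed run-list entries: #{bad.join(', ')}" unless bad.empty?

  nodes.product(entries).map do |node, entry|
    %(knife node run_list add #{node} "#{entry}")
  end
end

nodes = %w[node1 node2] # hypothetical node names
entries = ["recipe[java::sun]", "recipe[vmtools]", "recipe[ntp]"]
puts run_list_commands(nodes, entries)
```

A single-colon entry such as recipe[java:sun] fails the check, which catches the typo before it reaches the Chef server.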

Populate /etc/hosts

    The next step is to create a recipe that will populate the /etc/hosts file from the Chef repository. One of Hadoop's requirements is to store the name-IP mapping for every node in the cluster in /etc/hosts. The easiest way to do this is to populate /etc/hosts from the list of hosts that Chef knows about.
    So start by creating your new recipe in your cookbook. I call it "hosts":
    knife cookbook create hosts
    Your cookbook will now have a subdirectory called hosts with some skeleton files already created. Create your default ruby script in hosts/recipes/default.rb:
    # Gets list of names from all nodes in repository and rewrites /etc/hosts
    hosts = {}
    localhost = nil
    search(:node, "name:*", %w(ipaddress fqdn)) do |n|
     hosts[n["ipaddress"]] = n
    end
    template "/etc/hosts" do
     source "hosts.erb"
     mode 0644
     variables(:hosts => hosts)
    end
    Now edit the template, stored in hosts/templates/default/hosts.erb:
    127.0.0.1 localhost
    <% @hosts.keys.sort.each do |ip| %>
    <%= ip %> <%= @hosts[ip]["fqdn"] %>
    <% end %>
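To preview what the template produces without touching a real /etc/hosts, it can be rendered with Ruby's standard ERB library. This is only a sketch: the sample IPs and hostnames are invented, the 127.0.0.1 localhost line is my assumption about the first line of the template, and Chef's own template rendering may treat blank lines differently.

```ruby
require "erb"

# Same loop as hosts.erb; the localhost line is an assumed first line.
HOSTS_TEMPLATE = <<~'HOSTS_ERB'
  127.0.0.1 localhost
  <% @hosts.keys.sort.each do |ip| %>
  <%= ip %> <%= @hosts[ip]["fqdn"] %>
  <% end %>
HOSTS_ERB

# Render the template with @hosts set, mimicking variables(:hosts => hosts).
def render_hosts(hosts)
  context = Object.new
  context.instance_variable_set(:@hosts, hosts)
  ERB.new(HOSTS_TEMPLATE).result(context.instance_eval { binding })
end

# Invented sample data shaped like the search results: IP => node attributes.
sample = {
  "10.0.0.2" => { "fqdn" => "node2.example.com" },
  "10.0.0.1" => { "fqdn" => "node1.example.com" }
}

puts render_hosts(sample)
```

Note that the keys are sorted as strings, so 10.0.0.10 would be listed before 10.0.0.2. That does not affect name resolution, only the order of lines in the file.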
    Upload the cookbook to the Chef server, then add the recipe to each node's run list:
    knife cookbook upload hosts
    knife node run_list add NODE_NAME "recipe[hosts]"
    Don't forget to run "sudo chef-client" on each node.

Installing passwordless ssh login

    A Hadoop cluster requires passwordless ssh login between the master and its slave nodes. The easiest way to do this is to have each node create its own SSH keys with an empty password, and then copy the public keys for all nodes to the master node.

    So create a recipe to create the SSH key with empty password. I call it "sshlogin":
    knife cookbook create sshlogin
    Create your default ruby script in sshlogin/recipes/default.rb:
    # Create an RSA key pair with an empty passphrase
    execute "ssh-keygen" do
      command "sudo -u hadoop ssh-keygen -q -t rsa -N '' -f /home/hadoop/.ssh/id_rsa"
      creates "/home/hadoop/.ssh/id_rsa"
      action :run
    end
    # Copy public key to node1; if key doesn't exist in authorized_keys, append it to this file
    execute <<EOF
    cat /home/hadoop/.ssh/ | sudo -u hadoop ssh hadoop@node1 "(cat > /tmp/tmp.pubkey; mkdir -p .ssh; touch .ssh/authorized_keys; grep #{node[:fqdn]} .ssh/authorized_keys > /dev/null || cat /tmp/tmp.pubkey >> .ssh/authorized_keys; rm /tmp/tmp.pubkey)"
    EOF
    Note that when you run this recipe on each host, it will prompt you for the hadoop user's password on node1, because you are essentially copying the key to this master node over ssh.
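The remote half of that command is an append-if-missing check: the key is only added to authorized_keys when no existing line mentions the node's FQDN, so rerunning the recipe does not pile up duplicates. Here is the same logic as a plain Ruby sketch, run against a temporary directory. The function name and the key string are placeholders of mine, not something the recipe uses.

```ruby
require "fileutils"
require "tmpdir"

# Append pubkey_line to the authorized_keys file at path unless a line
# mentioning fqdn is already there. Returns true if the key was added.
def append_key_once(path, pubkey_line, fqdn)
  FileUtils.mkdir_p(File.dirname(path)) # like: mkdir -p .ssh
  FileUtils.touch(path)                 # like: touch .ssh/authorized_keys
  return false if File.foreach(path).any? { |line| line.include?(fqdn) }

  File.open(path, "a") { |f| f.puts(pubkey_line) }
  true
end

# Demo in a throwaway directory, with a fake key for a hypothetical node.
Dir.mktmpdir do |dir|
  path = File.join(dir, ".ssh", "authorized_keys")
  key = "ssh-rsa AAAAexamplekey hadoop@node2.example.com"
  puts append_key_once(path, key, "node2.example.com") # first run adds the key
  puts append_key_once(path, key, "node2.example.com") # second run is a no-op
end
```

Like the grep in the recipe, this matches on the FQDN appearing anywhere in a line, so it relies on the key's comment field containing that name.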
    Now you can deploy this recipe:
    knife cookbook upload sshlogin
    knife node run_list add node2 "recipe[sshlogin]"
    Type this command to run the recipes on each host:
    sudo chef-client

