Using Chef to build out a Hadoop cluster

After not posting for a while, I have three or four posts that I'd like to get out there. The first is about using Chef to build a Hadoop cluster.

Chef is a configuration management tool that automates the process of provisioning servers. I had to create a Hadoop cluster of 4-5 servers, and I wanted to use this opportunity to automate the process with Chef.

I had to perform a series of the same steps on these Linux nodes:

Install ruby and chef
Install Sun Java
Install VMware Tools
Install NTP
Add its hostname to a shared /etc/hosts file
Configure passwordless ssh login

Installing Chef and Ruby

I followed the steps in this link.

The first step is to sign up for a Hosted Chef account on the Opscode site. An account is free for 5 nodes or fewer. Perform the following steps:

Create a new organization
Select "Generate knife config" to download knife.rb
Select "Regenerate validation key" to download (validator).pem
Click on your account, then click "get private key" to download (private key).pem

Then install Ruby and Chef on your first host. Once the first host is set up, you can quickly bootstrap the others from it. Install Ruby:

sudo apt-get update
sudo apt-get install ruby ruby-dev libopenssl-ruby rdoc ri irb build-essential wget ssl-cert git-core

Install rubygems:

cd /tmp
wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.10.tgz
tar zxf rubygems-1.8.10.tgz
cd rubygems-1.8.10
sudo ruby setup.rb --no-format-executable

Install chef:

sudo gem install chef
cd ~
git clone https://github.com/opscode/chef-repo.git
mkdir -p ~/chef-repo/.chef
cp (private key).pem ~/chef-repo/.chef
cp (validator).pem ~/chef-repo/.chef
cp knife.rb ~/chef-repo/.chef
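
Before moving on, it can help to confirm that the three downloaded files actually landed in ~/chef-repo/.chef. Here is a minimal sketch in plain Ruby. It assumes only that the directory should hold knife.rb plus two .pem files; the helper name missing_chef_files is made up for this post and is not part of Chef.

```ruby
# Hypothetical helper: reports what is missing from a .chef directory.
# Expects knife.rb plus two .pem files (private key and validator key).
def missing_chef_files(dir)
  problems = []
  problems << "knife.rb" unless File.file?(File.join(dir, "knife.rb"))
  pem_count = Dir.glob(File.join(dir, "*.pem")).size
  problems << "#{2 - pem_count} more .pem file(s)" if pem_count < 2
  problems
end

if __FILE__ == $PROGRAM_NAME
  dir = File.expand_path("~/chef-repo/.chef")
  missing = missing_chef_files(dir)
  if missing.empty?
    puts "#{dir} looks complete"
  else
    puts "Missing from #{dir}: #{missing.join(', ')}"
  end
end
```

An empty result only means the files exist; it does not check that the keys are the right ones for your organization.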

Connect to Hosted Chef and configure workstation as a client:

cd ~/chef-repo
knife configure client ./client-config
sudo mkdir -p /etc/chef
sudo cp -r ~/chef-repo/client-config/* /etc/chef
sudo chef-client

Once the client is installed on the first host, you can bootstrap the clients on the other hosts with the command below. This assumes you have already created a user called hadoop on each node to act as the main Hadoop user.

knife bootstrap (node IP) -x hadoop -P (password) --sudo

Repeat this for all of your other chef nodes.
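
With more than a couple of nodes, I find it easier to generate the bootstrap commands than to retype them. The sketch below is plain Ruby that only prints the command lines. The IP addresses are placeholders, and HADOOP_PASSWORD is an environment variable I am assuming you export yourself so the password never appears in the generated text.

```ruby
require "shellwords"

# Builds one `knife bootstrap` command line per node IP.
# The password is left as a shell variable reference ($HADOOP_PASSWORD,
# an assumption of this sketch) rather than being embedded in the output.
def bootstrap_commands(ips, user: "hadoop")
  ips.map do |ip|
    "knife bootstrap #{Shellwords.escape(ip)} -x #{Shellwords.escape(user)} " \
      "-P \"$HADOOP_PASSWORD\" --sudo"
  end
end

# Placeholder addresses; substitute your own nodes.
puts bootstrap_commands(%w[192.168.1.12 192.168.1.13 192.168.1.14])
```

You can review the printed lines and then paste them into a shell where HADOOP_PASSWORD is set.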

Installing some Chef recipes

Now that Chef is installed on all the nodes, it's time to run some Chef recipes. A recipe is a set of configuration instructions, and recipes are grouped into cookbooks. In my case, I want to install some packages. I started with VMware Tools, Sun Java, and NTP.
Start by creating a new cookbook:

knife cookbook create MYCOOKBOOK

Then download some existing cookbooks from the Opscode community cookbook site.

knife cookbook site install vmtools
knife cookbook site install java
knife cookbook site install ntp

Upload the cookbooks to the Chef server (knife cookbook upload vmtools java ntp), then add these recipes to each node's run list:

knife node run_list add NODE_NAME "recipe[java::sun]"
knife node run_list add NODE_NAME "recipe[vmtools]"
knife node run_list add NODE_NAME "recipe[ntp]"

You'll then need to run "sudo chef-client" on each node to execute the run list and install these packages.
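
Since the same three run_list commands have to be repeated for every node, here is a small sketch that prints them all. It is plain Ruby, not a knife plugin. The node names are hypothetical, and the recipe names use the cookbook::recipe form that run lists expect.

```ruby
# Builds `knife node run_list add` commands for every node/recipe pair.
def run_list_commands(nodes, recipes)
  nodes.flat_map do |node_name|
    recipes.map { |recipe| %(knife node run_list add #{node_name} "recipe[#{recipe}]") }
  end
end

nodes = %w[node1 node2 node3]          # hypothetical node names
recipes = %w[java::sun vmtools ntp]    # recipes used in this post
puts run_list_commands(nodes, recipes)
```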

Populate /etc/hosts

The next step is to create a recipe that will populate the /etc/hosts file from the list of nodes held by the Chef server. Hadoop needs every node to be able to resolve the names of all the others, and without DNS that means keeping a name-IP mapping for every node in the cluster in /etc/hosts. The easiest way to do this is to populate /etc/hosts from the list of hosts that Chef knows about.
So start by creating a new cookbook. I call it "hosts":

knife cookbook create hosts

Your cookbooks directory will now have a subdirectory called hosts with some skeleton files already created. Write the default recipe in hosts/recipes/default.rb:

# Gets the list of all nodes known to the Chef server and rewrites /etc/hosts
hosts = {}

# Index each node's attributes by IP address
search(:node, "name:*", %w(ipaddress fqdn)) do |n|
  hosts[n["ipaddress"]] = n
end

# Render templates/default/hosts.erb with the collected hosts
template "/etc/hosts" do
  source "hosts.erb"
  mode 0644
  variables(:hosts => hosts)
end
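
The interesting part of the recipe is the hash keyed by IP address. The sketch below reproduces that step in plain Ruby, with ordinary hashes standing in for the node objects that search yields. The filter that skips entries lacking an ipaddress or fqdn is my own assumption, not something the recipe above does; whether your search results can contain such entries depends on your setup.

```ruby
# Stand-in for what the search block does: index node data by IP address.
# Entries without an ipaddress or fqdn are skipped (an assumption of this
# sketch, meant to keep blank lines and nil keys out of /etc/hosts).
def hosts_by_ip(nodes)
  nodes.each_with_object({}) do |n, hosts|
    next if n["ipaddress"].to_s.empty? || n["fqdn"].to_s.empty?
    hosts[n["ipaddress"]] = n
  end
end

# Made-up sample data in place of real search results.
sample_nodes = [
  { "ipaddress" => "10.0.0.1", "fqdn" => "node1.example.com" },
  { "ipaddress" => nil,        "fqdn" => "node2.example.com" },
  { "ipaddress" => "10.0.0.3", "fqdn" => "node3.example.com" }
]
puts hosts_by_ip(sample_nodes).keys.inspect
```

Because the hash is keyed by IP, two nodes reporting the same address would collapse into a single entry, with the later one winning.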

Now create the template, stored in hosts/templates/default/hosts.erb:

127.0.0.1 localhost
<% @hosts.keys.sort.each do |ip| %>
<%= ip %> <%= @hosts[ip]["fqdn"] %>
<% end %>
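
To preview what this template produces without touching a real /etc/hosts, you can render it with Ruby's standard ERB library. This is only a sketch: the template text is copied inline, the sample hosts are made up, and the render_hosts helper is hypothetical. Chef does its own rendering, so treat this as an approximation of the output rather than what Chef runs internally.

```ruby
require "erb"

# The template from this post, copied inline for the preview.
def hosts_template
  <<~'TEMPLATE'
    127.0.0.1 localhost
    <% @hosts.keys.sort.each do |ip| %>
    <%= ip %> <%= @hosts[ip]["fqdn"] %>
    <% end %>
  TEMPLATE
end

# Renders the template with `hosts` exposed as @hosts, the way the
# recipe's variables(:hosts => hosts) exposes it to the template.
def render_hosts(hosts)
  context = Object.new
  context.instance_variable_set(:@hosts, hosts)
  ERB.new(hosts_template, trim_mode: "-").result(context.instance_eval { binding })
end

# Made-up sample data in place of real search results.
sample = {
  "10.0.0.2" => { "fqdn" => "node2.example.com" },
  "10.0.0.1" => { "fqdn" => "node1.example.com" }
}
puts render_hosts(sample)
```

Note that keys.sort compares the addresses as strings, so 10.0.0.10 would be listed before 10.0.0.2. That ordering is cosmetic and does not affect name resolution.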

Upload the cookbook to the Chef server, then add it to each node's run list:

knife cookbook upload hosts
knife node run_list add NODE_NAME "recipe[hosts]"

Don't forget to run "sudo chef-client" on each node.

Configuring passwordless ssh login

A Hadoop cluster requires passwordless ssh login between the master and its slave nodes. The easiest way to do this is to have each node create its own SSH key pair with an empty passphrase, and then copy the public keys for all nodes to the master node.

So create a cookbook that generates the SSH key with an empty passphrase. I call it "sshlogin":

knife cookbook create sshlogin

Create your default ruby script in sshlogin/recipes/default.rb:

# Generate an RSA key pair with an empty passphrase (skipped if the key already exists)
execute "ssh-keygen" do
  command "sudo -u hadoop ssh-keygen -q -t rsa -N '' -f /home/hadoop/.ssh/id_rsa"
  creates "/home/hadoop/.ssh/id_rsa"
  action :run
end

# Copy public key to node1; if key doesn't exist in authorized_keys, append it to this file
execute <<EOF
cat /home/hadoop/.ssh/id_rsa.pub | sudo -u hadoop ssh hadoop@node1 "(cat > /tmp/tmp.pubkey; mkdir -p .ssh; touch .ssh/authorized_keys; grep #{node[:fqdn]} .ssh/authorized_keys > /dev/null || cat /tmp/tmp.pubkey >> .ssh/authorized_keys; rm /tmp/tmp.pubkey)"
EOF
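
The remote half of that command is an "append only if missing" check: grep for the node's FQDN in authorized_keys, and append the key only when nothing matches. The same logic is easier to read as plain Ruby working on a local file. This is a sketch for illustration, not a replacement for the recipe; the function name append_key_once is made up.

```ruby
# Appends public_key to the file at path unless a line containing marker
# (the node's FQDN in the recipe) is already present.
# Returns true if the key was appended, false if it was already there.
def append_key_once(path, public_key, marker)
  existing = File.exist?(path) ? File.readlines(path, chomp: true) : []
  return false if existing.any? { |line| line.include?(marker) }

  File.open(path, "a") { |f| f.puts(public_key) }
  true
end
```

Calling it twice with the same key leaves a single line in the file, which is what makes repeated chef-client runs safe. The check depends on the marker appearing in the key's comment field, so it is worth confirming that your generated id_rsa.pub actually ends with the node's FQDN and not a short hostname.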

Note that when you run this recipe on each host, it will prompt you for the hadoop user's password on node1 each time, because the recipe pipes the key to the master node over ssh.
Now you can deploy this recipe:

knife cookbook upload sshlogin
knife node run_list add node2 "recipe[sshlogin]"

Type this command to run the recipes on each host:

sudo chef-client

