Hadoop/HBase automated deployment using Puppet



Deploying and configuring Hadoop and HBase across clusters is a complex task. In this article I will show what we do to make it easier, and share the deployment recipes that we use.

For the tl;dr crowd: go get the code here.

Cool tools

Before going into how we do things, here is the list of tools that we are using, and which I will mention in this article. I will try to put a link next to any tool-specific term, but you can always refer to its specific home-page for further reference.

  • Hudson – this is a great CI server, and we are using it to build Hadoop, HBase, Zookeeper and more
  • The Hudson Promoted Builds Plug-in – allows defining operations that run after the build has finished, manually or automatically
  • Puppet – configuration management tool

We don’t have a dedicated operations team to which we can hand a list of instructions describing how we want our machines to look. The operations team helping us just makes sure the servers are in the rack, networked and powered up, but once we have a set of IPs (usually from IPMI cards) we’re good to go ourselves. We are our own devops team, and as such we try to automate as much as possible, and using the tools above helps a lot.


When we started with hstack, we did everything manually – created users, tweaked system properties and deployed Hadoop/HBase. By far the most time-consuming part was making sure the configuration files were the same on all machines. We wrote a basic shell script that we ran on the machine where we had modified the configuration files, and it used scp to copy the files to the rest of the machines.

Once the initial excitement wore off, we looked for a better way. The tools we looked at first were Chef and Puppet. We eventually chose Puppet, which at the time was richer and much better maintained. We went through their site wiki for documentation, used the fantastic Pulling Strings with Puppet book, and in under a week we had created puppet recipes that could take an empty machine and turn it into an hstack node. Any functionality that we decided to implement later on (like the NameNode High-availability recipe) got its own puppet manifests right from the start.

How we use Puppet

We tend to overuse puppet a bit, by using it as both our configuration management and deployment tool. To illustrate, these are the steps we take when deploying a new version of hstack to our cluster:

  1. Trigger a build in Hudson for Hadoop, HBase or anything else we want to deploy.
  2. Click the link next to the newly completed build to push the resulting archives to the Puppet Master repository.
  3. Using ssh, we start up the Puppet clients on each machine. In the future we want to keep them running at all times, but since we’re doing development on the current cluster, a running Puppet tends to mess with our tests (restarting daemons when we kill them for testing is an example).
  4. The last step is to trigger the change in the configuration. The Puppet Master pulls the configuration from a git repository, so we just have to change the version number and push the new file back to source control.

The heavy lifting in all of this process is done by Puppet – it figures out what nodes need to be upgraded, pushes the new archives, changes configuration files and restarts services. All in all you can just do the steps above, go for a coffee and when you come back the cluster is running the new version.

We are now open-sourcing, on GitHub, the puppet recipes for:

  • creating the user under which the entire hstack runs.
  • changing system settings, like the ssh keys, authorizing machines to talk to each other, aliases for hadoop and hbase executables, /tmp rules.
  • standalone puppet module to deploy Hadoop
  • standalone puppet module to configure the Hadoop NameNode in High-Availability mode via DRBD, heartbeat and mon. For more details on this recipe check out the Cloudera blog post on this topic.
  • standalone puppet module to deploy HBase
  • standalone puppet module to deploy Zookeeper.

All of the above are what we use day-to-day, and have been tested on CentOS 5.3 with puppet 0.25.1 and 0.25.4. To use them you need a functional puppet master (see details on how to configure puppet’s daemons here). Once you have that working, drop the code from our repository on top of your puppet master, and read on for the various options of each module / recipe.


The repository contains the three typical puppet folders:

  • manifests – aside from the mandatory site.pp and nodes.pp files, our repository also stores in this folder some recipes that weren’t big enough to warrant a module, as well as some utility definitions.
  • templates – this folder stores configuration templates for the user and system. Hadoop and HBase templates are stored with their respective modules. Here you will have to add your own ssh keys and tweak system variables.
  • modules – the bulk of the recipes lies here. A Puppet module is a self-contained unit that does one thing only. We named each folder by its purpose: hadoop, hbase, mon, zookeeper.

To do a full deployment using these recipes, your nodes have to include both the general user and system recipes and the modules. To adapt the final outcome to your specific cluster setup there are a few variables that you can set, which I will explain for each module in turn.

Basic node setup

In this brief section I will explain what needs to be defined in the puppet recipe in order to perform the basic setup of a machine (getting a dedicated user account to run hstack and setting the correct ssh permissions). The recipes we’re going to use are stored in manifests/virtual.pp and manifests/environment.pp.

First up, adding a new user to the system. The recipe for creating users does not work fully out of the box – you need to give it your own ssh keys to use. To do so, replace the files under templates/ssh_keys/keys with actual content (they’re just dummy files right now). You might also want to change the password – you need to write the already encrypted version into manifests/virtual.pp. The default password is hadoop, although it might not seem so.

Another customization you can make is to the system settings, which are controlled through the environment.pp file. In this file you can change the maximum number of file descriptors (default 200,000), aliases for commands and the tmpwatch (http://linux.about.com/library/cmd/blcmdl8_tmpwatch.htm) daemon configuration.
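After a Puppet run it can be handy to confirm that the file descriptor limit really reached the node. The snippet below is a minimal sketch, not part of our recipes: it assumes the 200,000 default mentioned above and uses hypothetical helper names. Run it as the hstack user, since limits are applied per user session.

```python
import resource

# Assumed target: the default number of file descriptors mentioned above.
EXPECTED_NOFILE = 200000


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(soft, expected=EXPECTED_NOFILE):
    """True if the soft limit is unlimited or at least the expected value."""
    return soft == resource.RLIM_INFINITY or soft >= expected


if __name__ == "__main__":
    soft, hard = nofile_limits()
    status = "ok" if limit_is_sufficient(soft) else "too low"
    print(f"open files: soft={soft} hard={hard} ({status})")
```

If the check reports "too low", the new limit has probably not been picked up by the current login session yet.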

Once this is done, you can add your first node, run the configuration and enjoy the results – a brand new user ready to take on Hadoop. :)

Deploying Hadoop

The hadoop module has options and configuration files that you need to change to suit your setup. The most important is the modules/hadoop/templates/conf/ folder, which stores the hadoop (core, hdfs and map-reduce) configuration templates. In it you can create a sub-folder for each separate environment you want to manage, which lets you keep completely different templates and use the same code to push them to their respective locations. By default only the critical properties are filled in: some have literal values, but most use variables (you can spot them by the <%= %> marks). These variables can be set on a per-node basis, but I just define a base node with the common values and have all the other nodes in a cluster extend it. Some of the variables that you need to set are:

  • $hadoop_version – the version of hadoop to deploy. The recipes assume that the version is the part of the tar name after hadoop-. For example, if your generated archive is called hadoop-core-0.21.0-31, the version is “core-0.21.0-31”. Another assumption is that the archive contains a folder with the same name, which will be unpacked in the user’s home (/home/hadoop) and then symlinked to the final $HADOOP_HOME destination. This tactic allows you to retain all versions ever deployed, and you can easily switch back to an older one by just re-creating the symbolic link to its folder.
  • $hadoop_home – final directory to which the unpacked hadoop will be symlinked
  • $environment – match this with the subfolder in the templates/conf that you want to use
  • $hadoop_default_fs_name – the URI of the NameNode; this gets pushed into hdfs-site.xml and core-site.xml
  • $hadoop_namenode_dir – the folder where the NameNode stores its files
  • $hadoop_datastore – the folders where the datanode stores its files. This is expressed as a list: ["/var/folder1", "/var/folder2"]
  • $mapred_job_tracker – URI of the Hadoop Map-Reduce job-tracker.
  • $hadoop_mapred_local – the local folders where the TaskTrackers should store their files. This is also a list of folders.
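The naming convention and the symlink tactic behind $hadoop_version and $hadoop_home can be illustrated outside Puppet. The following is a rough Python sketch, not the code the recipes run: the function names are hypothetical, and stripping a .tar.gz / .tgz / .tar suffix is my assumption about how the archive is named on disk.

```python
import os
import tempfile


def version_from_archive(archive_name, prefix="hadoop-"):
    """Return the part of the archive name after the prefix, minus any tar suffix."""
    base = os.path.basename(archive_name)
    for suffix in (".tar.gz", ".tgz", ".tar"):  # assumed archive suffixes
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    if not base.startswith(prefix):
        raise ValueError(f"{base!r} does not start with {prefix!r}")
    return base[len(prefix):]


def switch_version(install_root, home_link, version, prefix="hadoop-"):
    """Point home_link at install_root/<prefix><version>, replacing any old link."""
    target = os.path.join(install_root, prefix + version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)
    tmp_link = home_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, home_link)  # renames the link itself; atomic on POSIX
    return target


if __name__ == "__main__":
    # Demo in a throwaway directory standing in for /home/hadoop.
    with tempfile.TemporaryDirectory() as root:
        for v in ("core-0.21.0-30", "core-0.21.0-31"):
            os.mkdir(os.path.join(root, "hadoop-" + v))
        link = os.path.join(root, "hadoop_home")
        switch_version(root, link, version_from_archive("hadoop-core-0.21.0-31.tar.gz"))
        switch_version(root, link, "core-0.21.0-30")  # roll back to the older version
        print(os.readlink(link))
```

Because every unpacked version stays on disk, a rollback is just a second call to switch_version with the older version string.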

Hadoop HDFS High Availability

To use this you need to dedicate two machines to the NameNode role and set some other variables (on both machines):

  • $virtual_ip – the common IP address that both computers will be sharing
  • $hostname_primary | $hostname_secondary – the hostnames of the two machines to use
  • $ip_primary | $ip_secondary – the IP addresses of both machines
  • $disk_dev_primary | $disk_dev_secondary – the device to use as the starting partition for drbd
  • $hadoop_namenode_dir – where the namenode will store its files; this will also be the mount point for the newly created /dev/drbd

Deploying Zookeeper

A standalone zookeeper deployment is recommended for HBase. You should have at least 3 servers running zookeeper – if your cluster is lightly loaded, you can just use 3 of your regionservers. To define a zookeeper node, import the zookeeper module and set these variables:

  • $zookeeper_version – same as for hadoop, the version to deploy
  • $zookeeper_datastore – where to store the zookeeper data files
  • $zookeeper_datastore_log – where to store the zookeeper log
  • $zookeeper_myid – this needs to be configured for each individual node, as it assigns a unique id to the machine
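To make the role of $zookeeper_myid concrete: each id has to match a server.N line in the ZooKeeper configuration shared by the whole ensemble. The sketch below is only an illustration with hypothetical hostnames; it uses ZooKeeper's customary 2888 and 3888 ports, and it does not claim to reproduce how our templates generate the file.

```python
def zookeeper_layout(hosts, peer_port=2888, election_port=3888):
    """Assign ids 1..N to hosts and build the matching server.N config lines."""
    if len(set(hosts)) != len(hosts):
        raise ValueError("each host may appear only once")
    ids = {host: i for i, host in enumerate(hosts, start=1)}
    lines = [
        f"server.{i}={host}:{peer_port}:{election_port}"
        for host, i in ids.items()
    ]
    return ids, lines


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    ensemble = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
    ids, lines = zookeeper_layout(ensemble)
    for host, myid in ids.items():
        print(f"{host}: myid = {myid}")
    print("\n".join(lines))
```

The value printed for each host is what you would set as $zookeeper_myid on that node, while the server lines are identical on every member.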

Deploying HBase

The HBase deployment is very similar to the Hadoop one. You need to adjust the HBase configuration stored under modules/hbase/templates/conf/$environment to add or remove any properties specific to your installation. In the configuration files you will spot some of the variables you have to set, like:

  • $hbase_version – what version to deploy; similar to the Hadoop case, this is part of the archive and folder name. If you use the default build script, it should pack it just right
  • $hbase_home – final directory to which the unpacked hbase will be symlinked
  • $zookeeper_quorum – comma separated list of zookeeper nodes
  • $hbase_rootdir – HBase directory in HDFS, usually /hbase
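As an illustration of where $zookeeper_quorum and $hbase_rootdir end up, here is a small Python sketch that renders the two corresponding hbase-site.xml properties (hbase.zookeeper.quorum and hbase.rootdir). It is not the template from the repository: joining the NameNode URI with the /hbase directory is my assumption, and the hostnames and port are made up.

```python
import xml.etree.ElementTree as ET


def hbase_site(zookeeper_nodes, default_fs_name, rootdir="/hbase"):
    """Render a minimal hbase-site.xml holding the quorum and root directory."""
    props = {
        # Assumption: the root dir is the NameNode URI plus the HDFS directory.
        "hbase.rootdir": default_fs_name.rstrip("/") + rootdir,
        "hbase.zookeeper.quorum": ",".join(zookeeper_nodes),
    }
    config = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(config, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical hosts and port, for illustration only.
    print(hbase_site(
        ["zk1.example.com", "zk2.example.com", "zk3.example.com"],
        "hdfs://namenode.example.com:9000",
    ))
```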

This was a quick run through some, but not all, of the options that each recipe provides and requires. I strongly encourage you to go look in the provided nodes.pp, which has a sample node configured with all of the options. Also, as a best practice, when adding your own properties to the template configuration files, you should try to use variables and set those on a per-node basis where applicable.
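To show why variables in templates pay off, here is a toy Python model of the idea: a base node holds the common values, individual nodes override what differs, and the same template is rendered for each. Puppet itself renders templates with Ruby's ERB; this sketch only mimics the substitution of simple <%= name %> marks, and the property, values and helper names are hypothetical.

```python
import re

# Matches simple marks such as <%= hadoop_version %>; real ERB supports far more.
_MARK = re.compile(r"<%=\s*(\w+)\s*%>")


def render(template, variables):
    """Replace each <%= name %> mark with the value of that variable."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"template variable {name!r} is not set")
        return str(variables[name])
    return _MARK.sub(substitute, template)


def node_variables(base, overrides):
    """Model a node extending a base node: overrides win over common values."""
    return {**base, **overrides}


if __name__ == "__main__":
    template = "<value><%= hadoop_default_fs_name %></value>"
    base_node = {"hadoop_default_fs_name": "hdfs://namenode.example.com:9000"}
    test_node = node_variables(base_node, {"hadoop_default_fs_name": "hdfs://test-nn.example.com:9000"})
    print(render(template, base_node))
    print(render(template, test_node))
```

A hard-coded value in the template would force a separate template per cluster; with a variable, one template serves every node.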


Using puppet to deploy the entire stack in an easy, predictable way has helped us a lot: we no longer delay cluster-wide upgrades just because they would be too hard to do. If you are deploying hstack on your clusters, and decide to use these recipes, let us know if you find any bugs. If you’re using another way of pushing data and configuration to your cluster, we’d like to hear about it as well in the comments.

Written by Cristian Ivascu

May 21st, 2010 at 4:38 pm

27 Responses to 'Hadoop/HBase automated deployment using Puppet'


  1. […] Hadoop/HBase automated deployment using Puppet at hstack (tags: puppet hadoop deployment) This was written by andy. Posted on Saturday, May 29, 2010, at 1:04 am. Filed under Delicious. Bookmark the permalink. Follow comments here with the RSS feed. Post a comment or leave a trackback. […]

  2. Hey guys,
    In addition to GitHub, you may want to look into publishing your Puppet modules on the newly launched Puppet Forge:


    Great post, BTW. My team has a similar setup, except that we use CruiseControl.rb for CI.

    Kevin Stewart

    29 May 10 at 8:43 PM

  3. An Amazon EC2 AMI that supports all this would be amazing. Not sure if it is possible to do the high-availability name server though.

    Sam Pullara

    30 May 10 at 3:59 AM

  4. Hey Kevin,
    That’s a good idea. Will publish it. We actually looked over your recipes when we started to play with Puppet ;)

    Cosmin Lehene

    30 May 10 at 5:55 PM

  5. Sam, we’re not working actively with EC2, though it would be cool to try. Regarding the HA setup, you could ask @linbit_drbd(twitter).
    Also crawl from this article http://www.oreillynet.com/xml/blog/2008/05/awsec2_preparing_for_ec2_persi.html

    Cosmin Lehene

    30 May 10 at 6:00 PM

  6. Thanks for allowing the community to view your Puppet config! Puppet for the win

    CentOS 5.5

    3 Jun 10 at 1:30 PM

  7. You can avoid triggering updates using SSH by setting your clients in listen/push mode, so they only update their configuration when told to do so by the puppet master node.

  8. Hi guys, I work for Puppet and we recently launched a module site (http://forge.puppetlabs.com) and would love to have this content on there. Really excited about this. Can you shoot me an email at michael at puppetlabs.com? I can help you out with info if you’re interested.

    Michael DeHaan

    9 Jun 10 at 7:58 PM

  9. I see Kevin beat me to it, good job Kevin :) … Anyway, please do shoot me an email!

    Michael DeHaan

    9 Jun 10 at 7:59 PM

  10. Cristi that’s a good point, thanks! BTW ping us if you get in Bucharest.

    Cosmin Lehene

    9 Jun 10 at 11:42 PM

  11. @Cristi Yes, push configuration has a ton of benefits – reduced chatter being the first I can think of


    10 Jun 10 at 8:38 AM

  12. […] Labs announced on Thursday that Adobe Systems is publishing code for managing Hadoop on the Puppet Forge community development site. (Disclosure: I am an adviser to Puppet […]

  13. […] Labs announced on Thursday that Adobe Systems is publishing code for managing Hadoop on the Puppet Forge community development site. (Disclosure: I am an adviser to Puppet […]

  14. tl;dr :-)

    However, I do notice that the link to the wikipedia article about this abbreviation is broken: the “‘t read” part of the link is missing. Maybe because of the apostrophe? If yes, one of the available shortcut links will fix this: http://en.wikipedia.org/wiki/Wikipedia:TLDR for instance.


    1 Jul 10 at 11:30 AM

  15. Hey Frederic,
    Thanks for catching that. I have updated the link to the short version.

  16. I have been using the recipes. I am assuming this is the place to give comments about them.

    1) The class jdk is missing. Installing Sun’s java-6 on Ubuntu 10.04 seems to be slightly different than on 9.x, since the package has moved to a different repository.
    2) The meta resource name / @user, @group seems to have changed in puppet 0.25. I had to remove the @ to make it work.

    I am using Ubuntu 10.04 and the client that comes with it, and also the latest version for the puppet master.


    13 Jul 10 at 3:48 AM

  17. Hey Harihara,

    We use CentOS on our systems, and also to test the recipes. We haven’t tested them on Ubuntu, so thanks for doing that for us.

    Now for some answers:
    1. I have removed the reference to the jdk manifest – it uses an internal yum repository to install java, so it won’t be useful to other people. It was a slip that it got into the example nodes.pp file :)
    2. We need to test the recipes on the latest puppet. On 0.25.1 it worked ok with the @ and using realize to invoke the user/group creation. Will have more feedback on that next week, I think.

    Thanks for catching these problems, and if you have a git diff, fixing some problems, send it and we will integrate them into the main repository.

    Cristian Ivascu

    15 Jul 10 at 9:21 AM

  18. Hi
    Some more. On Ubuntu there is no ruby-shadow package and things seem to work without it.

    I am guessing that I will have to statically set up the content of the masters and slaves files. I was wondering whether I can vary the number of nodes as they come up. I have this tied to an Ubuntu PXE install.

    I am not sure how to distinguish between name nodes and data nodes. I don’t want to have to edit the nodes.pp file every time I add a node. Can you shed some light on that?


    15 Jul 10 at 6:24 PM

  19. Hi
    Some answers to my own questions :-) With the new puppet 0.25 I can specify wildcards for nodes. So I can do something like /slavenode*/ { } and /namenode*/ { } in the nodes.pp to define appropriate nodes. (http://docs.puppetlabs.com/guides/language_tutorial.html#matching_nodes_with_regular_expressions)

    I am trying to start up Hadoop on machine start, but this is failing since puppetd cannot install until after the first reboot. Will post the solution here if I find it.

    Thanks for sharing these recipes.


    16 Jul 10 at 11:46 AM

  20. Hi
    There is one problem I am facing. I changed the hadoop_datadir and hadoop_namenode variables, but these directories are still owned by root, not the hadoop user. Is this an issue that you have faced or seen before?


    17 Jul 10 at 3:06 PM

  21. I solved the problem of the data_store. Puppet version 0.25 on Ubuntu does not seem to support the concept of lists; not sure why. It works fine with a single value instead of a list.

    I had a vague problem to solve. I had to remove the word ssh-rsa from the template file authorized_keys (i.e. templates/keys/ssh_keys/authorized_keys) for the ssh to work.


    19 Jul 10 at 2:10 PM

  22. […] with access times under 50 milliseconds. Adobe software engineers has developed what they term “Hstack”, in which they have integrated HDFS, HBase and Zookeeper with a Puppet configuration management […]

  23. […] in Java, cloud computing. Leave a Comment Interesting article @ Infoq and at Adobe and here @ hstack . This looks like a nice capability. I can personally speak to the amount of pain involved with […]

  24. Re #3 in “How we Use Puppet”, take a look at MCollective. Among other things it lets you trigger puppet runs on demand on sets of machines based on metadata (e.g., “all dev machines”, etc.).

    Jeff Sussna

    26 Sep 12 at 6:04 PM

  25. Excellent post. thanks!

  26. Thanks for sharing this- good stuff! Keep it up the great work, we look forward to reading more from you in the future!

    DevOps Consult

    3 Mar 17 at 8:50 AM

  27. Fantastic post, thank you for such clear explanation.
