Blogging about the Hadoop software stack

Archive for the ‘nosql’ tag

Hadoop/HBase automated deployment using Puppet

with 25 comments


Deploying and configuring Hadoop and HBase across clusters is a complex task. In this article I will show what we do to make it easier, and share the deployment recipes that we use.

For the tl;dr crowd: go get the code here.

Cool tools

Before going into how we do things, here is the list of tools we use and that I will mention in this article. I will try to put a link next to any tool-specific term, but you can always refer to each tool's home page for further reference.

  • Hudson – this is a great CI server, and we are using it to build Hadoop, HBase, Zookeeper and more
  • The Hudson Promoted Builds Plug-in – allows defining operations that run after the build has finished, manually or automatically
  • Puppet – configuration management tool

We don’t have a dedicated operations team to hand a list of instructions describing how we want our machines to look. The operations team helping us just makes sure the servers are in the rack, networked and powered up; once we have a set of IPs (usually from IPMI cards) we’re good to go ourselves. We are our own devops team, and as such we try to automate as much as possible, where possible, and the tools above help a lot.
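To give a flavor of the approach, here is a minimal sketch of what a Puppet node definition for a cluster machine might look like. The class names, parameters and hostnames are hypothetical illustrations only, not the actual recipes linked above:

```puppet
# Hypothetical node definition for Hadoop/HBase worker machines.
# Matches hosts named hadoop-slave-1, hadoop-slave-2, ...
node /^hadoop-slave-\d+$/ {
  # Illustrative class with parameters for cluster-wide settings
  class { 'hadoop':
    namenode_host => 'namenode.example.com',
    datadirs      => ['/data/1/dfs', '/data/2/dfs'],
  }

  # Illustrative role classes: which daemons run on this node
  include hadoop::datanode
  include hbase::regionserver
}
```

The point of centralizing this in Puppet is that adding a machine to the cluster becomes a matter of racking it and letting the agent converge it to the desired state, rather than following a manual checklist.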

Written by Cristian Ivascu

May 21st, 2010 at 4:38 pm

HBase Performance Testing

with 16 comments

Performance is one of the most interesting characteristics of a system’s behavior. It’s challenging to talk about, because performance measurements need to be accurate and in-depth.

Our purpose is to share our reasons for doing performance testing, our methodology as well as our initial results, and their interpretation. Hopefully, this will come in handy for other people.

The key take-aways here are that:

  • Performance testing helps us determine the cost of our system; it helps size the hardware appropriately, so we don’t introduce hardware bottlenecks or spend too much money on expensive equipment.
  • A black-box approach (looking only at headline numbers such as average response time) is not enough. You need to validate the results with an in-depth analysis.
  • We test not only our software, but try to look at all the levels, including libraries and operating system. Don’t take anything for granted.
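The second point above can be made concrete with a toy example (this is an illustrative sketch, not our actual test harness): a single slow outlier barely moves the mean, so a percentile breakdown is needed to see it.

```python
# Illustrative latency sample with one slow outlier (milliseconds)
latencies_ms = [4, 5, 5, 6, 5, 4, 250, 5, 6, 5]

mean = sum(latencies_ms) / len(latencies_ms)

def percentile(values, p):
    """Nearest-rank percentile of a list of values."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

print(f"mean: {mean:.1f} ms")                      # 29.5 ms - looks tame
print(f"p50:  {percentile(latencies_ms, 50)} ms")  # 5 ms
print(f"p99:  {percentile(latencies_ms, 99)} ms")  # 250 ms - the outlier
```

A black-box report of the mean alone would suggest everything is fine; the p99 exposes the request that a real client actually suffered through.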


Written by Andrei Dragomir

April 26th, 2010 at 8:18 pm

Posted in Uncategorized


Why we’re using HBase: Part 2

with 17 comments

The first part of this article is about our success with the technologies we have chosen. Here are some more arguments (by no means exhaustive :P) for why we think HBase is the best fit for our team. We are trying to explain our train of thought, so other people can at least ask the questions that we did, even if they don’t reach the same conclusion.

How we work

We usually develop against trunk code (for both Hadoop and HBase) using a mirror of the Apache Git repositories. We don’t confine ourselves to released versions only, because we implement fixes, and there are always new features we need or want to evaluate. We test a large variety of conditions and find a variety of problems, from HBase or HDFS corruption to data loss. Usually we report them, fix them and move on. Our latest headache from working with unreleased versions was HDFS-909, which corrupted the NameNode “edits” file by losing a byte. We were comfortable enough with the system to fix the “edits” binary file by hand in a hex editor, bring the cluster back online quickly, and then track down the actual cause by analyzing the code. It wasn’t a critical situation per se, but this kind of “training” and deep familiarity with the code gives us a certain level of trust in our ability to handle real situations.

It’s great to see that it gets harder and harder to find critical bugs these days; still, we brutalize our clusters and take every precaution when it comes to data integrity.

Written by Cosmin Lehene

March 16th, 2010 at 2:50 pm

Posted in hadoop, hbase


Why we’re using HBase: Part 1

with 11 comments

Our team builds infrastructure services for many clients across Adobe. We have services ranging from commenting and tagging to structured data storage and processing. We need to make sure that data is safe and always available; the services have to work fast regardless of the data volume.

This article is about how we got started using HBase and where we are now. More in-depth reasoning can be found in the second part of the article.

Lucky shot

If someone had asked me a couple of days ago why or how we chose HBase, I would have answered in a blink that it was about reliability, performance, costs, etc. (a bit brainwashed after answering “correctly” and “objectively” too many times). However, as the subject has become rather popular lately, I reflected more deeply on the “how” and “why”.

The truth is that, in the beginning, we were attracted to working with bleeding-edge technology, and it was fun. It was a projection of the success we were hoping for that motivated us. We all knew the stories about the Google File System, Bigtable, GMail and what made them possible. I guess we wanted a piece of that, and Hadoop and HBase were a logical step toward it.

Written by Cosmin Lehene

March 16th, 2010 at 2:45 pm

Posted in hadoop, hbase
