We’ve been building on, fixing and deploying HBase for the last 4 years.
We’ve written about why we’re using HBase but not much about what for.
Tomorrow, at HBaseCon, I’ll be talking about our low latency OLAP platform built on top of HBase.
I’ll cover both functional and technical aspects of the system and go through some of the strategies that we use to provide high-throughput, real-time OLAP queries.
If you’re attending the conference we hope to see you there.
Update: here are the slides; the video should follow soon.
Next Tuesday (31st of May 2011) we’ll host an HBase/Hadoop meetup at the Adobe office in Bucharest. We’ll have Lars George – HBase committer, author of “HBase: The Definitive Guide” and Cloudera Solution Architect for Europe – as a special guest.
Our hope is to meet more local HBase/Hadoop users and share knowledge. So if you’re using HBase or Hadoop, or plan to use them, you’re welcome.
Leave a comment if you want to sign up for a lightning talk of up to 10 minutes.
HBase intro – Lars George
Big Data with HBase and Hadoop at Adobe
Lightning talks (10m each)
HBase status and roadmap – Lars George
After: beers at Rock’n Pasta or downtown
Deploying and configuring Hadoop and HBase across clusters is a complex task. In this article I will show what we do to make it easier, and share the deployment recipes that we use.
Before going into how we do things, here is the list of tools that we use and that I will mention in this article. I will try to put a link next to any tool-specific term, but you can always refer to each tool’s home page for further reference.
- Hudson – this is a great CI server, and we are using it to build Hadoop, HBase, Zookeeper and more
- The Hudson Promoted Builds Plug-in – allows defining operations that run after the build has finished, manually or automatically
- Puppet – configuration management tool

We don’t have a dedicated operations team to which we can hand off a list of instructions on how we want our machines to look. The operations team helping us just makes sure the servers are in the rack, networked and powered up, but once we have a set of IPs (usually from IPMI cards) we’re good to go ourselves. We are our own devops team, and as such we try to automate as much as possible, and using the tools above helps a lot.
Performance is one of the most interesting characteristics in a system’s behavior. It’s challenging to talk about it, because performance measurements need to be accurate and in depth.
Our purpose is to share our reasons for doing performance testing, our methodology as well as our initial results, and their interpretation. Hopefully, this will come in handy for other people.
The key take-aways here are that:
- Performance testing helps us determine the cost of our system; it helps size the hardware appropriately, so we don’t introduce hardware bottlenecks or spend too much money on expensive equipment.
- A black-box approach (only the actual test results: average response time) is not enough. You need to validate the results by doing an in-depth analysis.
- We test not only our software, but try to look at all the levels, including libraries and operating system. Don’t take anything for granted.
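To illustrate the second point, here is a minimal sketch of one simple way to look past the average. It is not necessarily the analysis we ran: the `summarize` helper, the nearest-rank percentile method and the sample latencies are all assumptions made up for this example.

```python
import statistics


def summarize(latencies_ms):
    """Return mean, median, 99th percentile and max of response times.

    Hypothetical helper for illustration; percentiles use the
    nearest-rank method on the sorted samples.
    """
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest rank: ceil(p * n / 100), done in integer arithmetic.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.mean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# Made-up data: 98 fast requests and 2 very slow ones.
samples = [10] * 98 + [1000] * 2
summary = summarize(samples)
print(summary)
```

With these made-up samples the mean looks acceptable while the 99th percentile is two orders of magnitude above the median, which is the kind of detail a single average hides.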
How we work
We usually develop against trunk code (for both Hadoop and HBase) using a mirror of the Apache Git repositories. We don’t confine ourselves to released versions only, because we implement fixes, and there are always new features we need or want to evaluate. We test a large variety of conditions and find a variety of problems, from HBase or HDFS corruption to data loss. Usually we report them, fix them and move on. Our latest headache from working with unreleased versions was HDFS-909, which corrupts the NameNode “edits” file by losing a byte. We were comfortable enough with the system to manually fix the “edits” binary file in a hex editor so we could bring the cluster back online quickly, and then track down the actual cause by analyzing the code. It wasn’t a critical situation per se, but this kind of “training” and deep familiarity with the code gives us a certain level of trust in our ability to handle real situations.
It’s great to see that it gets harder and harder to find critical bugs these days. However, we still brutalize our clusters and take all precautions when it comes to data integrity.
Our team builds infrastructure services for many clients across Adobe. We have services ranging from commenting and tagging to structured data storage and processing. We need to make sure that data is safe and always available; the services have to work fast regardless of the data volume.
This article is about how we got started using HBase and where we are now. More in-depth reasoning can be found in the second part of the article.
If someone had asked me a couple of days ago why or how we chose HBase, I would have answered in a blink that it was about reliability, performance, costs, etc. (a bit brainwashed after answering “correctly” and “objectively” too many times). However, as the subject has become rather popular lately, I reflected more deeply on the “how” and “why”.
The truth is that, in the beginning, we were attracted to working with bleeding-edge technology and it was fun. It was a projection of the success we were hoping to have that motivated us. We all knew stories about Google File System, Bigtable, GMail and what made them possible. I guess we wanted a piece of that, and Hadoop and HBase were a logical step toward it.