How we work
We usually develop against trunk code (for both Hadoop and HBase) using a mirror of the Apache Git repositories. We don’t confine ourselves to released versions, because we implement fixes and there are always new features we need or want to evaluate. We test a wide variety of conditions and find a variety of problems, from HBase or HDFS corruption to data loss. Usually we report them, fix them and move on. Our latest headache from working with unreleased versions was HDFS-909, which corrupts the NameNode “edits” file by losing a byte. We were comfortable enough with the system to fix the “edits” binary file manually in a hex editor so we could bring the cluster back online quickly, and then track down the actual cause by analyzing the code. It wasn’t a critical situation per se, but this kind of “training” and deep familiarity with the code gives us a certain level of trust in our ability to handle real situations.
It’s great to see that it gets harder and harder to find critical bugs these days. However, we still brutalize our clusters and take all precautions when it comes to data integrity[1].
Testing distributed systems is hard. There aren’t many tools or resources. There are some promises for performance and scalability benchmarking tools, thanks to Yahoo! (we too await the open-sourcing of the YCSB tool), but right now you have to roll your own, and it takes time. There’s no clear test plan for distributed systems, no failover benchmarking tool, nothing on reliability, availability or data consistency.
Your system can fail no matter how well you thought you tested it, even if it’s sunny outside and you’re throwing a party (especially then). Google, Twitter, Amazon: all have had downtime. Everyone fails once in a while. It’s only a matter of time, and users tend to tolerate short downtime or degraded performance. What users will not tolerate is losing their data[2]. We are completely paranoid about losing data. Other failure scenarios that result in degraded performance or even a little downtime are bearable; losing data is not.
We try to learn our systems by heart and be able to fix anything fast, or even while the system is running. After more than a year, we’re OK with keeping data for our clients[3], but we’re still testing and taking precautions.
Really thorough testing of ALL the solutions that we use has paid off. It’s a lot of work, because you have to build the scaffolding for it, but going back and forth keeping up with the changes and pushing fixes is a sure way to know the system in depth.
Demystifying HBase Data integrity, Availability and Performance
It’s good to know the strengths of a system, but it’s more important to be aware of and understand its limitations. We have extensive suites of tests covering both. Testing performance is pretty straightforward, but testing data integrity is the hardest part, and it is where we spend most of our time.
Integrity implies the data has to reach “safety” before confirming the request to the client, regardless of any hardware failures that might happen.
HBase confirms a write after its write-ahead log reaches 3[4] in-memory HDFS replicas[5]. Statistically, the probability of losing 3 machines at the same time is rather small[6] (unless all of your racks are on the same electrical circuit, power transfer switch, PDU or UPS). HDFS is rack aware, so it will place your replicas on different racks. If you place and power your machines correctly, this is safe in most cases. If this turns out not to be enough for our clients in certain critical applications, we will come up with stronger guarantees (e.g. making sure that data is flushed to disk on all 3 replicas, which is not available today).
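As a back-of-the-envelope illustration of why 3 replicas are usually considered enough, here is a minimal sketch. It assumes that nodes fail independently and that each node has the same probability of being down during the window in which a write lives only in memory. Both the function name and the 0.1% figure are hypothetical values chosen for illustration, not measurements from our cluster, and the independence assumption is exactly what breaks when racks share a circuit or UPS.

```python
def p_lose_all_replicas(p_node_down: float, replicas: int = 3) -> float:
    """Probability that every replica of a write is lost at the same time.

    Assumes independent node failures, each with probability p_node_down
    (a hypothetical input, not a measured value).
    """
    return p_node_down ** replicas


# Hypothetical: each node has a 0.1% chance of dying in the critical window.
p = 0.001
with_three = p_lose_all_replicas(p, replicas=3)
with_four = p_lose_all_replicas(p, replicas=4)

print(f"3 replicas: {with_three:.1e}")
print(f"4 replicas: {with_four:.1e}")
```

Under these assumptions a fourth replica shrinks an already tiny number further, while costing a third more storage and write traffic, which is the trade-off the replication factor default reflects.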
There are many questions that arise even if you do flush to disk. In a full power loss scenario, even if you flush to disk you need to consider OS cache, file system journaling, RAID cache and then disk cache. So it’s debatable whether your data is safe after a flush to disk. We use battery-backed write cache RAID cards that disable the disk caches. However, we’d rather make sure our racks are powered correctly than rely on disk flush.
Most of our development efforts go towards data integrity. We have a draconian set of failover scenarios. We try to guarantee every byte regardless of the failure and we’re committed to fixing any HBase or HDFS bug that would imply data corruption or data loss before letting any of our production clients write a single byte.
We feel the same about availability. When a machine dies, the data served by that box is unavailable for a short window[7]. This is a current limitation. We know how to make the system 100% available when you lose a box or two, but it’s a matter of priorities, and we have chosen not to put effort into it yet. The reality is that we can afford a short[8] downtime for data partitions, as long as we don’t lose any data. We also don’t expect machines to fail too often.
So, for us, juggling with Consistency, Availability and Partition tolerance is not as important as making sure that data is 100% safe.
HBase[9] performance is good enough for us. That is, it’s more than we need right now. Would you strive to reach 5ms read performance instead of 10ms, or 1 second max unavailability instead of a few minutes when a server crashes, if you couldn’t guarantee data safety? We wouldn’t. You might accept a credit card failure once, but you would never accept your accounts being wiped out. So, we choose to spend our resources on ensuring data integrity.
Getting close to the 7ms average disk response time[10], for small records, is possible with the current architecture. As always, the devil is in the implementation details. The architecture promises linear scalability, but it’s the implementation that makes it reliable. Moreover, we all know that data isn’t accessed uniformly at random; that is the worst case scenario. We get ~1ms reads for data in memory today, and the read performance and throughput can be improved tenfold by adding caching. (see this article)
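To make the effect of caching on read latency concrete, here is a minimal sketch of the usual weighted-average model. The 1ms memory and 7ms disk figures are the ones quoted above; the function name, the hit ratios and the assumption that a read is served either entirely from memory or entirely from disk are illustrative simplifications.

```python
def mean_read_latency_ms(hit_ratio: float,
                         memory_ms: float = 1.0,
                         disk_ms: float = 7.0) -> float:
    """Average read latency when a fraction hit_ratio of reads hits memory.

    Simplified model: every read costs either memory_ms or disk_ms.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * memory_ms + (1.0 - hit_ratio) * disk_ms


# Uniform random reads over far more data than RAM: almost no cache hits.
print(mean_read_latency_ms(0.0))
# A skewed workload where half of the reads hit hot, cached rows.
print(mean_read_latency_ms(0.5))
```

The model only covers latency. Throughput gains from caching can be larger, because every cache hit also frees a disk seek for some other request.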
Our performance results are notably better than the ones in the YCSB test paper. That’s for another post though.
Availability and Random Read Performance are possible “limitations” that we are OK with (for now); we are extremely happy with random write and sequential read performance against billions of rows[11], however.
While you can cache for reads, scaling writes is harder. Write performance and sequential read performance enable two of the most important use cases: heavy write volumes[12] and efficient distributed processing.
HBase has great random-write performance. We are using HBase 0.21, which DOES sync the write-ahead log after every put call in the RegionServer, so the data is in the write buffer of at least 3 nodes[13]. In an RDBMS, for example, you can replicate the data for improved read performance, but you can’t scale writes and total data size unless you partition it. And when you partition the data you lose the original properties, such as transactions and consistency, and your operational costs can skyrocket.
As systems mature, great write performance will not be solely an HBase advantage; we expect other storage systems to reach this performance, just as we expect HBase to reach optimal random-read performance.
But what use is there in being able to keep such large amounts of data without being able to process them efficiently?
Sequential Reads (Scans)
Again, our tests using MapReduce show great performance. HBase is built on the Bigtable architecture, which was designed to work with MapReduce, and that also makes it a great fit for OLAP. Data location is deterministic and sequential rows are stored sequentially on disk, so HBase can read every 256 MB (configurable) of your table in a single request because the data is not fragmented. It can do this in parallel too, so given enough processing power you can have each disk reading at full throttle.
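A rough way to reason about full-table scan time under this model is sketched below. It is an idealised estimate only: it assumes one scanner per 256 MB chunk, chunks spread evenly over the disks, and disks that sustain their full sequential throughput. The function name, the table size, the disk count and the 64 MB/s figure are hypothetical inputs, not measurements.

```python
import math


def scan_estimate(table_mb: float, disks: int, disk_mb_per_s: float,
                  region_mb: int = 256) -> tuple[int, float]:
    """Return (number of chunks, idealised scan time in seconds).

    Assumes perfectly parallel, unfragmented sequential reads, with at most
    one reader per chunk and one chunk per disk at a time.
    """
    regions = math.ceil(table_mb / region_mb)
    parallel_readers = min(regions, disks)
    seconds = table_mb / (parallel_readers * disk_mb_per_s)
    return regions, seconds


# Hypothetical cluster: a 2560 MB table, 10 disks at 64 MB/s each.
regions, seconds = scan_estimate(table_mb=2560, disks=10, disk_mb_per_s=64)
print(regions, seconds)
```

Real scans will be slower because of CPU, network and task scheduling overhead, but the estimate shows why adding disks shortens a scan only as long as there are enough chunks to keep them all busy.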
HBase is an inherently consistent system. After you write something, modifications are immediately available. You can’t get stale data, and you don’t have to reason about quorum reads. We think consistency is good, for a multitude of reasons: if you write an application over a consistent system, application logic is much simpler. You don’t have to take stale data into account; it’s just like single threaded programming: you’re going to read what you’ve written earlier[14]. Also, consistency is a solid base on which to build more complex primitives: transactions and indexes, increment operations, test-and-set semantics, etc.
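The primitives mentioned above are easy to express once reads always see the latest write. The toy in-memory store below is a sketch of that idea only; the class and method names are made up for illustration and are not the HBase client API. A single lock stands in for the guarantee that one server owns each row at any time.

```python
import threading


class ConsistentStore:
    """Toy single-owner key/value store with read-your-writes semantics."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def check_and_put(self, key, expected, new_value):
        """Test-and-set: write new_value only if the current value is expected."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new_value
            return True

    def increment(self, key, amount=1):
        """Atomic counter: read, add and write back in one step."""
        with self._lock:
            value = self._data.get(key, 0) + amount
            self._data[key] = value
            return value


store = ConsistentStore()
print(store.check_and_put("owner", None, "alice"))  # first claim succeeds
print(store.check_and_put("owner", None, "bob"))    # second claim is rejected
print(store.increment("hits"))
```

On an eventually consistent store the same operations need extra machinery, such as quorum reads or conflict resolution, because two nodes may each believe their own claim succeeded.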
It all comes down to engineering choices: it’s a good exercise for the reader to determine if a system which defaults to eventual consistency, that can accept writes at any moment on any node (e.g. using consistent hashing) and has data fragmented across the cluster, can perform optimally when it comes to sequential reads. How much network chat is needed to do a table scan?
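One way to start on that exercise is to count how many nodes a short range scan has to contact under each placement scheme. The sketch below is a toy model built on assumptions: 8 nodes, 10,000 synthetic row keys, and a plain hash-modulo placement standing in for consistent hashing (a real ring differs in detail, but it scatters adjacent keys in the same way). All names are hypothetical.

```python
import hashlib

NODES = 8
KEYS = [f"row-{i:05d}" for i in range(10_000)]  # row keys in sorted order


def range_node(key: str) -> int:
    """Range partitioning: each node owns one contiguous slice of sorted keys."""
    index = int(key.split("-")[1])
    return index * NODES // len(KEYS)


def hash_node(key: str) -> int:
    """Hash partitioning: placement depends only on the hash of the key."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def nodes_touched(keys, placement) -> set:
    """Set of nodes that must be contacted to read all the given keys."""
    return {placement(key) for key in keys}


scan = KEYS[0:100]  # a scan over 100 consecutive rows
range_nodes = nodes_touched(scan, range_node)
hash_nodes = nodes_touched(scan, hash_node)

print("range partitioning touches", len(range_nodes), "node(s)")
print("hash partitioning touches", len(hash_nodes), "node(s)")
```

With range partitioning the 100 rows sit on a single node and can be streamed in order. With hashed placement the same rows are spread over the cluster, so the scan has to fan out to many nodes and merge the results.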
It’s all about what we think is a sensible default. Availability and partition tolerance deal with relatively isolated scenarios: you can compute the probability of losing a node or of getting your network split in two, and it’s relatively low. Consistency, however, is something you deal with in every operation you do.
This is all getting a little philosophical, but here’s a list of questions (not rhetorical ones), related to this:
An eventually consistent system could be configured to support full consistency and/or data ordering. Would this impact or degrade other attributes like availability and performance? A system that can juggle C, A and P is quite flexible. But what part of CAP do you want to support by default, and what’s the impact when you change it afterwards?
Our assumption is that building on consistency is an appealing and sound decision, and any architecture that doesn’t handle this in its default design will lose the performance and availability when forcing it later on. Partition tolerance is not something that we think is worth handling within a single datacenter (redundant datacenter equipment investments are pretty much the norm for both electricity and networking). We do however care about partitioning when doing multi-datacenter replication.
Which of C, A or P do you think will hold the greatest impact? (Hint: for us it’s C :D)
HBase’s edge is in the H
We created a fair number of tests that we maltreat our system with. It takes effort to implement correct fault tolerance and there’s an advantage in relying on Hadoop for it. Also, in the last 4 years, there was a large client base[15] that validated Hadoop by using it in their production systems (especially Yahoo!). This had a real impact on the stability and fault tolerance of the system.
Now consider a different system, built from scratch. It has to enable and test all that Hadoop does starting from 0 (again, architecture is a promise, but implementation is what you USE). In a best case scenario, this system will gather a critical mass, a community will be created, and it will evolve organically, etc. Hadoop and HBase have that today, and it’s a big advantage.
We pride ourselves in keeping up with new technologies, but we think that Hadoop and HBase are over the “safety” threshold, for what we need to do.
HBase has an “edge” in Hadoop over other technologies, in that, just like Hadoop, it fills the gap between storage scalability, fast processing and cost-efficiency. Why was Hadoop successful? We think it’s because it didn’t rely on a narrow vertical need. Hadoop did not build something that was impossible before. We had NAS systems, and OLAP cubes for data processing, but Hadoop made this possible for any development group, with little initial[16] investment, hence democratizing scalable data processing.
Many Hadoop developers are paid by companies which use Hadoop and see its value. Corporate sponsorship is a catalyst for progress in open-source systems (see MySQL, Eclipse etc.). Hadoop got started as a component in the Nutch search engine, but it was Yahoo that invested resources, and helped make it a success story.
You can dig into Hadoop’s architecture and learn how it works (just like we did), or you could take advantage of the large ecosystem around it. The community report bugs and help people get started, there are books, and even paid support (look at Cloudera).
How does this relate to HBase? HBase is the Hadoop database. It has the best Hadoop integration. It uses HDFS for storage and MapReduce for distributed processing. Once you have a Hadoop cluster, you already have one half of an HBase cluster. It’s only natural that companies that are using Hadoop will be looking at HBase, if they aren’t already using it. And, following Hadoop’s model, they will invest resources and money, adding to HBase’s momentum. The companies that use HBase today sustain the core HBase development team. We too, are contributing back to both HBase and Hadoop. It’s only natural to invest in something that supports your business.
HBase is more complex than other systems (you need Hadoop, Zookeeper, cluster machines have multiple roles). We believe that for HBase, this is not accidental complexity and that the argument that “HBase is not a good choice because it is complex” is irrelevant. The advantages far outweigh the problems. Relying on decoupled components plays nice with the Unix philosophy: do one thing and do it well. Distributed storage is delegated to HDFS, distributed processing to MapReduce, and cluster state to Zookeeper. All these systems are developed and tested separately, and are good at what they do. More than that, this allows you to scale your cluster on separate vectors. This is not optimal, but it allows for incremental investment in either spindles, CPU or RAM. You don’t have to add them all at the same time.
The HStack can be a pain to deploy. We took some time to understand the problem, and now we have Puppet recipes for everything. We can set up a cluster completely unattended. We’ll try to push all these back to the community and help other users have it easier, so stay tuned.
Zookeeper, Hadoop etc., let us implement transactions, simple queries and data processing. We want to have a system with such capabilities (these are requirements for large applications). Yes, you can drop some of them but you can’t drop them all. We don’t want a tool that’s missing too much. We want the good parts from an RDBMS, like queries and transactions, while still having distributed processing and cheap scalability. We don’t drink the “NoSQL” kool-aid. We’re not running away from SQL, we’re running towards something that is built from scratch on the premises of scalability and high availability.
About Community and Leadership
This is the biased part of the article, and it should be, because it’s about our relation with the HBase development team. Stack, Ryan, JD were always very receptive. They always help with the issues that we have, whether it’s a bug or a new feature that we need. There’s an open and democratic decision process when prioritizing work with HBase. The team is well balanced and there’s not a single company that drives HBase’s direction.
They are genuinely passionate about their work and strive to have it used by people. We attended one of their regular developer meetups and it was eye opening that developers coming from different backgrounds and companies are working together as a team. We think open-source projects benefit from good leadership and Michael Stack has done a great job with HBase.
Another aspect that appeals to us is the maturity of the development team. They focus on long term benefits. For example, the current focus is to improve the architecture of HMaster and multi-datacenter replication. However, in light of recent performance benchmark reports they took the time to understand the situation, validated with the community that it’s OK to stick to the current plan and didn’t switch focus.
Maturity is also shown in the way that the team positions HBase in relation to other competing projects; they let facts speak rather than opinions. They don’t engage in holy wars and this, to us, seems the right way to build a healthy community.
In the end we’d hope technologies wouldn’t be dismissed based on superficial or biased perception, FUD, or tweets. We don’t like it when talks are based on assumptions without knowing ALL the details of a certain problem or technical choice, and this seems to become a vicious trend in some circles. Hopefully, the reasons explained in this article can help you make your own informed assessments, and see what works for you.
1. If anyone knows how to remotely “break” a network card, or RAM stick, please, let us know :)
2. Ever heard of Sidekick? http://www.sophos.com/blogs/gc/g/2009/10/13/catastrophic-data-loss-sidekick-users/
3. By clients we mean our internal clients. Even though they have public data, our system is not publicly available.
4. “3” is also configurable; it’s the default replication factor in HDFS.
5. This behavior is only available on HDFS version 0.21.0, or 0.20.4 with patches. Take a look at HDFS-826.
6. By rather small, we mean that it’s so small that even if you use 4 replicas, the “cost” surpasses the benefit.
7. Depends on cluster load and configuration. It takes ~40 seconds for 800 regions out of 5300 when 1 out of 7 regionservers dies in our test. We used hbase.regions.percheckin at 100. We’ll do some thorough measuring as well and document it.
8. It’s usually a minute or two, but depends on how many regions need to be reassigned by HMaster. We’ll get back with some metrics on this too.
9. HBase 0.20 and 0.21.
10. It’s a fact that if you have more data than RAM, uniform random read latency approaches the storage latency, at best. Our average response time is 7ms for a 10K RPM SATA disk. We haven’t tested SSDs yet, because they are not economically viable for us right now.
11. We tested with approx 3B rows (approx two orders of magnitude more data than available RAM, so data wasn’t served from the cache). See the cluster configuration in this article.
12. Why do you need heavy write performance? See here for a description of Farmville’s architecture.
13. This behavior is only available on HDFS version 0.21.0, or 0.20.4 with patches. Take a look at HDFS-826.
14. We don’t want to push the analogy too far, but multi-threaded programming does not yet offer a simple and clear programming model: threading, actors, STM, etc. There is no clear winner, and they all make the application code complex.
15. See the Hadoop “Powered by” page, as well as the Hadoop Summit proceedings, for more companies that are using Hadoop: Visa, IBM, Reuters, NY Times, etc.
16. Of course, TANSTAAFL: if you want to do heavy processing, you need beefy machines, and lots of them.