hstack

Blogging about the Hadoop software stack

HBase Performance Testing


Performance is one of the most interesting characteristics of a system’s behavior. It’s also challenging to talk about, because performance measurements need to be accurate and in depth.

Our purpose is to share our reasons for doing performance testing, our methodology as well as our initial results, and their interpretation. Hopefully, this will come in handy for other people.

The key take-aways here are that:

  • Performance testing helps us determine the cost of our system; it helps size the hardware appropriately, so we don’t introduce hardware bottlenecks or spend too much money on expensive equipment.
  • A black-box approach (looking only at top-level test results such as average response time) is not enough. You need to validate the results by doing an in-depth analysis.
  • We test not only our software, but try to look at all the levels, including libraries and operating system. Don’t take anything for granted.

Reasons

So, why do we run performance tests? Apart from the peace of mind of knowing whether our solution is good enough :D, here are some pragmatic reasons that will hopefully explain what we measure and how:

To keep our customers happy, our service needs to work fast. The hardware has to be used efficiently (more bang for your buck); performance tests help us detect the limits of our entire stack (software / operating system / hardware), as well as estimate and plan capacity.

Unfortunately, testing only isolated components can hide issues that appear only in a distributed context. Cluster load tests are sometimes the only way to ensure that the system behaves properly.

When you provide a service, you need to think about KPIs and SLAs. You have to tell your prospective clients how fast your system is, and what performance they can expect under a specific load. You can only guarantee performance under a certain load, so you have to design your cluster for it.

We run these tests regularly, so we can compare performance. This allows us to maintain the initial performance baseline, or at least know how our changes impact it.

Our view on types of performance tests

We run several kinds of performance tests:

  1. Benchmark: we run these light tests automatically, on the continuous integration server. They give us “comparative” results between any pair of HStack builds / deployments: a high-level, rough measurement of the performance increase / decrease from commit to commit. So, if anybody adds a synchronized statement that serializes everything, we’ll know :).
  2. Load testing: tests should emulate a load similar to the normal or peak production conditions. We first run a performance test with 1 client. Then, when running with multiple clients, we can examine the behavior under concurrent access. Up to a threshold, the results should be the same.
  3. Stress testing: test with harder conditions than the normal load, and determine the limits of the system. Sometimes the resources of the system are intentionally reduced, to see the effect of the load. How does the system behave once you pull the plug on a server? Or three?
  4. Longevity tests: run a set of tests (for example, load tests), for a long time, to see how the system behaves under sustained load (for days). Does Java GC come into play? Will it swap?

We have two main use cases we care about: real-time read / write, and processing (map/reduce jobs).

Random Read / Write Results

All of our tests have been performed on our 7-server cluster, using Grinder. Each machine has dual Nehalem CPUs (8 cores), 32GB of DDR3 RAM and 24 x 146GB SAS 2.0 10K RPM disks.

The HBase test table is a long table [1] with 3 billion records, in about 6600 regions. The data size is between 128 and 256 bytes per row, spread over 1 to 5 columns. We’ll probably start testing with 1KB records for the next tests.

We use 4 injector servers (the thread count in the test results below is split equally between them). The injectors hit an F5 load balancer that sits in front of our cluster. We run get, put and combined tests (a mix of 70% gets and 30% puts).

We use a uniform random distribution to generate the keys for reads and writes, trying to hit a balanced amount of “hot” and “cold” data [2]. We make sure that the data set is at least one order of magnitude larger than the cluster’s RAM capacity, in order to exercise the disks. Another good approach is to try to hit 100% I/O utilization on your cluster’s disks (or as high as possible), to guarantee that you understand the load of your system when dealing with a real-world data set size.
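
To make this concrete, here is a minimal sketch (in Python, not the actual Grinder / Jython code we run) of how uniformly distributed keys and the 70% / 30% get / put mix could be generated. The fixed-width key layout is an assumption for illustration only; the post does not describe our real key format.

    import random
    import struct

    TOTAL_ROWS = 3 * 1000 * 1000 * 1000   # the 3 billion record test table
    GET_RATIO = 0.7                       # 70% gets / 30% puts in the combined test

    def random_row_key():
        # Uniform random row id, so "hot" and "cold" regions are hit alike.
        row_id = random.randint(0, TOTAL_ROWS - 1)
        # Hypothetical key layout: fixed-width big-endian id, so keys spread evenly.
        return struct.pack('>q', row_id)

    def next_operation():
        # Decide whether the next request is a get or a put (70/30 mix).
        return 'get' if random.random() < GET_RATIO else 'put'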

The load tests were performed using HBase 0.21 (trunk), the version available on 02.25.2010, running on Hadoop 0.21 trunk (likewise, the version available around that date). We’ll publish updates here or in another post with results from newer versions.

Type      | Threads | Mean Time (ms) | Std. Dev. (ms) | 90th Percentile (ms) | Throughput [3] (req/s) | Number of Requests
Get       |     300 |          17.87 |          29.87 |               38.793 |                  16788 |            2999374
Put       |     300 |           7.50 |          15.35 |               12.096 |                  40000 |            2999978
Get / Put |     300 |          14.62 |          15.83 |               33.976 |                  20520 |            2999994
Get       |     600 |          55.68 |          66.92 |              161.571 |                  10776 |            5999737
Put       |     600 |           9.96 |          17.08 |               15.812 |                  60240 |            5999891
Get / Put |     600 |          26.48 |          33.10 |               64.336 |                  22659 |            5999783
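
As a sanity check, the throughput column follows directly from the mean request time and the thread count (see footnote [3]); a few lines of Python reproduce the table values:

    # Throughput = (1000 / mean request time in ms) * number of threads (footnote 3).
    results = [
        ('Get',       300, 17.87),
        ('Put',       300,  7.50),
        ('Get / Put', 300, 14.62),
        ('Get',       600, 55.68),
        ('Put',       600,  9.96),
        ('Get / Put', 600, 26.48),
    ]

    for kind, threads, mean_ms in results:
        throughput = (1000.0 / mean_ms) * threads
        print('%-9s %4d threads: %7.0f req/s' % (kind, threads, throughput))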

Here’s what the mean request time distribution looks like overall:

And here’s what it looks like zoomed in on the requests that took less than 20 ms:

Mean request time by itself doesn’t say much about the real performance of your system. We also output data distribution indicators.

The standard deviation can expose problems hidden by the average, such as a big skew between request times (take two data sets, [5ms, 5ms] and [0.1ms, 9.9ms]: both have the same average of 5ms, but very different standard deviations).

We calculate the 90th percentile line [4] with R. This eliminates the longest-running 10% of requests from the data set, which can distort the average. R is also used to generate the histograms with the entire distribution of request times.
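
The computations themselves are simple; here is a rough Python equivalent of these indicators, for illustration only (the actual analysis is done in R), applied to the toy data sets from the standard deviation example:

    import math

    def mean(xs):
        return sum(xs) / float(len(xs))

    def std_dev(xs):
        m = mean(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / float(len(xs)))

    def percentile_line(xs, p=0.90):
        # Nearest-rank percentile: the smallest value >= p of the ordered data set.
        ordered = sorted(xs)
        return ordered[int(math.ceil(p * len(ordered))) - 1]

    # Same mean, very different spread:
    print('%.1f +/- %.1f' % (mean([5.0, 5.0]), std_dev([5.0, 5.0])))   # 5.0 +/- 0.0
    print('%.1f +/- %.1f' % (mean([0.1, 9.9]), std_dev([0.1, 9.9])))   # 5.0 +/- 4.9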

Statistics can help you augment the initial results. A basic data set can be “enriched” and carved up in different ways that help you draw smarter conclusions. The least we can do is look at some indicators and decide if we are measuring the right things.

Overall, writes are faster than reads. See here for an explanation of the write path in HBase. Basically, the WAL is an HDFS file, which is kept in memory on 3 replicas (depending on your replication factor), and flushed asynchronously. What we’d like to see here in terms of performance is single-row write latency approaching the hard drive seek time.

For reads, the average request time increases with the number of threads [5]. This is somewhat expected, but only beyond a certain number of threads; ideally, we’d like to see performance hold up (near-constant response times) until, let’s say, 200 concurrent clients. In our case, the request time growth is pretty much linear with the number of threads. Corroborated with low utilization on the cluster, this leads us to believe that there are still contention points in the HBase code. We solved some of them, but there is still work to be done here.

Lies, damned lies, and statistics

In and of itself, saying “we can do X requests/sec with a 95th percentile line of Y ms” won’t tell you much about what’s happening under the covers. The kind of test that we showed above is much like a black box. The indicators you get or compute can tell you, at best, whether your system is “broken”. There are other things that we’d like to know, the most important being hardware usage.

The numbers are the tip of the iceberg; things become really interesting once we start looking under the hood, and interpreting the results.

When investigating performance issues you have to assume that “everybody lies”. It is crucial that you don’t stop at a simple capacity or latency result; you need to investigate every layer: the performance tool, your code, their code, third-party libraries, the OS and the hardware. Here’s how we went about it:

The first potential liar is your test, then your test tool – they could both have bugs, so you need to double-check.

After you make sure that your tests are correct, you need to determine the testing tool overhead. For example, in Grinder, you write a test function in a Jython class, which Grinder wraps and instruments in order to get performance statistics. We added our own “witness” in the innermost loop of the test function, timing it with java.lang.System.nanoTime, to make sure that Grinder’s numbers are correct.
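
For reference, a stripped-down Grinder script in that style might look like the sketch below. ServiceClient and its get() method are hypothetical stand-ins for our real client, and the witness timer is the nanoTime check described above:

    # Grinder 3 runs this under Jython; ServiceClient / get() are placeholders.
    from net.grinder.script import Test
    from net.grinder.script.Grinder import grinder
    from java.lang import System
    import random

    from serviceclient import ServiceClient   # hypothetical client wrapper

    get_test = Test(1, "Random get")
    client = get_test.wrap(ServiceClient())   # Grinder instruments every call

    def random_row_key():
        # Uniform random key over the 3 billion row test table.
        return "%020d" % random.randint(0, 3 * 1000 * 1000 * 1000 - 1)

    class TestRunner:
        def __call__(self):
            key = random_row_key()
            before = System.nanoTime()         # the independent "witness" timer
            client.get(key)
            elapsed_ms = (System.nanoTime() - before) / 1e6
            # Log the witness time so it can be compared with Grinder's own stats.
            grinder.logger.output("witness get time: %.3f ms" % elapsed_ms)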

As I mentioned, our service is a thin layer, a shim over HBase that offers a Thrift interface and a couple of other things. We did the tests both with and without this layer (Thrift), so we can spot any potential problems in it. It’s a good exercise to ask yourself: what’s the time added by each layer of my system? For us, right now, it’s negligible, but in the future we can expect this overhead to grow, and we have a baseline for its value.

Down and up again

At first we had erratic results: no two performance runs were the same, and our servers were almost idling. We started to get rid of the moving pieces in order to identify the hot spot. First we removed Thrift, then the entire testing framework, and wrote a simple test case (basically a long Java loop) which we ran on the cluster. Then we got rid of the cluster and ran the test on a single node. We profiled the code, and then got rid of the large table too. We ended up running the test program locally, on a table with one row (taking the I/O overhead out of the equation; everything was served from memory).

Finally, after looking at the profiler logs and with the help of the guys in #hbase, we got to HBASE-2180. After applying the patch, we went back up the stack and added all the layers of complexity back, one by one. In the end, server load increased, mean request time decreased and the results were consistent.

One problem with this approach is that by trying to pinpoint the issue, you can lose track of the big picture and might even end up hiding the initial cause. You have to measure performance every step of the way to make sure the problem still reproduces.

The second big problem was that sometimes the throughput went down to 0 req/s for short periods of time (blackouts). We first looked at all the logs on all the nodes, seeing some weird timeouts between them, even though the servers and network were not under heavy load at the time. After a lot of debugging we turned to the OS logs (there was nothing else left to check :) ): in the dmesg output, iptables was screaming ip_conntrack: table full, dropping packet. It turns out that on busy network servers, the firewall can get to a point where it will merrily drop connections [6], once the number of tracked connections grows past the default limit.
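
A quick way to watch for this condition is to compare the conntrack counters the kernel exposes under /proc; a rough sketch (paths vary by kernel version – older kernels use ip_conntrack_*, newer ones nf_conntrack_*):

    # Rough check for the "ip_conntrack: table full" condition described above.
    def read_counter(path):
        f = open(path)
        try:
            return int(f.read().strip())
        finally:
            f.close()

    count = read_counter('/proc/sys/net/ipv4/netfilter/ip_conntrack_count')
    limit = read_counter('/proc/sys/net/ipv4/netfilter/ip_conntrack_max')
    print('conntrack usage: %d / %d (%.0f%%)' % (count, limit, 100.0 * count / limit))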

We were so fixated on identifying problems in the upper layers that we totally forgot about the operating system. It usually IS your code, but don’t take anything for granted when identifying performance problems. Low-level components like the network and the file system play a big part in performance tests.

Performance numbers: Hard Mode

How do you determine when you hit the limits of your system? More than that, how do you identify the actual problems starting from the initial results? Simply pushing on until something crashes or stops won’t do it. Worse, it won’t tell you anything about whether you are hitting the hardware’s sweet spot, or whether you are spending money in the right place.

In order to identify the limits or bottlenecks you need a way to look inside the system – something like an x-ray. For code there are profilers, instrumentation and log files. Linux and its derivatives have plenty of tools that you can use to see how CPU, memory and I/O cycles (both disk and network) are spent. We use an ad-hoc combination of top, htop, iostat, dstat, netstat, iptraf and sar. Most of these are frontends to /proc, and functionality is duplicated between them. We look at different things, like overall network bandwidth used, CPU %, number of page faults, average I/O queue length, etc.

Can you give me an example?

Here are some interesting statistics obtained by real-time monitoring of our cluster when doing load tests:

  • During writes, the most relevant statistic is the network bandwidth: 600 Mbps – 800 Mbps, peaking over 1000 Mbps, with ~670 Mbps incoming traffic [7]. The iostat service time (svctm) rises to a couple of milliseconds, CPU usage is split in half (50%-50%) between user and kernel space, and sits overall at 30% (or 70% idle).
  • Cumulative read/write statistics are about 300-600 Mbps, peaking at 900 Mbps.
  • During read tests, things are different: network bandwidth utilization is about 10 times lower (60 Mbps) – a row is small, and there is no region reassignment churn. The iostat await time is about 7ms, the average queue size is about 1, and again the CPU is mostly idle.
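
Most of these numbers come straight out of /proc counters. As a toy illustration (not part of our actual tooling) of what frontends like top and dstat do, the overall CPU busy percentage can be derived from two readings of /proc/stat:

    # Overall CPU busy % from two /proc/stat readings (Linux only).
    import time

    def cpu_times():
        stat = open('/proc/stat')
        fields = stat.readline().split()   # "cpu user nice system idle iowait ..."
        stat.close()
        values = [int(v) for v in fields[1:]]
        return values[3], sum(values)      # (idle time, total time)

    idle_1, total_1 = cpu_times()
    time.sleep(1)
    idle_2, total_2 = cpu_times()

    busy = 100.0 * (1.0 - float(idle_2 - idle_1) / (total_2 - total_1))
    print('CPU busy: %.1f%%' % busy)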

Locks, or rather sub-optimal locks in the code, are an “invisible resource”. Utilization patterns where CPU is low, network bandwidth is low and the disks can keep up with the requests indicate some type of resource or lock contention in the system.

Another problem we identified using iostat was that one hard drive had twice the number of requests and twice the wait time. It was due to a bug in our kickstart [8] bootstrap scripts, which, instead of setting up one partition on each drive, created two partitions on the second-to-last drive. Performance testing not only gives you a baseline, it also helps validate the configuration of your system.

Map-Reduce

Map-Reduce performance, and implicitly scan performance, is very good. For testing, we just run the map-reduce job itself; we have enough data and machines that we don’t think we need another test harness. We simply run the map-reduce jobs and compute statistics at the end.

Our real-world jobs crunch through a large table with ~530M large rows [9]. For map/reduce, we are crunching through this data at about ~50000 req/s, and we are able to process the entire table for statistics in about 2h 40m. For smaller rows (like reading all the rows in the 3 billion record test table), we see 200000 req/s throughput (peaking at 400000 req/s).

Tight Hadoop integration is one of HBase’s core features, and we want to see this translated into great Map/Reduce performance. The most important statistic here is CPU usage, which is high (60%-70%) with all cores used, followed by network bandwidth utilization [10].

Cost

We are very careful about what we spend our money on. Under-dimensioned hardware is a performance killer, but over-dimensioned hardware increases our costs. There’s no point in sticking 20 hard drives in a server if the CPU / mainboard can’t handle it. In the same vein, huge processing power and memory are useless when your network communication is strangled by a 100Mbps network card, or even worse, by another piece of poorly dimensioned equipment down the road (core switch, etc.). All the components of a machine (and by extension, of the cluster) have to be sized correctly.

Our configuration uses 24 hard disks per server. Increasing the number of spindles significantly improves I/O by parallelization. But there’s always a push / pull relation between failover, performance and costs.

With lots of nodes, processing power (MapReduce) is higher, but you use more power and network traffic in the data center (which translates into bigger costs). On the other hand, with a smaller number of beefier servers, power utilization is lower, but you run the risk of big churn during failover (when a machine dies) [11], because there’s a lot of data that needs to get re-balanced in HDFS.

The point is that we’re careful about costs. We try to be thorough, get an in-depth look on how our hardware is performing, and decide whether we need to upscale or downscale our components for optimal usage.

Therefore, in our production cluster we decided to go with 16 2.5″ disks instead of 24, trading spindles for lower power usage. 16 disks are enough for our current usage, and we can always add more when we run out of space.

Hammers, anvils and other assorted sundries

We use a bunch of great tools that really made life easier for us:

The main performance tool that we use is Grinder. Grinder lets you write your tests in Jython and plug them into whatever you want to test (DNS speed or weird protocols? No problem). HTTP support is built in, and it has a console that lets you orchestrate your tests across multiple injector nodes. Grinder has a fairly steep learning curve, but it’s worth it, and we highly recommend it.

I already mentioned all the system monitoring tools that we use. When the performance tests are finished, we use R to load the data sets. R is great because it allows for an incremental approach to data analysis. It’s really fast, and you can dice the data, look at the distributions, eliminate outliers, etc. Also highly recommended.

More specific details on how we actually use everything, with recipes, will follow in a later post. Also, now that YCSB has just come out as open source, we’ll give that a go as well.

Conclusions

We’re happy with our current results. We know that there are a lot of small things we need to do that will improve performance. Our current focus is to shape the whole load test methodology so that we can decide on the best cost / performance hardware configuration.

Andrei.


  1. Because of the columnar storage, tables grow in both dimensions, like an Excel sheet.
  2. The initial 3 billion rows in the table were added with a map/reduce job.
  3. The throughput in the table is computed using the formula (1000 / mean request time in ms) * number of threads.
  4. What is the smallest number bigger than 90% of my (ordered) data set?
  5. The growth correlation holds throughout; we only show some of the results here. We have more, but it’s pretty much more of the same.
  6. You can easily increase the maximum number of tracked connections, but be aware that each tracked connection eats about 350 bytes of non-swappable kernel memory!
  7. These numbers should be taken with a grain of salt: we have 1Gbps network equipment, so theoretically we should hit a practical limit at about 800 Mbps. The “over 1000Mbps” results are probably a glitch in iptraf.
  8. We have a script that generates kickstart files according to the machine configuration. The kickstart files install the base CentOS image and set up the partitions using the RAID card’s command line tool.
  9. The rows represent photo metadata, and most of them have EXIF information, including EXIF comments, etc.
  10. HBase does not yet have data locality, but after running the cluster for a while, due to compactions, most regions will be local to their regionserver.
  11. When a machine in an HBase cluster dies, the master reassigns the regions that were served by that machine elsewhere. All the machines run both a regionserver and an HDFS datanode, and the dead node’s HDFS blocks also get re-replicated.

Written by Andrei Dragomir

April 26th, 2010 at 8:18 pm



17 Responses to 'HBase Performance Testing'


  1. Thanks Andrei for the great article! This is very helpful.

    I’ve got a question for you. (It’s from a friend of mine in Tokyo who has just started a project with HBase and Hadoop MR.)

    So his question is: how do you estimate the appropriate number of file descriptors you’ll need? Does the formula in HBase FAQ #6 just work for you?

    http://wiki.apache.org/hadoop/Hbase/FAQ#A6

    Thanks,

  2. Hi, I’ve got an additional question. While you were running the days-long test, how did the number of file descriptors vary over time? Was it always related to the number of open regions, or did you get any garbage(?) descriptors?

    I think the guy is interested in what should be taken care of in the production environment, and file descriptors are one of the things they might want to think about.

    Thanks,

  3. Sorry, it seems I failed to post my first question for some reason. Here it is:
    ——–
    Thanks Andrei for the great article! I’ve got a question for you; it’s from a friend of mine in Tokyo who has just started a project with HBase and Hadoop MR.

    So how do you estimate the appropriate number of file descriptors you need? Does the formula on HBase FAQ #6 just work for you?

  4. We have run into issues with file descriptors, but only in the context of running map-reduce jobs. The settings that we use right now on our cluster are:

    dfs.datanode.max.xcievers = 10000, for Hadoop open file connections
    hbase.regionserver.handler.count = 100, for concurrent HBase connections
    the ulimit patches for the number of open files

    We’re monitoring HBASE-24, which should bring an end to the problem.

    Andrei Dragomir

    27 Apr 10 at 10:33 AM

  5. Thank you! I just started to watch HBASE-24 as well.

    I’m just curious; what are these ulimit patches? Maybe you can give me a link to one of them as an example?

  6. Well, basically it’s just like in the HBase FAQ:

    In /etc/security/limits.conf, we have a line that reads:

    “hadoop hard nofile 200000”

    What this says: the hadoop user can have at most (hard limit) 200000 files (nofile) open at any one time.

    Andrei Dragomir

    28 Apr 10 at 1:56 PM

  7. OK. I was a bit confused; I thought there might be some Linux kernel bugs regarding ulimit. But I was wrong. Thanks a lot!

    Tatsuya Kawano

    28 Apr 10 at 2:11 PM

  8. Forgive a real beginner’s question. What is it that is being tested here? Is a “get” simply a “read”, or is it a “locate and read”? That is, are the times reflected in the charts the times of the full task of finding an arbitrary record and reading it, or the times of reading records in a sort of “scoop”?

    Thanks for the help.

    Bill Gorman

    6 Aug 10 at 2:41 AM

  9. Bill,

    Get is the actual get method exposed by the HBase API. It will retrieve a “record” from a table based on its row, column and qualifier.

    Forgive me if I haven’t correctly understood your question, but in order to get the actual record you need to locate it and then retrieve its content. You can read Lars George’s article on HBase storage to get a better idea: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html.

    Cosmin Lehene

    28 Sep 10 at 9:47 AM

  10. Useful article, thank you. This word, “thunkgs” is probably a misspelling of “things”.

    Mark Kerzner

    6 Jan 11 at 11:44 PM

  11. We have javascript send user behavior messages to log files, so to do analytics we want to parse and report on these log files from HBase.

    How would you max out the write-rate and read-rate of log files, where sometimes you’d want to do range-scans over time intervals? I’m thinking like 100 GB of parsed log files going to disk per day, being read in chronological order, where sometimes we’d want to write super fast to do imports of weeks or months of log files as well.

    Because we want to do time range queries, the keys would need to contain a time stamp. In order to do simultaneous writes to different regions, we would need to prepend a random alphanumeric sequence to each key.

    Would it be good to have one hash key correspond to one client writing on one region server? Or what ratio would be recommended?

    Thanks for your thoughts.

    David Beckwith

    10 Mar 11 at 9:51 AM

  12. Thank you for your great article! It’s very useful. I have got 2 questions for you.

    1. What is the max heap size of each regionserver in your HBase cluster?
    2. It sounds like you have tested Thrift. Does it perform well? As far as I know, it’s usually slower than the Jython or native Java client API of HBase, as mentioned in this article: http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html

    Liang YI

    2 Apr 11 at 10:20 AM

  13. Hey David,

    Maxing out the write rate when doing imports depends on your key distribution. If you’re using Hadoop to do the imports, then by default the reducers will get the reduce keys in order. This can lead to hot regions in HBase. There are several ways to improve this; in all cases it’s best to know what the key distribution looks like.
    You could pre-create regions based on the distribution to ensure your writes will go to multiple servers from the beginning. Check out createTable in HBaseAdmin; there’s a variant that accepts a splits parameter and will create regions with start keys from the splits array. If you’d like to avoid writing contiguous records and increase write distribution across regions, you could use a hash for the reduce key so that reducers won’t write consecutive records, but rather distribute them evenly across region servers.

    Another way to do large imports is to use the HFileOutputFormat. You could set the output of your map reduce job to be HFOF and you’ll end up writing HFiles (this is what HBase uses to store its data) directly. Since you’re doing a time-series import, you don’t need to go through the actual RegionServer, and you get a huge performance boost by writing directly to HDFS. Then you could load the HFiles into an existing table or a new one. This is probably the fastest way to do bulk imports.

    In general, if reads are the largest share of the operations you’ll do, you need to optimize your schema for that.
    If you want to do range scans over time intervals, consider using a yyyyMMdd+HASH key pattern. This way you’ll easily be able to scan an entire day, month or year. You’d need some form of unique identifier anyway to avoid key collisions.
    Also, you could pre-aggregate some of the data to gain speed for the reports you’ll run most often.

    Here are two relevant threads on the hbase mailing list. If you’d like to get feedback on your specific use case, just shoot an email on the list.
    One about LoadIncrementalHFile http://search-hadoop.com/m/Tix7HljaWr1
    The other about speeding up HBase when doing heavy writes http://search-hadoop.com/m/fJ0vh6ojHm1

    Cosmin Lehene

    3 Apr 11 at 12:01 PM

  14. Hi Liang,

    I think for these tests HBase ran with 16GB. It depends on what you run on the machine (e.g. you’d want to have 1GB for each mapper and reducer, etc.)
    We use Thrift for our APIs. It doesn’t add too much overhead and we can easily generate clients for all the languages we need (we mostly use Java, C#, Ruby and Python).

    Cosmin Lehene

    3 Apr 11 at 12:05 PM

  15. Hello,

    can you please explain in more detail how you are doing even key distribution in order to create an even load distribution for each node?

    Thanks!

    John

    9 Apr 11 at 2:00 AM

  16. Hi John,

    Key distribution techniques deserve a post of their own. However, if you start with a single region, your writes will still go to a single regionserver. Otherwise, one technique is to play with the reduce keys so that they are not sorted in the same order as the HBase keys.

    Cosmin Lehene

    26 May 11 at 11:24 AM

  17. This was an interesting read regarding the behaviors involved in testing and performance. Thanks for the great article as well as the infographics.

    Danny Yels

    20 Aug 13 at 3:57 AM
