Performance is one of the most interesting characteristics of a system’s behavior. It’s challenging to talk about, because performance measurements need to be accurate and in-depth.
Our purpose is to share our reasons for doing performance testing, our methodology as well as our initial results, and their interpretation. Hopefully, this will come in handy for other people.
The key take-aways here are that:
- Performance testing helps us determine the cost of our system; it helps size the hardware appropriately, so we don’t introduce hardware bottlenecks or spend too much money on expensive equipment.
- A black-box approach (looking only at the actual test results, e.g. average response time) is not enough. You need to validate the results by doing an in-depth analysis.
- We test not only our software, but try to look at all the levels, including libraries and operating system. Don’t take anything for granted.
So, why do we run performance tests? Apart from the peace of mind of knowing whether our solution is good enough :D, here are some pragmatic reasons that will hopefully explain what we measure and how:
To keep our customers happy, our service needs to work fast. The hardware has to be used efficiently (more bang for your buck); performance tests help us detect the limits of our entire stack (software / operating system / hardware), as well as estimate and plan capacity.
Unfortunately, testing only isolated components can hide issues that appear only in a distributed context. Cluster load tests are sometimes the only way to ensure that the system behaves properly.
When you provide a service, you need to think about KPIs and SLAs. You have to tell your prospective clients how fast your system is, and what performance they can expect under a specific load. You can only guarantee performance under a certain load, so you have to design your cluster for it.
We run these tests regularly, so we can compare performance. This allows us to maintain the initial performance baseline, or at least know how our changes impact it.
Our view on types of performance tests
We run several kinds of performance tests:
- Benchmark: we run these light tests automatically, on the continuous integration server. These give us “comparative” results between any pair of HStack builds / deployments. They give a high-level and inaccurate measurement of performance increase / decrease from commit to commit. So, if anybody adds a synchronized statement that serializes everything, we’ll know :).
- Load testing: tests should emulate a load similar to the normal or peak production conditions. We first run a performance test with 1 client. Then, when running with multiple clients, we can examine the behavior under concurrent access. Up to a threshold, the results should be the same.
- Stress testing: test with harder conditions than the normal load, and determine the limits of the system. Sometimes the resources of the system are intentionally reduced, to see the effect of the load. How does the system behave once you pull the plug on a server? Or three?
- Longevity tests: run a set of tests (for example, load tests), for a long time, to see how the system behaves under sustained load (for days). Does Java GC come into play? Will it swap?
We have two main use cases we care about: real-time read / write, and processing (map/reduce jobs).
Random Read / Write Results
All of our tests have been performed on our 7 server cluster, using Grinder. Each machine has dual Nehalem CPUs (8 cores), 32GB DDR3 and 24 x 146GB SAS 2.0 10K RPM disks.
The HBase test table is a long table[1] with 3 billion records, in about 6,600 regions. The data size is between 128 and 256 bytes per row, spread across 1 to 5 columns. We’ll probably start testing with 1KB records in our next tests.
We use 4 injector servers (the thread count in the test results below is split equally between them). The injectors hit an F5 load balancer that sits in front of our cluster. We run get, put and combined tests (a mix of 70% gets and 30% puts).
We use a uniform random distribution to generate the keys for reads and writes, trying to hit a balanced amount of “hot” and “cold” data[2]. We make sure that the data is at least one order of magnitude larger than the cluster’s RAM capacity in order to exercise the disks. Another good approach is to try to hit 100% I/O utilization on your cluster’s disks (or as close as possible) to guarantee that you understand the load of your system when dealing with real-world data set sizes.
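A minimal sketch of this kind of key generation (the `row-%012d` key format and keyspace constant are hypothetical, for illustration only; our actual schema differs):

```python
import random

KEYSPACE = 3_000_000_000  # ~3 billion rows, as in our test table

def random_key():
    """Pick a row key uniformly at random, so hot and cold regions
    are hit with equal probability (hypothetical key format)."""
    return "row-%012d" % random.randrange(KEYSPACE)
```

With a uniform distribution over a data set much larger than RAM, most gets miss the block cache and actually touch the disks, which is the point of the exercise.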
The load tests were performed using HBase 0.21 (TRUNK), the version available on 02.25.2010, running on Hadoop 0.21 trunk from around the same date. We’ll publish updates here, or in another post, with results for newer versions.
| Type | Threads | Mean Time (ms) | Time Standard Deviation (ms) | 90th Percentile (ms) | Throughput[3] (req/s) | Number of Requests |
|---|---|---|---|---|---|---|
| Get / Put | 300 | 14.62 | 15.83 | 33.976 | 20520 | 2999994 |
| Get / Put | 600 | 26.48 | 33.10 | 64.336 | 22659 | 5999783 |
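The throughput column can be sanity-checked against the other columns with the formula from the footnotes, `(1000 / mean request time in ms) * number of threads`:

```python
def throughput(mean_ms, threads):
    """Requests per second, assuming each thread issues one request
    every mean_ms milliseconds on average."""
    return (1000.0 / mean_ms) * threads

print(round(throughput(14.62, 300)))  # 20520, matching the 300-thread row
print(round(throughput(26.48, 600)))  # 22659, matching the 600-thread row
```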
Here’s what the mean request time distribution looks like overall:
And here’s the same distribution zoomed in on the requests that took less than 20 ms:
Mean request time by itself doesn’t say much about the real performance of your system. We also output data distribution indicators.
The standard deviation can expose problems hidden by the average: a big skew between request times. Take two data sets, [5ms, 5ms] and [0.1ms, 9.9ms]: they both have the same average of 5ms, but very different standard deviations.
We calculate the 90th percentile line[4] with R. This eliminates the longest-running 10% of requests from the data set, which can distort your average. R is also used to generate the histograms with the entire distribution of request times.
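The two indicators are easy to compute yourself; here is a sketch using Python’s standard library (we actually use R, so this is illustrative only), with the nearest-rank definition of the percentile and the [5ms, 5ms] vs [0.1ms, 9.9ms] example from above:

```python
import math
import statistics

def p90(samples):
    """Nearest-rank 90th percentile: the smallest value that is >= at
    least 90% of the (sorted) data set."""
    s = sorted(samples)
    return s[math.ceil(0.9 * len(s)) - 1]

# Same mean, very different spread -- the average alone hides the skew:
a, b = [5.0, 5.0], [0.1, 9.9]
assert statistics.pstdev(a) == 0.0             # no variation at all
assert abs(statistics.pstdev(b) - 4.9) < 1e-9  # wide spread around the mean
```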
Statistics can help you augment the initial results. A basic data set can be “enriched” and carved in different ways, that help you draw smarter conclusions. The least we can do is look at some indicators and decide if we are measuring the right things.
Overall, writes are faster than reads. See here for an explanation of the write path in HBase. Basically, the WAL is an HDFS file, which is kept in memory on 3 replicas (depending on your replication factor) and flushed asynchronously. What we’d like to see here in terms of performance is single-row write latency approaching the hard drive seek time.
For reads, the average request time increases with the number of threads[5]. This is somewhat expected, but only beyond a certain number of threads. Again, here we’d like to see almost linear performance up to, say, 200 concurrent clients. In our case, the growth is pretty much linear. Corroborated with low utilization on the cluster, this leads us to believe that there are still contention points in the HBase code. We solved some of them, but there is still work to be done here.
Lies, damned lies, and statistics
In and of itself, saying “we can do X requests/sec with a 95th percentile line of Y ms” won’t tell you much about what’s happening under the covers. The kind of test that we showed above is much like a black box. The indicators you get or compute can tell you, at best, whether your system is “broken”. There are other things that we’d like to know, the most important being hardware usage.
The numbers are the tip of the iceberg; things become really interesting once we start looking under the hood, and interpreting the results.
When investigating performance issues you have to assume that “everybody lies”. It is crucial that you don’t stop at a simple capacity or latency result; you need to investigate every layer: the performance tool, your code, their code, third-party libraries, the OS and the hardware. Here’s how we went about it:
The first potential liar is your test, then your test tool – they could both have bugs so you need to double-check.
After you make sure that your tests are correct, you need to determine the testing tool’s overhead. For example, in Grinder, you write a test function in a Jython class, which Grinder wraps and instruments in order to get performance statistics. We added our own “witness” in the innermost loop of the test function, using `java.lang.System.nanoTime` to get the test time ourselves and make sure the reported numbers are correct.
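The witness idea can be sketched like this; in CPython, `time.perf_counter_ns` plays the role that `java.lang.System.nanoTime` plays in our Jython test code (the `timed` helper name is ours, for illustration):

```python
import time

def timed(fn, *args):
    """Wrap a call with our own 'witness' timer, innermost to the work,
    so the harness's reported times can be cross-checked."""
    t0 = time.perf_counter_ns()
    result = fn(*args)
    elapsed_ms = (time.perf_counter_ns() - t0) / 1e6
    return result, elapsed_ms
```

The difference between the witness time and the tool’s reported time is the instrumentation overhead, which you want to be negligible relative to the latencies you are measuring.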
As I mentioned, our service is a thin layer, a shim over HBase that offers a thrift interface and a couple of other things. We did the tests both with and without this layer (thrift), so we can spot any potential problems in it. It’s a good exercise to ask yourself: what’s the time added by each layer of my system? For us, right now, it’s negligible, but we can expect this overhead to grow in the future, and we have a baseline for its value.
Down and up again
At first we had erratic results: no two performance runs were the same, and our servers were almost idling. We started to get rid of the moving pieces in order to identify the hot spot. First we removed thrift, then the entire testing framework, and wrote a simple test case (basically a long Java loop) which we ran on the cluster. Then we got rid of the cluster and ran the test on a single node. We profiled the code, and then got rid of the large table too. We ended up running the test program locally, on a table with one row (taking the I/O overhead out of the equation; everything was served from memory).
Finally, after looking at the profiler logs and with the help of the guys in #hbase, we got to HBASE-2180. After applying the patch, we went back up the stack, and added all the layers of complexity back, one by one. In the end, server load increased, mean request time decreased, and the results were consistent.
One problem with this approach is that by trying to pinpoint the issue, you can lose track of the big picture and might even end up hiding the initial cause. You have to measure performance every step of the way to make sure the problem still reproduces.
The second big problem was that sometimes the throughput went down to 0 req/s for short periods of time (blackouts). We first looked at all the logs on all the nodes, seeing some weird timeouts between them, even though the servers and network were not under heavy load at the time. After a lot of debugging we turned to the OS logs (there was nothing left :): in the `dmesg` output, iptables was screaming `ip_conntrack: table full, dropping packet.` It turns out that on busy network servers, the firewall can reach a point where it will merrily drop packets[6]: the connection-tracking table fills up beyond its default cap.
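A small sketch of the kind of check that would have caught this earlier: compare the conntrack table’s occupancy against its cap via `/proc`. The paths below cover both the older `ip_conntrack` and the newer `nf_conntrack` sysctl locations; the helper returns `None` on machines where neither exists:

```python
import os

COUNT_PATHS = (
    "/proc/sys/net/netfilter/nf_conntrack_count",
    "/proc/sys/net/ipv4/netfilter/ip_conntrack_count",
)
MAX_PATHS = (
    "/proc/sys/net/netfilter/nf_conntrack_max",
    "/proc/sys/net/ipv4/netfilter/ip_conntrack_max",
)

def _read_first(paths):
    """Return the integer in the first path that exists, else None."""
    for p in paths:
        if os.path.exists(p):
            with open(p) as f:
                return int(f.read())
    return None

def conntrack_usage():
    """Fraction of the conntrack table in use, or None if unavailable.
    Values approaching 1.0 mean iptables is about to drop packets."""
    count, cap = _read_first(COUNT_PATHS), _read_first(MAX_PATHS)
    if count is None or cap is None:
        return None
    return count / cap
```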
We were so fixated on identifying problems in the upper layers, that we totally forgot about the operating system. It usually IS your code, but don’t take anything for granted when identifying performance problems. Low-level components, like network and file system play a big part in performance tests.
Performance numbers: Hard Mode
How do you determine when you hit the limits of your system? More than that, how do you identify the actual problems starting from the initial results? Simply pushing on until something crashes or stops won’t do it. What’s worse, it won’t tell you anything about whether you are hitting the hardware sweet spot, or whether you are spending money in the right place.
In order to identify the limits or bottlenecks you need a way to look inside the system – something like an x-ray. For code there are profilers, instrumentation and log files. Linux and derivatives have plenty of tools that you can use to get information on how CPU, memory and I/O cycles (both HDD and network) are spent. We use an ad-hoc combination of top, htop, iostat, dstat, netstat, iptraf and sar. Most of these are frontends to `/proc`, and functionality is duplicated between them. We look at different things: overall network bandwidth used, CPU %, number of page faults, average I/O queue lengths, etc.
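As an illustration of what these tools derive from `/proc`, here is a sketch that reads the aggregate CPU line from `/proc/stat` and computes the user / kernel / idle split (the same breakdown top shows). It returns `None` on non-Linux systems, and for simplicity ignores the iowait / irq fields:

```python
import os

def cpu_split():
    """User / system / idle shares of CPU time since boot, from the
    first line of /proc/stat (Linux only; simplified sketch)."""
    if not os.path.exists("/proc/stat"):
        return None
    with open("/proc/stat") as f:
        fields = f.readline().split()  # "cpu  user nice system idle ..."
    user, nice, system, idle = (int(x) for x in fields[1:5])
    total = user + nice + system + idle
    return {"user": (user + nice) / total,
            "system": system / total,
            "idle": idle / total}
```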
Can you give me an example?
Here are some interesting statistics obtained by real-time monitoring of our cluster when doing load tests:
- During writes, the most relevant statistic is network bandwidth: 600 Mbps – 800 Mbps, peaking over 1000 Mbps, with ~670 Mbps of incoming traffic[7]. The iostat service time (`svctm`) rises to a couple of milliseconds, and CPU usage is split in half (50%/50%) between user and kernel space, at about 30% overall (70% idle).
- Cumulative read/write statistics are about 300-600 Mbps, peaking at 900 Mbps.
- During read tests, things are different: network bandwidth utilization is about 10 times lower (60 Mbps), since a row is small and there is no region reassignment churn. The iostat await time is about 7ms, the average queue size is about 1, and again, the CPU is mostly idle.
Locks, or rather sub-optimal locks in the code, are an “invisible resource”. Utilization patterns where the CPU is low, network bandwidth is low, and the disks can keep up with the requests indicate some type of resource or lock contention in the system.
Another problem we identified using iostat was that one hard drive had twice the number of requests and twice the wait time. It was due to a bug in our kickstart[8] bootstrap scripts, which, instead of setting up one partition on each drive, created two partitions on the next-to-last one. Performance testing not only gives you a baseline; it also helps validate the configuration of your system.
Map-Reduce performance, and implicitly scan performance, is very good. For testing, we just run the map-reduce job itself; we have enough data and machines that we don’t think we need another test harness. We simply run the map-reduce jobs and compute statistics at the end.
Our real-world jobs crunch through a large table with ~530M large rows[9]. With map/reduce, we crunch through this data at about ~50,000 req/s, and we are able to process the entire table for statistics in about 2 hours and 40 minutes. When processing smaller rows (like reading all the rows in the 3-billion-record test table), we see a throughput of 200,000 req/s (peaking at 400,000 req/s).
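As a quick back-of-the-envelope check, the full-table time and the quoted rate are consistent with each other (taking “about 2h40” literally):

```python
rows = 530_000_000              # ~530M large rows in the table
duration_s = 2 * 3600 + 40 * 60  # "about 2 hours and 40 minutes"
rate = rows / duration_s
print(round(rate))  # 55208 rows/s, on the order of the quoted ~50,000 req/s
```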
Tight Hadoop integration is one of HBase’s core features, and we want to see it translated into great Map/Reduce performance. The most important statistic here is CPU usage, which is high (60%-70%) with all cores used, followed by network bandwidth utilization[10].
We are very careful about what we spend our money on. Under-dimensioned hardware is a performance killer, but over-dimensioned hardware increases our costs. There’s no point in sticking 20 hard drives in a server if the CPU / mainboard can’t handle them. In the same vein, huge processing power and memory are useless when your network communication is strangled by a 100Mbps network board, or even worse, by some other poorly dimensioned equipment down the road (core switch, etc.). All the components of a machine (and, by extension, of the cluster) have to be sized correctly.
Our configuration uses 24 hard disks per server. Increasing the number of spindles significantly improves I/O by parallelization. But there’s always a push / pull relation between failover, performance and costs.
With lots of nodes, processing power (MapReduce) is higher, but you consume more power and network traffic in the data center (which translates into bigger costs). On the other hand, with a smaller number of beefier servers, power utilization is lower, but you run the risk of heavy churn during failover (when a machine dies)[11], because a lot of data needs to be re-balanced in HDFS.
The point is that we’re careful about costs. We try to be thorough, get an in-depth look on how our hardware is performing, and decide whether we need to upscale or downscale our components for optimal usage.
Therefore in our production cluster, we decided to go with 16 2.5″ disks instead of 24, trading spindles for less power usage. 16 disks are enough for our current usage, and we can always add more when we run out of space.
Hammers, anvils and other assorted sundries
We use a bunch of great tools that really made life easier for us:
The main performance tool that we use is Grinder. Grinder lets you write your tests in Jython and plug in whatever you want to test (DNS speed or weird protocols? No problem); HTTP support is built in. It has a console that lets you orchestrate your tests across multiple injector nodes. Grinder has a steep learning curve, but it’s worth it, and we highly recommend it.
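For a feel of what a Grinder script looks like, here is a rough skeleton in the Grinder 3 style (this is Jython and only runs inside a Grinder agent process, where `net.grinder.script` is on the classpath; the test body itself is hypothetical):

```python
# Grinder 3 scripts are Jython classes; the agent instantiates TestRunner
# once per worker thread and calls it once per run.
from net.grinder.script import Test
from net.grinder.script.Grinder import grinder

get_test = Test(1, "HBase get")  # numbered test, reported in the console

class TestRunner:
    def __init__(self):
        # Wrap the function so Grinder records timing statistics for it.
        self.timed_get = get_test.wrap(self.do_get)

    def do_get(self):
        pass  # hypothetical: issue one get against the service under test

    def __call__(self):
        self.timed_get()
```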
I already mentioned the system monitoring tools that we use. When the performance tests are finished, we use R to load the data sets. R is great because it allows an incremental approach to data analysis: it’s really fast, and you can slice and dice the data, look at the distributions, eliminate outliers, etc. Also highly recommended.
More specific details on how we actually use everything, with recipes, in a later post. Also, now that YCSB just came out as open-source, we’ll give that a go as well.
We’re happy with our current results. We know that there are a lot of small things we need to do that will improve performance. Our current focus is to shape the whole load test methodology so that we can decide on the best cost / performance hardware configuration.
1. Because of the columnar storage, tables grow in both dimensions, like an Excel sheet. ↩
2. The initial 3 billion rows in the table were added with a map/reduce job. ↩
3. The throughput in the table is computed using the formula `(1000 / mean request time in ms) * number of threads`. ↩
4. What is the smallest number bigger than 90% of my (ordered) data set? ↩
5. The growth correlation holds; we only show some results here. We have more, but it’s pretty much more of the same. ↩
6. You can easily increase the maximum number of tracked connections, but be aware that each tracked connection eats about 350 bytes of non-swappable kernel memory! ↩
7. These should be taken with a grain of salt: we have 1Gbps network equipment, so theoretically we should hit a practical limit at about 800 Mbps. The “over 1000Mbps” results are probably a glitch in iptraf. ↩
8. We have a script that generates kickstart files according to machine configuration. The kickstart files install the base CentOS image and set up the partitions using the raid card command line tool. ↩
9. The rows represent photo metadata, and most of them have EXIF information, including EXIF comments, etc. ↩
10. HBase does not yet have data locality, but after running the cluster for a while, due to compactions, most regions will be local to the regionserver. ↩
11. When a machine in an HBase cluster dies, the master reassigns the regions that were served by that machine elsewhere. All the machines run both a regionserver and an HDFS datanode, which also gets replicated when a machine dies. ↩