
I'm using HBase to store some time series data. Following the suggestion in the O'Reilly HBase book, I am using a row key composed of a salt prefix followed by the data's timestamp. To query this data I spawn multiple threads, each of which scans a range of timestamps and handles one particular salt prefix. The results are then placed into a ConcurrentHashMap.
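For reference, each thread runs roughly like this (a simplified sketch of my code; the class name, the key-building helpers, and the map field are placeholders for my actual implementation):

    // Simplified sketch of one scan thread; names here are illustrative.
    import java.io.IOException;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixScanTask implements Runnable {
        private final HTable table;          // one HTable instance per thread
        private final byte salt;             // the prefix this thread handles
        private final long startTs, stopTs;  // timestamp range to scan
        private final ConcurrentHashMap<String, Result> results;

        public PrefixScanTask(HTable table, byte salt, long startTs, long stopTs,
                              ConcurrentHashMap<String, Result> results) {
            this.table = table;
            this.salt = salt;
            this.startTs = startTs;
            this.stopTs = stopTs;
            this.results = results;
        }

        public void run() {
            try {
                // row key = 1-byte salt prefix + 8-byte timestamp
                Scan scan = new Scan(
                        Bytes.add(new byte[] { salt }, Bytes.toBytes(startTs)),
                        Bytes.add(new byte[] { salt }, Bytes.toBytes(stopTs)));
                ResultScanner rowScanner = table.getScanner(scan);
                try {
                    for (Result res : rowScanner) {
                        results.put(Bytes.toString(res.getRow()), res);
                    }
                } finally {
                    rowScanner.close();
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }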

Trouble occurs when the threads attempt to perform their scans. A query that normally takes approximately 5600 ms when done serially takes between 40000 and 80000 ms when 6 threads are spawned (corresponding to 6 salts/region servers).

I've tried using HTablePool to get around what I thought was an issue with HTable not being thread-safe, but this did not result in any better performance.

In particular, I am noticing a significant slowdown when I hit this portion of my code:

    for (Result res : rowScanner) {
        // add each Result to the shared ConcurrentHashMap
    }

Through logging I noticed that every time through the loop's condition I experienced delays of many seconds. These delays do not occur if I force the threads to execute serially.

I assume there is some kind of resource-locking issue, but I just can't see it.

shadonar
  • I also noticed that I'm using OpenJDK and not the Oracle JDK that came with my HBase software package (Cloudera). I'm not sure if this is an issue, but I've heard there might be threading problems between different JDKs. – shadonar Jan 19 '12 at 22:23
  • I am new to HBase myself, but what are the Xms/Xmx settings for the HBase server's JVM? It could be that HBase is forced to run too many GC cycles, which slow things down. – Behrang Jan 20 '12 at 08:08
  • I realized my HBase setup was only storing this data in one region server. It apparently had a block size too large for the amount of data I had in HBase. I changed the block size and put more data in, just to make sure there were multiple region servers and plenty of data. This resulted in threads returning faster than the previous sequential runs, though not by a factor of the number of threads: with 6 threads running, it returned in around 3400 ms. Thank you all for your help; these were all great suggestions and I will continue to use them to improve my performance. – shadonar Jan 25 '12 at 19:26

2 Answers

5

Make sure that you are setting the BatchSize and Caching on your Scan objects (the object that you use to create the Scanner). These control how many columns are returned per call (batching) and how many rows are transferred over the network at once (caching). By default they are both way too low to be efficient. BatchSize in particular will dramatically increase your performance.
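For example (a minimal sketch; table, startRow, and stopRow stand in for your own objects, and the numbers are just starting points to tune):

    // Minimal sketch: configure caching and batching before opening the scanner.
    Scan scan = new Scan(startRow, stopRow);
    scan.setCaching(200);  // rows fetched per RPC round trip to the RegionServer
    scan.setBatch(100);    // max columns returned per Result (matters for wide rows)
    ResultScanner rowScanner = table.getScanner(scan);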

EDIT: Based on the comments, it sounds like you might be swapping either on the server or on the client, or that the RegionServer may not have enough space in the BlockCache to satisfy your scanners. How much heap have you given to the RegionServer? Have you checked to see whether it is swapping? See "How to find out which processes are swapping in linux?".

Also, you may want to reduce the number of parallel scans, and make each scanner read more rows. I have found that on my cluster, parallel scanning gives me almost no improvement over serial scanning, because I am network-bound. If you are maxing out your network, parallel scanning will actually make things worse.

Chris Shain
  • I attempted to use the BatchSize and it actually resulted in losing some of the data I was querying for. I noticed very minimal changes when changing the Caching on the Scan objects. I'll try them again and see if there was just something I was missing. Just tested again with a changed BatchSize: I didn't lose any data, but it took about 99k ms. Not exactly a performance increase. – shadonar Jan 19 '12 at 22:12
  • Don't set them too high- 100 or so should do. If you set them too high, your scanners may time out on the regionserver while your client is processing the current batch. – Chris Shain Jan 19 '12 at 22:13
  • I set the BatchSize to 1000 and Caching to 200 and those were the results I got. I'm testing again with BatchSize set to 100 or so to see if that helps. Results: cache @ 200 with batch disabled = 79k ms; cache @ 200 with batch @ 1000 = 99.9k ms; cache @ 200 with batch @ 100 = 82.6k ms. – shadonar Jan 19 '12 at 22:19
  • This doesn't appear to help; I keep getting apparently random return times. I commented out the BatchSize and ran with those same cache values and others to see if there's a pattern (like increasing the cache decreasing my time), but nothing was remotely the same the 2nd, 3rd, or even 4th time through. – shadonar Jan 19 '12 at 22:34
1

Have you considered using MapReduce, with perhaps just a mapper, to easily split your scan across the region servers? It's easier than worrying about threading and synchronization in the HBase client libs. The Result class is not thread-safe. TableMapReduceUtil makes it easy to set up jobs.
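Something along these lines (a rough sketch; the table name, mapper body, and output types are illustrative):

    // Rough sketch of a mapper-only scan job; names and types are illustrative.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class TimeSeriesScanJob {

        static class ScanMapper extends TableMapper<ImmutableBytesWritable, Text> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context context)
                    throws IOException, InterruptedException {
                // process one row; the framework splits the scan by region for you
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "time-series-scan");
            job.setJarByClass(TimeSeriesScanJob.class);

            Scan scan = new Scan();
            scan.setCaching(200);        // larger scanner caching helps MR scans too
            scan.setCacheBlocks(false);  // don't churn the BlockCache from a batch scan

            TableMapReduceUtil.initTableMapperJob(
                    "mytable", scan, ScanMapper.class,
                    ImmutableBytesWritable.class, Text.class, job);
            job.setNumReduceTasks(0);    // mapper-only, as suggested above
            job.waitForCompletion(true);
        }
    }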

MattMcKnight