
My client is using org.apache.hbase:hbase-client:2.1.0 and the server is running 1.2.0-cdh5.11.1 (1.2.0-cdh5.11.0 in an alternative test cluster).

My client is very simple: it instantiates a single Connection at startup (as recommended by Apache, this instance is shared across threads since it is heavyweight and thread-safe). Then, for each request, it obtains a Table and performs a .exists(new Get(rowKey)) operation.

So like this:

Connection conn = ConnectionFactory.createConnection(hbaseConfig);

and

final Table table = conn.getTable(tableName);
return table.exists(new Get(context.getKey()));
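
Put together, the per-request part is roughly this shape (a simplified sketch; the try-with-resources just makes the Table cleanup explicit, and tableName / rowKey stand in for our real values):

// Table instances are lightweight but not thread-safe, so one is obtained
// per request from the shared Connection and closed when the call is done.
try (Table table = conn.getTable(tableName)) {
    return table.exists(new Get(rowKey));
}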

Most of the time the round-trip latency to HBase is 40ms at worst, and usually around 10ms.

However, we're noticing that occasional requests take around 5000ms (5s) - yet still complete successfully!

By occasional I mean around 1 request per minute (out of roughly 600 per minute in total, so a small rate), but it's steady.

These requests take almost exactly 5s (+/- 100-200ms). That's the odd part: it's not a random spike.

At first I suspected a client misconfiguration and that I needed to set stricter timeouts, so I set the following:

hbaseConfig.set(HConstants.HBASE_CLIENT_RETRIES_NUMBER, "1");
hbaseConfig.set(HConstants.HBASE_CLIENT_PAUSE, "50");
hbaseConfig.set(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, "2000");
hbaseConfig.set(HConstants.HBASE_RPC_TIMEOUT_KEY, "1500");
hbaseConfig.set(HConstants.HBASE_RPC_SHORTOPERATION_TIMEOUT_KEY, "2000");
hbaseConfig.set(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, "1500");
hbaseConfig.set(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME, "2000");
hbaseConfig.set(HConstants.ZK_SESSION_TIMEOUT, "2000");
hbaseConfig.set("zookeeper.recovery.retry", "1");
hbaseConfig.set("zookeeper.recovery.retry.intervalmill","200");
hbaseConfig.set("hbase.ipc.client.socket.timeout.connect", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.read", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.write", "2000");

In other words, 5000ms is way over the global timeout (as set in HConstants.HBASE_CLIENT_OPERATION_TIMEOUT).

Yet I have requests that take ~5s to complete - and do so successfully.

In addition to setting these timeouts, I switched from AsyncConnection to Connection (I didn't need it to be async anyway), and I'm thinking about just making Get calls instead of exists.

But at this point I'm stumped. I can't find any property that would explain where the 5s delay is coming from. It's not even a timeout: the request actually succeeds!

Has anyone encountered this before? Is there any way to get hbase-client to emit metrics? Server-side metrics (scan metrics) show no increase in latency.

encee
  • I've encountered similar things, due to many factors (GC, auto-flush, etc.). You will have to monitor the server to get an idea of what's going on, or at least look at the logs. I strongly recommend setting up a monitoring dashboard. – shay__ Feb 14 '19 at 16:13
  • We too are facing the same exact issue. Please let us know if you manage to solve it. – Srijit Jun 28 '19 at 10:02
  • We have dashboards (including GC, etc.). I've even looked at detailed GC logs (I turned on verbose logging) and the 5s spikes didn't line up with any GC pauses. I still haven't solved it. I just cheated by having the client that calls the server time out after 1s and retry twice. Requests that would've failed succeed now (just a bit slower), as most requests don't hit the 5s delay. – encee Jul 01 '19 at 17:37
  • What does a thread dump during this window say? How about detailed logs for the request? Is there any ZK activity in this window? – Boris the Spider Jul 02 '19 at 05:04
  • There is only minimal ZooKeeper activity during this time. Initially we suspected that heavy writes and rebalancing going on at the same time were causing the issue, but we tried even when no writes were happening. Also, this is very sporadic; it happens once in every 200 requests or so. – Srijit Jul 02 '19 at 05:30
  • How does the client measure the operation time? Is the connection created on every request? Perhaps the delay is not in the operation itself. – LMC Jul 02 '19 at 14:46
  • From the client we initiate a Scan(). Looking at the logs, we see the wait happening right before the scan. And again, it happens only once in 200-300 requests. We are using one Connection object for all the calls. – Srijit Jul 02 '19 at 14:49
  • If that connection object is not a pool you might have thread contention, i.e. threads waiting for the connection to be released. Another option is that it could be some kind of connection renegotiation. Just thinking out loud :) – LMC Jul 02 '19 at 15:56
  • http://mail-archives.apache.org/mod_mbox/hbase-issues/201112.mbox/%3C1289807203.52709.1325198913502.JavaMail.tomcat@hel.zones.apache.org%3E – Andrii Muzalevskyi Jul 03 '19 at 17:34
  • If possible, enable the client logs and check the ping time of the connection. If the connection is closed and re-opened for some reason, this can happen. Are you using the same connection for both reads and writes? – Kris Jul 08 '19 at 08:53

1 Answer


For us, the thread dumps showed that the time was spent performing DNS resolution through the IPv6 implementation, and we also found a few reported issues describing a 5-second latency in the native lookup library.

    "hconnection-0x6fc1d215-shared--pool1-t95"
   java.lang.Thread.State: RUNNABLE
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
        at java.net.InetAddress.getAllByName(InetAddress.java:1193)
        at java.net.InetAddress.getAllByName(InetAddress.java:1127)
        at java.net.InetAddress.getByName(InetAddress.java:1077)
        at java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getStubKey(ConnectionManager.java:1802)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getClient(ConnectionManager.java:1772)
        at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:163)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:409)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

https://github.com/netty/netty/issues/6454
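
If you want to check whether DNS is the culprit in your environment, a quick standalone probe that times a bare lookup of one of your region server hostnames makes the stall easy to spot (a rough sketch; "regionserver-1.example.com" is a placeholder):

import java.net.InetAddress;

public class DnsProbe {
    public static void main(String[] args) throws Exception {
        // Time a single name lookup, the same kind the HBase client triggers in getStubKey().
        long start = System.nanoTime();
        InetAddress addr = InetAddress.getByName("regionserver-1.example.com");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Resolved " + addr + " in " + elapsedMs + " ms");
    }
}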

Adding -Djava.net.preferIPv4Stack=true to force the IPv4 stack seems to have fixed this issue for us. We are not seeing the 5-second latency anymore (fingers crossed).
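
For reference, we simply pass it as a JVM argument when launching the client (java -Djava.net.preferIPv4Stack=true ...). Setting it programmatically should also work, as long as it runs before any networking classes are loaded - a minimal sketch:

// Must run before InetAddress (or any other networking class) is first used,
// e.g. as the very first statement in main().
System.setProperty("java.net.preferIPv4Stack", "true");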

ah.narayanan
  • You nailed it. A thread dump on the client side showed that the host lookup is holding up the thread. All this time we were looking at the HBase side. – Srijit Jul 10 '19 at 01:27
  • Unfortunately it didn't quite work for me, and a thread dump (well, multiple of them) didn't show it either. – encee Jul 29 '19 at 17:31
  • Are you still seeing exactly 5 seconds of latency after switching to IPv4-based DNS resolution? Did you check on the server side whether any major compactions were going on? – ah.narayanan Aug 01 '19 at 15:46