My client is using org.apache.hbase:hbase-client:2.1.0 and the server is running 1.2.0-cdh5.11.1 (1.2.0-cdh5.11.0 in an alternative test cluster).
My client is very simple: it instantiates a Connection at startup (as recommended by Apache, this object is shared across threads since it's heavyweight and thread-safe). Then, for each request, it creates a Table instance and does a .exists(new Get(rowKey)) operation.
So like this:
Connection conn = ConnectionFactory.createConnection(hbaseConfig);
and
final Table table = conn.getTable(tableName);
return table.exists(new Get(context.getKey()));
Most of the time the request latency to HBase and back is 40ms at worst. Usually it completes in around 10ms.
However, we're noticing that occasional requests take around 5000ms (5s), yet they still complete successfully!
By occasional I mean around 1 request per minute out of 600 per minute in total, so it's a small rate (roughly 0.17%), but a steady one.
These requests take almost exactly 5s (+/- 100-200ms). That's the odd part: the latency isn't a variable spike, it's consistently about 5s.
At first I suspected a client misconfiguration and thought I needed stricter timeouts, so I set the following:
hbaseConfig.set(HConstants.HBASE_CLIENT_RETRIES_NUMBER, "1");
hbaseConfig.set(HConstants.HBASE_CLIENT_PAUSE, "50");
hbaseConfig.set(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT, "2000");
hbaseConfig.set(HConstants.HBASE_RPC_TIMEOUT_KEY, "1500");
hbaseConfig.set(HConstants.HBASE_RPC_SHORTOPERATION_TIMEOUT_KEY, "2000");
hbaseConfig.set(HConstants.HBASE_CLIENT_SCANNER_TIMEOUT_PERIOD, "1500");
hbaseConfig.set(HConstants.ZOOKEEPER_RECOVERABLE_WAITTIME, "2000");
hbaseConfig.set(HConstants.ZK_SESSION_TIMEOUT, "2000");
hbaseConfig.set("zookeeper.recovery.retry", "1");
hbaseConfig.set("zookeeper.recovery.retry.intervalmill","200");
hbaseConfig.set("hbase.ipc.client.socket.timeout.connect", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.read", "2000");
hbaseConfig.set("hbase.ipc.client.socket.timeout.write", "2000");
In other words, 5000ms is way over the global timeout (as set in HConstants.HBASE_CLIENT_OPERATION_TIMEOUT).
Yet I have requests that take ~5s to complete, and they do so successfully.
In addition to these timeouts, I switched from AsyncConnection to Connection (I didn't need it to be async anyway), and I'm thinking about just making GET calls instead of exists.
But at this point I'm stumped. I can't find any property that would explain where the 5s is coming from. It's not even a timeout, since the request actually succeeds!
Has anyone encountered this before? Is there any way to get hbase-client to emit metrics? Server side metrics show no increase in latency (scan metrics).
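Until hbase-client itself reports something, one option is to time the call in application code. The following is only a minimal sketch, not anything from the HBase API: ClientLatencyProbe, its method names, and the 1000ms threshold are hypothetical, and it uses nothing beyond the JDK. It measures wall-clock time around a call and logs the ones that cross a threshold, together with the thread name, so the ~5s requests can be correlated with other client-side logs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of hbase-client): times calls on the client side. */
final class ClientLatencyProbe {
    private final long slowThresholdMs;
    private final AtomicLong totalCalls = new AtomicLong();
    private final AtomicLong slowCalls = new AtomicLong();

    ClientLatencyProbe(long slowThresholdMs) {
        this.slowThresholdMs = slowThresholdMs;
    }

    /** Runs the call, records its wall-clock duration (even on failure), and returns its result. */
    <T> T timed(String label, Callable<T> call) throws Exception {
        long start = System.nanoTime();
        try {
            return call.call();
        } finally {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            record(label, elapsedMs);
        }
    }

    /** Counts one call and logs it to stderr if it reached the slow threshold. */
    void record(String label, long elapsedMs) {
        totalCalls.incrementAndGet();
        if (elapsedMs >= slowThresholdMs) {
            slowCalls.incrementAndGet();
            System.err.printf("SLOW %s took %d ms on thread %s%n",
                    label, elapsedMs, Thread.currentThread().getName());
        }
    }

    long totalCalls() {
        return totalCalls.get();
    }

    long slowCalls() {
        return slowCalls.get();
    }

    /** Fraction of recorded calls that were slow; 0.0 if nothing was recorded yet. */
    double slowRatio() {
        long total = totalCalls.get();
        return total == 0 ? 0.0 : (double) slowCalls.get() / total;
    }
}
```

In my case the usage would be something like probe.timed("exists", () -> table.exists(new Get(rowKey))) with one shared probe created next to the Connection. At 600 requests per minute with about 1 slow request, slowRatio() should settle around 1/600, and the logged timestamps would show whether the slow calls line up with anything periodic on the client.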