
I've been trying to find the "right" technology for a 360-degree customer application. It requires:

  1. A wide-column table, where each customer is one row with many columns (say > 1,000).
  2. ~20 batch analytics jobs running daily. Each job queries and updates a small set of columns, across all rows. This includes aggregating the data for reporting and loading/saving the data for machine learning algorithms.
  3. Updates to customers' info in several columns, affecting <= 1 million rows per day, spread out across working hours. The table has more than 200 million rows.

I have tried using HBase, and points 1 and 3 are met. But I found that doing analytics (load/save/aggregate) on HBase is painfully slow; it can be 10x slower than doing the same with Parquet. I don't understand why: both Parquet and HBase are columnar DBs, and we have spread the workload across the HBase cluster quite well ("requests per region" says so).

Any advice? Am I using the wrong tool for the job?

Tung Vs

1 Answer


both Parquet and HBase are columnar DBs

This assumption is wrong:

  • Parquet is not a database.
  • HBase is not a columnar database. It is frequently regarded as one, but this is wrong: the underlying HFile format is not column-oriented (Parquet is).

HBase is painfully slow; it can be 10x slower than doing the same with Parquet

An HBase full scan is generally much slower than the equivalent raw HDFS file scan, because HBase is optimized for random access patterns. You didn't specify exactly how you scanned the table: TableSnapshotInputFormat is much faster than the naive TableInputFormat, yet still slower than a raw HDFS file scan.
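For reference, here is a minimal sketch of a snapshot-based scan from Spark (Scala). The snapshot name `customers_snapshot`, the restore directory, and the column family/qualifier are placeholders, not names from your setup; the snapshot itself must already exist (e.g. created with `snapshot 'customers', 'customers_snapshot'` in the HBase shell):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

// Point the input format at an existing snapshot; the scan then reads
// the HFiles directly from HDFS instead of going through region servers.
val job = Job.getInstance(HBaseConfiguration.create())
TableSnapshotInputFormat.setInput(job, "customers_snapshot", new Path("/tmp/snapshot_restore"))

val sc = new SparkContext(new SparkConf().setAppName("snapshot-scan"))
val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Example: pull one column out of each row (family "cf" and qualifier
// "col" are placeholders; assumes the cell is present in every row).
val values = rdd.map { case (_, result) =>
  Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
}
```

Because the snapshot scan bypasses the region servers' read path, it avoids most of the per-request overhead of scanning a live table.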

shay__
  • We are using Phoenix on HBase and a Spark SQL DataFrame to read / aggregate / write to HBase. The version is HDP 2.5.0, with HBase 1.1.2, Phoenix 4.7 and Spark 1.6 – Tung Vs Jul 17 '18 at 09:29
  • @TungVs Last time I checked, all HBase connectors for Spark were using `TableInputFormat` by default (see the sketch after these comments). – shay__ Jul 17 '18 at 09:44
  • If we need Spark to connect to HBase, is there anything we can do to improve performance, or is HBase just not the right tool for the job? Is there a "standard" way / architecture to solve our problem that I'm not aware of? Recently I have considered in-memory solutions like Ignite, but it's a row-based DB, so caching just the few columns we need isn't possible (and most likely we wouldn't have enough RAM to cache the whole table). Another option is Kudu, but its big turn-off is the 300-column limitation; we have more than 1,000. – Tung Vs Jul 17 '18 at 14:42
  • It seems these questions won't get any further answers here. Your answer is sufficient to clarify the original question (why an HBase scan is slow), so I will accept it and fork another thread for the follow-up questions. Thank you for your time. – Tung Vs Jul 22 '18 at 07:24
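For contrast with the snapshot approach above, here is a minimal sketch of the naive `TableInputFormat` scan that connectors default to (the table name and column family are again placeholders). Restricting the scan to the one family a job actually needs at least limits how much each region server reads, but every row still flows through the region servers' read path, which is what makes this route slow for full-table analytics:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "customers")   // placeholder table name
conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "d")    // read only the family this job needs

// A live-table scan: each split is served by a region server.
val sc = new SparkContext(new SparkConf().setAppName("naive-scan"))
val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
```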