I've been trying to pick the "right" technology for a 360-degree customer application. It requires:
- A wide-column table where each customer is one row, with a lot of columns (say > 1,000).
- ~20 batch analytics jobs running daily. Each job queries and updates a small set of columns across all rows; this includes aggregating data for reporting and loading/saving data for machine-learning algorithms.
- Online updates to customers' info in several columns, touching <= 1 million rows per day, spread out across working hours. The table has more than 200 million rows. (A simplified sketch of this update path is below.)
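
For reference, here is a minimal sketch of the online update path, assuming the standard HBase Java client; the table name, column family, and qualifiers are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer_360"))) {
            // Row key is the customer ID; each update touches only a few of the >1,000 columns.
            Put put = new Put(Bytes.toBytes("customer-0001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                    Bytes.toBytes("alice@example.com"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("last_login"),
                    Bytes.toBytes("2024-05-01T09:30:00Z"));
            table.put(put);
        }
    }
}
```

This part works well for us: point writes keyed by customer ID are fast and the load spreads evenly over the day.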
I have tried HBase, and points 1 and 3 are met. But doing analytics (load/save/aggregate) on HBase is painfully slow; it can be 10x slower than doing the same with Parquet. I don't understand why: I thought both Parquet and HBase store data in a columnar layout, and we have spread the workload across the HBase cluster quite evenly (the "requests per region" metric says so).
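
The read pattern each analytics job uses looks roughly like this, again a sketch with illustrative family/qualifier names, assuming the plain HBase Scan API: a full-table scan restricted to the handful of columns the job needs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DailySpendAggregate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer_360"))) {
            // Full-table scan over 200M+ rows, restricted to the two columns this job needs.
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("daily_spend"));
            scan.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("segment"));
            scan.setCaching(1000); // fetch 1,000 rows per RPC to cut round trips

            double total = 0;
            long rows = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    byte[] spend = r.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("daily_spend"));
                    if (spend != null) {
                        total += Double.parseDouble(Bytes.toString(spend));
                        rows++;
                    }
                }
            }
            System.out.printf("rows=%d total_spend=%.2f%n", rows, total);
        }
    }
}
```

Even with the column restriction and scanner caching, these full scans are the part that is an order of magnitude slower than reading the same columns from Parquet.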
Any advice? Am I using the wrong tool for the job?