Hive over HBase for deep analytical query

Question

I am able to connect and access an existing HBase table with Hive (using Hive HBase Storage Handler).

I think the interface is not much powerful. Can this interface be used for large analytical data processing?

You may be interested in [What is Hive](https://stackoverflow.com/questions/20030436/what-is-hive-is-it-a-database)? Hive just acts as a kind of (job) compiler. This means Hive will translate your SQL like statements (filter or join) into native code for the Hadoop framework. To do so, it compiles a job for the cluster, where it is sent to and become executed. It is great for splitting up workloads into smaller pieces and distribute the work over a lot of nodes. — U880D, May 24 '18 at 14:43

score 2 · Answer 1 · answered May 24 '18 at 16:09

2

No It can't. Any WHERE clause ends up as a full SCAN in HBase table and scans are extremely slow. Please check https://phoenix.apache.org/ as an alternative.

answered May 24 '18 at 16:09

Harold

455
4
10

i am also using apache phoenix. but i am using it for oltp applications. but for analytic purpose i think it's performance will based on the rowkey & it can affect the oltp applications. – Rahul May 29 '18 at 12:08

score 1 · Answer 2 · answered May 24 '18 at 16:20

1

Apache Phoenix is more applicable for querying HBase. you can also query HBase using Hive, then your query will get converted in Map Reduce Job which will take more time then Phoenix.

PS : You can use Hive for Big Data analytics even if you are using Hbase.

answered May 24 '18 at 16:20

Gaurav Varshney

502
5
15

1

only for oltp applications i am also using apache phoenix. i want both oltp and olap. that's why i am trying hive over existing hbase for olap. – Rahul May 29 '18 at 12:03

Chema · Answer 3 · 2020-07-21T20:59:07.043

A good solution to do Analytic queries over HBase faster is combine HBase with Hive and Impala.

As an example of this would be the following scenario:

I have a Kafka producer receiving thousands of signals from IoT devices from a socket in json format. I am processing this signals with a consumer in Spark streaming and putting these signals in a HBase table.

HBase table and data example

$ hbase shell
hbase> create_namespace 'device_iot'
hbase> create 'device_iot:device', 'data'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:deviceID', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23c'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:temperature', '12'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:latitude', '52.14691120000001'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:longitude', '11.658838699999933'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:time', '2019-08-14T23:30:30000'

Hive table on top of HBase table

CREATE EXTERNAL TABLE t_iot_devices (
id string, deviceID string, temperature int, latitude double, longitude double, time string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:deviceID,data:temperature,data:latitude,data:longitude,data:time") 
TBLPROPERTIES("hbase.table.name" = "device_iot:device");

Querying in Impala

impala> invalidate metadata;
SELECT deviceID, max(temperature) AS maxTemperature 
FROM t_iot_devices 
GROUP BY deviceID;

+--------------------------------------+----------------+
| deviceid                             | maxtemperature |
+--------------------------------------+----------------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 39             |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 39             |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 39             |
+--------------------------------------+----------------+

SELECT deviceID, substr(time,1,10) AS day, max(temperature) AS highest 
FROM t_iot_devices 
WHERE  substr(time,1,10) = '2019-07-07' 
GROUP BY deviceID, substr(time,1,10);

+--------------------------------------+------------+---------+
| deviceid                             | day        | highest |
+--------------------------------------+------------+---------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 2019-07-07 | 34      |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 2019-07-07 | 35      |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 2019-07-07 | 22      |
+--------------------------------------+------------+---------+

Hive over HBase for deep analytical query

3 Answers3