I am able to connect and access an existing HBase table with Hive (using Hive HBase Storage Handler).
I think the interface is not much powerful. Can this interface be used for large analytical data processing?
I am able to connect and access an existing HBase table with Hive (using Hive HBase Storage Handler).
I think the interface is not much powerful. Can this interface be used for large analytical data processing?
No It can't. Any WHERE clause ends up as a full SCAN in HBase table and scans are extremely slow. Please check https://phoenix.apache.org/ as an alternative.
Apache Phoenix is more applicable for querying HBase. you can also query HBase using Hive, then your query will get converted in Map Reduce Job which will take more time then Phoenix.
PS : You can use Hive for Big Data analytics even if you are using Hbase.
A good solution to do Analytic queries over HBase
faster is combine HBase
with Hive
and Impala
.
As an example of this would be the following scenario:
I have a Kafka producer
receiving thousands of signals from IoT devices from a socket in json
format. I am processing this signals with a consumer in Spark
streaming and putting these signals in a HBase
table.
HBase table and data example
$ hbase shell
hbase> create_namespace 'device_iot'
hbase> create 'device_iot:device', 'data'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:deviceID', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23c'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:temperature', '12'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:latitude', '52.14691120000001'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:longitude', '11.658838699999933'
hbase> put 'device_iot:device', '11c1310e-c0c2-461b-a4eb-f6bf8da2d23a-1509793235', 'data:time', '2019-08-14T23:30:30000'
Hive
table on top of HBase
table
CREATE EXTERNAL TABLE t_iot_devices (
id string, deviceID string, temperature int, latitude double, longitude double, time string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:deviceID,data:temperature,data:latitude,data:longitude,data:time")
TBLPROPERTIES("hbase.table.name" = "device_iot:device");
Querying in Impala
impala> invalidate metadata;
SELECT deviceID, max(temperature) AS maxTemperature
FROM t_iot_devices
GROUP BY deviceID;
+--------------------------------------+----------------+
| deviceid | maxtemperature |
+--------------------------------------+----------------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 39 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 39 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 39 |
+--------------------------------------+----------------+
SELECT deviceID, substr(time,1,10) AS day, max(temperature) AS highest
FROM t_iot_devices
WHERE substr(time,1,10) = '2019-07-07'
GROUP BY deviceID, substr(time,1,10);
+--------------------------------------+------------+---------+
| deviceid | day | highest |
+--------------------------------------+------------+---------+
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23c | 2019-07-07 | 34 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23b | 2019-07-07 | 35 |
| 11c1310e-c0c2-461b-a4eb-f6bf8da2d23a | 2019-07-07 | 22 |
+--------------------------------------+------------+---------+