
Below is my scenario:

  1. Initial load of data into HBase using Sqoop (this is done).
  2. From now on, I will receive daily batches of data (around 600,000 records), which are a mix of new data (records to insert into HBase) and old data (updates to existing HBase records). My question is:

How can I perform this upsert operation on the HBase table using Spark/Scala?

Your early reply would be highly appreciated.

Thanks, Souvik


1 Answer


I would advise you to read the answers to this question to get an overview.

In my answer there, I mention several options that you can use.

Since you are using Spark 1.6.1, you can use any of them. An example of working with DataFrames in hbase-spark can be found here, while a similar example for Spark-on-HBase can be found here.
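Whichever connector you pick, the key point for your scenario is that HBase `Put` overwrites the latest version of a cell for a given row key, so inserts and updates go through the same code path and no separate update logic is needed. Below is a minimal sketch using the plain HBase 1.x client API from Spark; the `Record` type, table name, column family `cf`, and column names are assumptions, not something from your schema:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import scala.collection.JavaConverters._

// Hypothetical record type for the daily batch.
case class Record(id: String, name: String, amount: String)

// Upsert a daily batch into an existing HBase table. Because Put
// overwrites existing cells for the same row key, new and old records
// are handled identically.
def upsertBatch(batch: RDD[Record], tableName: String): Unit = {
  batch.foreachPartition { records =>
    // One connection per partition; connections are not serializable,
    // so they must be created on the executors, not the driver.
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf(tableName))
    try {
      // Group Puts to reduce RPC round trips for the ~600k daily rows.
      records.grouped(1000).foreach { group =>
        val puts = group.map { r =>
          val put = new Put(Bytes.toBytes(r.id)) // row key = record id
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
                        Bytes.toBytes(r.name))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"),
                        Bytes.toBytes(r.amount))
          put
        }
        table.put(puts.toList.asJava)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}
```

The DataFrame-based connectors linked above do essentially the same thing under the hood while letting you express the write declaratively; for 600,000 rows per day, either approach is comfortably within range.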

Anton Okolnychyi
  • Hi Anton: If I use the Hive-on-HBase package (yum install hive-hbase) for bulk insert/update operations, which API will give better performance? I can execute this command through Spark itself. – Souvik Dec 26 '16 at 06:48