Which is the best HBase connector to use for batch loading data into HBase from Spark?

Question

As also mentioned in "Which HBase connector for Spark 2.0 should I use?", there are two main options:

RDD based https://github.com/apache/hbase/tree/master/hbase-spark
DataFrame based https://github.com/hortonworks-spark/shc

I do understand the optimizations and the differences with regard to READING from HBase.

However, it's not clear to me which one I should use for BATCH inserting into HBase.

I am not interested in writing records one by one, but in high throughput.

After digging through the code, it seems that both resort to TableOutputFormat; see http://hbase.apache.org/1.2/book.html#arch.bulk.load

The project uses Scala 2.11, Spark 2, HBase 1.2

Does the DataFrame library provide any performance improvements over the RDD library specifically for BULK LOAD?

The RDD code can easily be rewritten to a Dataset based API. I highly doubt that affects the throughput. Just different libraries doing similar things in a different way – OneCricketeer, Nov 08 '17 at 14:44
For the RDD code there is an example of BulkPut. The client code aggregates PUT requests and sends them in one batch to the HBase server. However, for the DF code, it's not clear how the batching works. There is no clear example showing the difference between batch insertion and inserting a single element. – cipri.l, Nov 08 '17 at 14:52

score 4 · Answer 1 · answered Aug 31 '19 at 01:46

Recently, the hbase-spark connector was released to a new Maven Central location as version 1.0.0. It supports Spark 2.4.0 and Scala 2.11.12.

  <dependency>
    <groupId>org.apache.hbase.connectors.spark</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>1.0.0</version>
  </dependency>

This supports both RDDs and DataFrames. Please refer to spark-hbase-connectors for more details.
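For the RDD side, a batch write with this connector might look roughly like the sketch below. It assumes the hbase-spark dependency above is on the classpath and uses the connector's HBaseContext.bulkPut, which takes an RDD, a table name, and a function turning each record into a Put. The table name "events", the column family "cf", and the qualifier "value" are hypothetical, and the table is assumed to exist already.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object BulkPutSketch {
  // Hypothetical table layout, not taken from the question.
  val tableName: TableName = TableName.valueOf("events")
  val family: Array[Byte] = Bytes.toBytes("cf")
  val qualifier: Array[Byte] = Bytes.toBytes("value")

  // Converts one (rowKey, value) record into an HBase Put.
  def toPut(record: (String, String)): Put = {
    val (rowKey, value) = record
    new Put(Bytes.toBytes(rowKey)).addColumn(family, qualifier, Bytes.toBytes(value))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bulk-put-sketch").getOrCreate()

    // HBaseConfiguration.create() reads hbase-site.xml from the classpath.
    val hbaseContext = new HBaseContext(spark.sparkContext, HBaseConfiguration.create())

    val records = spark.sparkContext.parallelize(Seq("row1" -> "a", "row2" -> "b"))

    // Puts are sent to HBase from the executors, partition by partition.
    hbaseContext.bulkPut[(String, String)](records, tableName, toPut)

    spark.stop()
  }
}
```

Keeping the record-to-Put conversion in its own function makes it easy to check without a cluster, and the same function could be reused if the input is later read as a Dataset and converted with `.rdd`.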

Happy Learning !!

Sachin Thapa · Answer 2 · 2017-11-09T15:48:50.507

2

Have you looked at the bulk load examples in the HBase project?

See HBase Bulk Examples; the GitHub page has Java examples, which you can easily port to Scala.

Also read "Apache Spark Comes to Apache HBase with HBase-Spark Module".

Given a choice between RDD and DataFrame, we should use DataFrame, as recommended in the official documentation.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
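For comparison, a DataFrame write through the SHC connector mentioned in the question might look like the sketch below. It follows the catalog-based pattern from the SHC README and assumes the SHC artifact is on the classpath. The table name "events", the column family "cf", and the column names are hypothetical. Whether this path is faster than an RDD bulkPut for pure inserts is not something the sketch demonstrates.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object DataFrameWriteSketch {
  // Catalog mapping DataFrame columns to the HBase row key and cells.
  // All names here are hypothetical.
  val catalog: String =
    """{
      |  "table": {"namespace": "default", "name": "events"},
      |  "rowkey": "key",
      |  "columns": {
      |    "key":   {"cf": "rowkey", "col": "key",   "type": "string"},
      |    "value": {"cf": "cf",     "col": "value", "type": "string"}
      |  }
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-write-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("row1" -> "a", "row2" -> "b").toDF("key", "value")

    df.write
      .options(Map(
        HBaseTableCatalog.tableCatalog -> catalog,
        // Asks SHC to create the table with 5 regions if it does not exist.
        HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()

    spark.stop()
  }
}
```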

Hoping this helps.

Cheers !

edited Nov 09 '17 at 15:48

answered Nov 08 '17 at 18:39

Sachin Thapa


Thanks for your reply. I am aware of the bulk load examples for the RDD library. However, I don't know if there is anything similar for the DataFrame library. Does the DF library provide any performance improvements over the RDD library specifically for BULK LOAD? – cipri.l Nov 09 '17 at 07:07
You should use DataFrame; it is more optimized and is recommended from Spark 2.x onward. – Sachin Thapa Nov 09 '17 at 15:49
DataFrame is more optimized for processing data. I need only bulk insert, no data processing. I am not interested in querying data from HBase. The optimizations you mention relate to processing done by Spark. – cipri.l Nov 10 '17 at 10:11
A Dataset can be used instead of a DataFrame, because a DataFrame is not type safe: mistakes made during development are only found at runtime, with no compile-time error. For a developer, the type-safe Dataset is more appropriate. – Anup Ghosh Feb 05 '19 at 07:27
