Which HBase connector for Spark 2.0 should I use?

Question

Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions.

It is not clear to me whether the connectors I have found support Spark 2.0 and the new Dataset API:

spark-hbase : https://github.com/apache/hbase/tree/master/hbase-spark
spark-hbase-connector : https://github.com/nerdammer/spark-hbase-connector
hortonworks-spark/shc : https://github.com/hortonworks-spark/shc

The project is written in Scala 2.11 with SBT.

Thanks for your help

Patrick Clay · Accepted Answer · 2018-05-05T01:51:33.277

7

Update: SHC now seems to work with Spark 2 and the Table API. See https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc

Original answer:

I don't believe any of these (or any other existing connector) will do all that you would like today.

spark-hbase will probably be the right solution when it is released (HBase 1.4?), but it currently only builds at head and Spark 2 support is still in progress.
spark-hbase-connector seems to support only the RDD APIs, but since those are more stable, it might be somewhat helpful.
hortonworks-spark/shc probably won't work because I believe it only supports Spark 1 and uses the older HTable APIs which do not work with BigTable.

I would recommend just using HBase MapReduce APIs with RDD methods like newAPIHadoopRDD (or possibly the spark-hbase-connector?). Then manually convert RDDs into DataSets. This approach is a lot easier in Scala or Java than Python.
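A minimal sketch of that read path, assuming the connection settings for HBase or Bigtable come from an `hbase-site.xml` on the classpath. The table name `my-table`, the column `cf:col`, and the `Record` case class are hypothetical placeholders, not names from the question:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row shape; must be top-level so Spark can derive an encoder.
case class Record(rowKey: String, value: Option[String])

object HBaseReadSketch {
  // Pure conversion from raw bytes; a missing cell (null) becomes None.
  def decode(rowKey: Array[Byte], value: Array[Byte]): Record =
    Record(Bytes.toString(rowKey), Option(value).map(b => Bytes.toString(b)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-read-sketch").getOrCreate()
    import spark.implicits._

    // Assumption: cluster settings are picked up from hbase-site.xml.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my-table") // hypothetical table name

    val rdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Result is not serializable, so convert to plain values before any shuffle.
    val ds: Dataset[Record] = rdd.map { case (_, result) =>
      decode(result.getRow, result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    }.toDS()

    ds.show()
    spark.stop()
  }
}
```

Mapping each `Result` to a case class right away keeps the HBase types out of the rest of the job, so everything downstream is ordinary Dataset code.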

This is an area that the HBase community is working to improve and Google Cloud Dataproc will incorporate those improvements as they happen.

edited May 05 '18 at 01:51

answered Dec 01 '16 at 19:33

Patrick Clay


Thanks for your help, this is what I have done for reads and it works quite well with `spark.sparkContext.newAPIHadoopRDD(config, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])`. How should I use this API for bulk writes? – ogen Dec 02 '16 at 10:15
Simply with saveAsNewAPIHadoopDataset(...) – ogen Dec 02 '16 at 10:31
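
For readers following along, a hedged sketch of what such a write could look like with `saveAsNewAPIHadoopDataset` and `TableOutputFormat`. The table name `my-table`, the column `cf:col`, and the sample data are hypothetical, and connection settings are again assumed to come from `hbase-site.xml`:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

object HBaseWriteSketch {
  // Builds the (key, Put) pair that TableOutputFormat expects for one cell.
  def toPut(rowKey: String, value: String): (ImmutableBytesWritable, Put) = {
    val key = Bytes.toBytes(rowKey)
    val put = new Put(key)
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    (new ImmutableBytesWritable(key), put)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-write-sketch").getOrCreate()

    // Assumption: cluster settings are picked up from hbase-site.xml.
    val conf = HBaseConfiguration.create()
    conf.set(TableOutputFormat.OUTPUT_TABLE, "my-table") // hypothetical table name

    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    job.setOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setOutputValueClass(classOf[Put])

    // Hypothetical sample data standing in for the real RDD.
    val rows = spark.sparkContext.parallelize(Seq("r1" -> "a", "r2" -> "b"))

    rows.map { case (k, v) => toPut(k, v) }
      .saveAsNewAPIHadoopDataset(job.getConfiguration)

    spark.stop()
  }
}
```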

Looks like hortonworks released a version for Spark 2: https://github.com/hortonworks-spark/shc/tree/v1.0.1-2.0 – angelcervera Dec 07 '16 at 14:19
Is spark-hbase compatible with Scala 2.11? I think it is built for Scala 2.10: https://repository.apache.org/content/repositories/snapshots/org/apache/hbase/hbase-spark/2.0.0-SNAPSHOT/ – Mahmoud Hanafy Apr 24 '17 at 08:25
Any update on this? I want to sort an HBase (non-rowkey) column to get the rowkeys corresponding to the top ten column values. Would doing this in Spark with the spark-hbase connector be fast? – Mahesha999 May 02 '18 at 12:21

score 2 · Answer 2 · edited Dec 01 '16 at 21:21


In addition to the above answer: using newAPIHadoopRDD means that you get all the data from HBase, and from then on it's all core Spark. You would not get any HBase-specific APIs such as Filters. Also, for the current spark-hbase, only snapshots are available.

edited Dec 01 '16 at 21:21

David


answered Dec 01 '16 at 21:09

Ramzy

