
I would like to use BigTable as a sink for a Flink job:

  1. Is there an out-of-the-box connector?
  2. Can I use the DataStream API?
  3. How can I optimally persist a sparse object (99% sparsity), i.e. ensure no key/value pairs are created in BigTable for nulls?

I have searched the documentation for the above topics but couldn't answer those questions.

Thanks for your support!

py-r
  • Cloudera addresses this use case and refers to a [Flink HBase connector](https://docs.cloudera.com/csa/1.2.0/datastream-connectors/topics/csa-hbase-connector.html). It seems that it can be [manually installed](https://stackoverflow.com/a/46887749/9457843). You will notice in [the example](https://docs.cloudera.com/csa/1.2.0/datastream-connectors/topics/csa-hbase-configuration.html) that there is a piece of code where the columns are added with `put.addColumn`, so in that section you can check whether a value is null and discard it. Since BigTable can be accessed through the HBase API, it is possible that this works. – rsantiago Feb 04 '21 at 22:57
  • Thanks for your input! Any idea if this is the same connector as the one referred to by @igordvorzhak? – py-r Feb 05 '21 at 04:00

1 Answer


I do not think that Flink has a native BigTable connector.

That said, you can use the Flink HBase SQL Connector together with the BigTable HBase client to access BigTable from Flink:

Flink job <-> Flink HBase SQL Connector <-> BigTable HBase client <-> BigTable
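As a minimal sketch of the first two links in that chain, a sink table could be declared with the HBase SQL connector roughly as follows. The table name, column family, columns, and the `zookeeper.quorum` value are all placeholders, not verified against a BigTable setup:

```sql
-- Sketch only: names, column family and connection options are illustrative.
CREATE TABLE bigtable_sink (
  rowkey STRING,
  cf ROW<col1 STRING, col2 STRING>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-2.2',             -- Flink HBase SQL connector
  'table-name' = 'my_table',
  'zookeeper.quorum' = 'localhost:2181'  -- required by the connector
);
```

With the BigTable HBase client on the classpath, the connection would be routed to BigTable through the HBase configuration rather than an actual ZooKeeper quorum.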

This connector appears to be similar to the Flink HBase connector proposed by Cloudera, which can be installed manually (see @rsantiago's comment).

A possible approach to persisting sparse data could be taken from Cloudera's example, where columns are added with `put.addColumn`: in that section you can check whether a value is null and, if so, discard it (see @rsantiago's comment).
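As a rough sketch of that null-skipping idea in plain Java (the `Map`-based row shape, the class and method names, and the commented HBase calls are assumptions for illustration, not code from the connector):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SparseRow {
    // Keep only the columns that actually carry a value; for a 99%-sparse
    // row this leaves the handful of cells that should reach BigTable.
    public static Map<String, String> nonNullColumns(Map<String, String> row) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (e.getValue() != null) {
                cells.put(e.getKey(), e.getValue());
            }
        }
        return cells;
    }

    // Inside the sink, only these cells would then be emitted, e.g. with the
    // HBase client API (also exposed by the BigTable HBase client):
    //
    //   Put put = new Put(Bytes.toBytes(rowKey));
    //   for (Map.Entry<String, String> e : nonNullColumns(row).entrySet()) {
    //       put.addColumn(family, Bytes.toBytes(e.getKey()),
    //                     Bytes.toBytes(e.getValue()));
    //   }
}
```

Because no `addColumn` call is made for a null, no cell is created in BigTable at all, which is exactly what the question asks for.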

py-r
Igor Dvorzhak
  • Thanks. I want to avoid persisting null values within a row or a column, not exclude an entire row or column (which is what the approach suggests in Flink's Table API), as that would mean data loss. Isn't there an option in the connector itself? – py-r Jan 31 '21 at 20:41
  • Would the proposed approach work, now that we discussed the [same use case in Spark](https://stackoverflow.com/questions/65647574/spark-hbase-bigtable-wide-sparse-dataframe-persistence?noredirect=1#comment116720157_65647574)? Any details appreciated here. Thanks! – py-r Feb 02 '21 at 19:17
  • Theoretically it can, but I'm not sure if Flink APIs are flexible enough to do the same. – Igor Dvorzhak Feb 03 '21 at 16:33
  • Does the connector from Cloudera referred to in @rsantiago's comment above differ from the one you point to? Also, does the method he proposes to avoid nulls fit with your proposal? Both points would make your answer stronger. Thanks – py-r Feb 07 '21 at 12:46
  • Yes, the connector seems to be the same — it seems that Flink just changed its Maven artifact name from `flink_hbase_2.12` to `flink-connector-hbase_2.12` in the latest Flink versions. Yes, I think that the skipping-null-columns approach should work. – Igor Dvorzhak Feb 08 '21 at 04:28
  • @igordvorzhak: Glad if you can update your answer with those facts and doubts. – py-r Feb 08 '21 at 17:24
  • Community, if you have tested this, please shout! – py-r Feb 08 '21 at 17:24