
I have a Java Spark (v2.4.7) job that currently reads an entire table from HBase. The table has millions of rows, and reading all of it is very expensive in terms of memory. My process doesn't need all the data from the HBase table, so how can I avoid reading rows with specific keys?

Currently, I read from HBase as follows:

JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = sparkSession.sparkContext()
        .newAPIHadoopRDD(DataContext.getConfig(), TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class)
        .toJavaRDD(); // SparkContext#newAPIHadoopRDD returns a Scala RDD, so convert it

I saw the answer in this post, but I couldn't find how to filter out specific keys.
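For reference, my understanding from the TableInputFormat docs is that it reads a serialized Scan from the configuration under the TableInputFormat.SCAN key, so something along these lines might push the key filtering down to the HBase region servers before Spark ever sees the rows. This is only a sketch: the user_0001/user_9999 range and the skip_ prefix are made-up placeholders, and withStartRow/withStopRow assume an HBase 1.4+ client (older clients use setStartRow/setStopRow instead).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

Scan scan = new Scan();
// Restrict the read to a contiguous key range (evaluated server-side):
scan.withStartRow(Bytes.toBytes("user_0001")); // placeholder keys
scan.withStopRow(Bytes.toBytes("user_9999"));
// Additionally skip rows whose key starts with a given prefix:
scan.setFilter(new RowFilter(CompareFilter.CompareOp.NOT_EQUAL,
        new BinaryPrefixComparator(Bytes.toBytes("skip_"))));

// TableInputFormat picks the Scan up from this configuration key.
Configuration conf = DataContext.getConfig();
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan)); // throws IOException

JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = sparkSession.sparkContext()
        .newAPIHadoopRDD(conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class)
        .toJavaRDD();

If the keys I want to exclude aren't contiguous or prefix-shaped, I believe a FilterList of RowFilters could express more complex exclusions, though each filter still runs per-row on the region servers rather than skipping whole regions.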

Any help? Thanks!

Oded
    Generally in HBase you design your table such that your query only needs to refer to a consecutive set of rows. HBase offers several different types of row filter - start with https://stackoverflow.com/questions/17558547/hbase-easy-how-to-perform-range-prefix-scan-in-hbase-shell. – Ben Watson Aug 04 '21 at 09:13
  • Why is the job memory expensive? Are you loading the entire data into memory? – shay__ Aug 05 '21 at 06:17
  • Yes. I would like to read only part of it. – Oded Aug 05 '21 at 18:26

0 Answers