I have a Java Spark (v2.4.7) job that currently reads an entire table from HBase. The table has millions of rows, and reading all of it is very expensive in terms of memory. My process doesn't need all of the data in the table, so how can I avoid reading rows with specific keys?
Currently, I read from HBase as follows:
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = sparkSession.sparkContext()
        .newAPIHadoopRDD(DataContext.getConfig(), TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class)
        .toJavaRDD();
I saw the answer in this post, but I didn't find a way to filter out specific keys.
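From what I could find, TableInputFormat can be pointed at a serialized Scan through the job configuration, so I sketched the helper below. To be clear, withKeysExcluded and keysToSkip are just names I made up, and I'm assuming the CompareOp API from HBase 1.x:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;

// Serialize a Scan that excludes the given row keys and attach it to the
// configuration that TableInputFormat reads (the hbase.mapreduce.scan property).
static Configuration withKeysExcluded(Configuration conf, List<String> keysToSkip)
        throws IOException {
    Scan scan = new Scan();
    // A row passes only if it differs from every excluded key.
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    for (String key : keysToSkip) {
        filters.addFilter(new RowFilter(
                CompareFilter.CompareOp.NOT_EQUAL,
                new BinaryComparator(Bytes.toBytes(key))));
    }
    scan.setFilter(filters);
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
    return conf;
}

The idea would be to pass the returned configuration to newAPIHadoopRDD instead of DataContext.getConfig() directly. As far as I understand, the RowFilter would then run on the region servers, so the excluded rows never reach Spark, but I'm not sure this is the right approach or that it really helps with memory.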
Any help? Thanks!