
I would like to know how exactly this works:

df = sqlContext.read \
          .format("org.apache.phoenix.spark") \
          .option("table", "TABLE") \
          .option("zkUrl", "10.0.0.11:2181:/hbase-unsecure") \
          .load()

Does this load the whole table, or does it delay loading until it knows whether a filter will be applied?

If it loads everything, how can I tell Phoenix to filter the table before loading it into the Spark DataFrame?

Thanks

Pablo Castilla

1 Answer


Data is not loaded until you execute an action that requires it. Any filters applied in between:

df.where($"foo" === "bar").count

will be pushed down to the source by Spark when possible. You can inspect the result of predicate pushdown by running explain().

T. Gawęda
  • I know this works for other kinds of loads, but I can't put a filter before the load here :S. (I am using PySpark and Phoenix.) – Pablo Castilla Nov 29 '16 at 16:26
  • @PabloCastilla If you write `spark.(..).load().where(...).count`, then Spark will automatically do predicate pushdown. You don't have to deal with it – T. Gawęda Nov 29 '16 at 16:44
  • 1
    You are totally correct. I have seen that with the explain() function. Thanks! – Pablo Castilla Nov 29 '16 at 16:53
  • The explain() function shows, for example: `*(1) Filter (isnotnull(UPDATED_AT#68) && (UPDATED_AT#68 > 2019-04-23 00:00:00)) +- *(1) Scan PhoenixRelation ... PushedFilters: ...` – aof Apr 26 '19 at 05:51