
I would like to know how exactly this works:

df = sqlContext.read \
          .format("org.apache.phoenix.spark") \
          .option("table", "TABLE") \
          .option("zkUrl", "10.0.0.11:2181:/hbase-unsecure") \
          .load()

Does this load the whole table, or does it delay loading until it knows whether a filter will be applied?

If it loads everything, how can I tell Phoenix to filter the table before loading it into the Spark DataFrame?

Thanks

Pablo Castilla

1 Answer


Data is not loaded until you execute an action that requires it. Any filters applied in between:

df.where($"foo" === "bar").count

will be pushed down to the source by Spark when possible. You can inspect the result of predicate pushdown by running explain().

T. Gawęda
  • I know this works for other kinds of loads, but I can't put a filter before the load here :S. (I am using PySpark and Phoenix.) – Pablo Castilla Nov 29 '16 at 16:26
  • @PabloCastilla If you write `spark.(..).load().where(...).count`, then Spark will automatically do predicate pushdown. You don't have to deal with it – T. Gawęda Nov 29 '16 at 16:44
  • 1
    You are totally correct. I have seen that with the explain() function. Thanks! – Pablo Castilla Nov 29 '16 at 16:53
  • The explain() function shows, for example: `*(1) Filter (isnotnull(UPDATED_AT#68) && (UPDATED_AT#68 > 2019-04-23 00:00:00)) +- *(1) Scan PhoenixRelation ... PushedFilters: ...` – aof Apr 26 '19 at 05:51