
I am trying to use DataFrameReader.load("table name") to load a Hive table's records and return them as a DataFrame.

But I don't want to load all of the records; I want to fetch only the records with a specific date (which is one of the fields in the Hive table).

If I add a where condition on the returned DataFrame, will it load the entire table first and then filter the records by date?

I ask because the Hive tables are huge, and they are partitioned on the date field.

Basically, I want to achieve select * from table where date='date' using the load method, without loading the entire table.
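A minimal sketch of the approach in question, assuming a hypothetical database `mydb`, table `events`, and partition column `date` (none of these names come from the original post). The key point is that a DataFrame is lazy: attaching `.where(...)` before any action gives Spark the chance to prune partitions rather than scan everything.

```scala
// Sketch only: table and column names are illustrative assumptions.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("partition-filter-example")
  .enableHiveSupport()
  .getOrCreate()

// spark.table returns a lazy DataFrame; nothing is read yet.
// The filter on the partition column is part of the logical plan,
// so Spark can prune partitions before any data is loaded.
val df = spark.table("mydb.events")
  .where(col("date") === "2016-01-01")

df.show()  // only now is data actually read
```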

asked by Shankar, edited by eliasah

1 Answer


Recent versions of Spark support a feature called "predicate push-down". It does exactly what you want: where possible, it pushes SQL WHERE clauses down into the data source. I'm not sure whether predicate push-down currently works with the Hive data source (it works for Parquet, JDBC, and some other sources). See also: Does spark predicate pushdown work with JDBC?
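One way to check whether a predicate was actually pushed down is to inspect the physical plan. This sketch assumes a DataFrame `df` read from a partitioned table with an illustrative `date` column (names are assumptions, not from the original post):

```scala
// Sketch: verify push-down / partition pruning via the query plan.
import org.apache.spark.sql.functions.col

val filtered = df.where(col("date") === "2016-01-01")

// Look in the printed physical plan for the scan node: entries such as
// PartitionFilters or PushedFilters list the predicates that Spark
// pushed into the source instead of applying after a full scan.
filtered.explain(true)
```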

answered by Vitalii Kotliarenko