
I have this code:

val count = spark.read.parquet("data.parquet").select("foo").where("foo > 3").count

I'm interested in whether Spark is able to push the filter down and read from the Parquet file only the values satisfying the where condition. Can we avoid a full scan in this case?

  • Please refer to https://stackoverflow.com/questions/50129411/why-is-predicate-pushdown-not-used-in-typed-dataset-api-vs-untyped-dataframe-ap – dassum Aug 26 '19 at 17:22

1 Answer


The short answer is yes in this case, but not in all cases.
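For example (a minimal sketch; the Record case class is hypothetical, and spark is assumed to be an existing SparkSession, as in your snippet), a Column-based predicate can be pushed down, while a typed filter with a Scala lambda cannot, as the linked question above explains:

// Column-based predicate: Catalyst sees "foo > 3" as an expression
// and can push it down to the Parquet reader.
val pushed = spark.read.parquet("data.parquet").select("foo").where("foo > 3").count

// Typed Dataset filter: the lambda body is opaque to Catalyst,
// so the predicate is NOT pushed down and all rows are read and decoded.
case class Record(foo: Int)  // hypothetical schema matching the file
import spark.implicits._
val notPushed = spark.read.parquet("data.parquet").as[Record].filter(_.foo > 3).count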

You can run .explain on the query and see for yourself.
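For instance, with the query from your snippet (the exact plan text varies by Spark version):

spark.read.parquet("data.parquet")
  .select("foo")
  .where("foo > 3")
  .explain()

// A pushed predicate shows up in the FileScan parquet node of the
// physical plan, roughly like:
//   PushedFilters: [IsNotNull(foo), GreaterThan(foo,3)]

Note that pushdown lets the Parquet reader skip whole row groups using the column min/max statistics in the file footer, so how much of the scan is actually avoided depends on how the data is laid out.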

Here is an excellent reference document, freely available on the Internet, from which I learnt a few things in the past: https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example

thebluephantom