I have Parquet files partitioned by date, stored in directories like:
/activity
    /date=20180802
I'm using Spark 2.2 and there are 400+ partitions. My understanding is that predicate pushdown should let a query like the one below read only the matching partition and return quickly.
spark.read.parquet(".../activity")
  .filter($"date" === "20180802" && $"id" === "58ff800af2")
  .show()
However, the query above is taking around 90 seconds while the query below takes around 5 seconds. Am I doing something wrong or is this expected behavior?
spark.read.parquet(".../activity/date=20180802")
  .filter($"id" === "58ff800af2")
  .show()
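To narrow this down, the physical plans of the two queries can be compared. The snippet below is only a diagnostic sketch: it assumes a SparkSession is available (created here with a made-up app name), and `basePath` is a placeholder for the elided path above. In the printed plan, the Parquet scan node should list `PartitionFilters` and `PushedFilters`, which shows whether the `date` condition is being treated as a partition filter. It does not show where the 90 seconds are spent.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: reuse the existing session if one is already running (e.g. in spark-shell).
val spark = SparkSession.builder().appName("partition-pruning-check").getOrCreate()
import spark.implicits._

// Placeholder for the elided path in the question; substitute the real location.
val basePath = ".../activity"

// Query 1: read the table root, filter on the partition column and a data column.
spark.read.parquet(basePath)
  .filter($"date" === "20180802" && $"id" === "58ff800af2")
  .explain(true) // prints the logical and physical plans

// Query 2: read a single partition directory, filter on the data column only.
spark.read.parquet(s"$basePath/date=20180802")
  .filter($"id" === "58ff800af2")
  .explain(true)
```

If the first plan lists the `date` condition under `PartitionFilters`, the filter itself is being pruned as expected and the extra time is presumably spent elsewhere.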