Parquet pushdown filters in Amazon S3/EMR

Question

Does predicate pushdown work when I run Spark on a cluster in these scenarios:

Custom EC2 instances with Spark running on them, with the Parquet files residing in S3
A Spark cluster running on EMR, again with the Parquet files on S3

I found a similar question here, but its answers are too old.

Obviously, the "yes" answer that was valid then is still valid now. Plus, Spark versions released since 2.2 (the version referred to in that answer) have added pushdown for other data types (timestamp, decimal). – mazaneicha, Sep 05 '19 at 14:35
I already mentioned that it is a duplicate, but the answers there are too old and no longer valid, as Spark, AWS and EMR have evolved over 3 years. – Sumit Agarwal, Sep 05 '19 at 14:36
@mazaneicha can you point me to some documentation that describes the pushdown in detail? – Sumit Agarwal, Sep 05 '19 at 14:39
@SumitAgarwal I copied my comment into an answer and added a couple of links. Thanks. – mazaneicha, Sep 05 '19 at 15:19

score 3 · Answer 1 · answered Sep 05 '19 at 15:15

The YES answer is still valid, along with the underlying premise that the pushdown capability in Parquet does not depend on the storage type. In addition, a recent Spark version (2.4) added pushdown for other data types (timestamp, decimal) and predicates.
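
To illustrate why the storage type does not matter, here is a minimal sketch of the idea behind Parquet filter pushdown: the reader compares the predicate against the min/max statistics kept for each row group and skips row groups that cannot contain a match. This is a toy model in plain Python, not Spark or Parquet library code; `RowGroup` and `scan_greater_than` are hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Toy stand-in for one Parquet row group of a single integer column."""
    values: list                   # column values stored in this row group
    lo: int = field(init=False)    # min statistic, as recorded in the file metadata
    hi: int = field(init=False)    # max statistic, as recorded in the file metadata

    def __post_init__(self):
        self.lo, self.hi = min(self.values), max(self.values)


def scan_greater_than(row_groups, threshold):
    """Evaluate `value > threshold`; return the matches and how many row groups were read."""
    matches, groups_read = [], 0
    for rg in row_groups:
        if rg.hi <= threshold:
            # The statistics prove that nothing here can match, so skip the data entirely.
            continue
        groups_read += 1
        matches.extend(v for v in rg.values if v > threshold)
    return matches, groups_read


row_groups = [RowGroup([1, 5, 9]), RowGroup([10, 14, 19]), RowGroup([20, 25, 31])]
matches, groups_read = scan_greater_than(row_groups, 18)
print(matches, groups_read)  # [19, 20, 25, 31] 2
```

The decision to skip a row group uses only the file's own metadata, so the same logic applies whether the bytes come from HDFS, local disk or S3. In Spark itself, calling `df.explain()` on a filtered Parquet read lists the predicates handed to the reader under `PushedFilters`, and the `spark.sql.parquet.filterPushdown` setting controls the feature.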

You can review the changes via Spark JIRA, for example, or by reading the source code/history if you prefer the ultimate truth.

answered Sep 05 '19 at 15:15

mazaneicha


This is what I was looking for: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html – Sumit Agarwal Sep 05 '19 at 18:38
Thanks @SumitAgarwal, that's very interesting! But if I understand them correctly, `S3 Select` is suggested as an alternative that enables push-down for CSV and JSON files (with some limitations); there is no mention of the Parquet format there. – mazaneicha Sep 05 '19 at 20:13
Agreed. I am still exploring it and may do a hands-on test to reach a final conclusion. I will share here if I find something about it. – Sumit Agarwal Sep 05 '19 at 20:45
S3 Select may do pushdown on Parquet, but the result comes back as JSON, so it isn't that useful. You are probably better off with the real Parquet libs in Spark. – stevel Sep 07 '19 at 13:33
