
I have parquet data files partitioned by country and as of date.

sales
  country=USA
    asOfDate=2016-01-01
    asOfDate=2016-01-02
  country=FR
....

I need to process the data, and the user can choose which countries to process and, for each country, which as-of date range.

Country, Start Date, End Date
USA, 2016-01-01, 2016-03-31
FR, 2016-02-01, 2016-08-31
...

What is the optimal way to read this data with Spark 2.x so that Spark does not scan the whole dataset? I have a couple of alternatives:

  1. Simply use filter:

    filter("(country = "USA" AND asOfDate >= "2016-01-01" AND asOfDate <= "2016-03-31") OR (....)")
    
  2. Construct the directory manually and pass each subdirectory to the parquet read:

    spark.read.parquet("/sales/country=USA/asOfDate=2016-01-01", ""/sales/country=USA/asOfDate=2016-01-02",...)
    

Option 2 is very tedious, but I'm not sure whether option 1 will cause Spark to scan all files in all directories.
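For reference, here is a minimal Scala sketch of option 1, assuming the requested ranges are held in a local Seq; the `ranges` values and the `/sales` path simply mirror the example above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("sales-read").getOrCreate()

    // (country, start date, end date) requests from the table above.
    val ranges = Seq(
      ("USA", "2016-01-01", "2016-03-31"),
      ("FR",  "2016-02-01", "2016-08-31")
    )

    // One disjunctive predicate over the partition columns. The dates are
    // ISO-formatted strings, so string comparison orders them correctly.
    val predicate = ranges
      .map { case (country, from, to) =>
        col("country") === country && col("asOfDate").between(from, to)
      }
      .reduce(_ || _)

    val sales = spark.read.parquet("/sales").filter(predicate)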

Update: This is not a duplicate, as the other question is about pruning, while this one is about how best to read a partitioned Parquet file through the Spark API.

Shaido
suriyanto
  • Were you able to solve this? I really want to know how you are reading in the files. I was trying `spark.read.parquet(basepath).filter($"exp_start_date".geq("2018-02-08"))` but it doesn't work for me. :( The "exp_start_date" is my partition name – jimseeve Feb 11 '18 at 22:36
  • @jimseeve Did you resolve this by the way? – thebluephantom Aug 15 '18 at 10:14
  • @jimseeve, option 1 works for me. I can see in the info log that Spark picks only the files from the correct directories. – suriyanto Nov 13 '18 at 08:38

2 Answers


Definitely option 1.

You can check for yourself by calling .explain(extended = true) on your dataset's query (or directly in the Spark UI's SQL page) to see what is happening with your reads. Look for the pushed-down predicates in the physical plan: pushdown means the predicate is evaluated at the storage layer, so only the required data is read.

More details: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-Optimizer-PushDownPredicate.html
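A minimal sketch of that check, reusing the layout from the question; the plan markers PartitionFilters and PushedFilters are what to look for:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("pruning-check").getOrCreate()

    val df = spark.read.parquet("/sales")
      .filter(col("country") === "USA" &&
              col("asOfDate").between("2016-01-01", "2016-03-31"))

    // extended = true prints the parsed, analyzed, optimized and physical plans.
    df.explain(extended = true)
    // In the FileScan node of the physical plan, predicates on partition
    // columns appear under PartitionFilters (partition pruning), while
    // predicates on ordinary data columns appear under PushedFilters
    // (Parquet filter pushdown).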

Paul Leclercq

You can store the data in a Hive table that is partitioned by country and as-of date.

The files will still be stored in separate folders, but the Hive metastore will manage them for you, so only the partitions matching the query are listed and read.
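A minimal sketch of this approach, assuming Hive support is enabled and the files already sit under /sales in the layout from the question; `amount` stands in for the real data columns:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("sales-hive")
      .enableHiveSupport() // needs a Hive metastore available
      .getOrCreate()

    // Point an external table at the existing directory layout; the
    // partition columns mirror the country=/asOfDate= directories.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS sales (amount DOUBLE)
      PARTITIONED BY (country STRING, asOfDate STRING)
      STORED AS PARQUET
      LOCATION '/sales'
    """)

    // Register the partition directories that already exist on disk.
    spark.sql("MSCK REPAIR TABLE sales")

    // The metastore resolves the WHERE clause to a partition list before
    // any files are listed, so only the matching directories are read.
    val usaQ1 = spark.sql("""
      SELECT * FROM sales
      WHERE country = 'USA' AND asOfDate BETWEEN '2016-01-01' AND '2016-03-31'
    """)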

lev