I have Parquet data files partitioned by country and as-of date:
sales
  country=USA
    asOfDate=2016-01-01
    asOfDate=2016-01-02
  country=FR
  ...
I need to process this data, where the user can choose which countries to process and, for each country, which as-of date range:
Country, Start Date, End Date
USA, 2016-01-01, 2016-03-31
FR, 2016-02-01, 2016-08-31
...
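For concreteness, here is one way that selection could be represented in Scala (the Selection class and the values below are hypothetical placeholders mirroring the table above):

// Hypothetical representation of the user's per-country date-range choices.
case class Selection(country: String, startDate: String, endDate: String)

val selections = Seq(
  Selection("USA", "2016-01-01", "2016-03-31"),
  Selection("FR",  "2016-02-01", "2016-08-31")
)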
What is the most efficient way to read this data with Spark 2.x, so that Spark does not scan the whole dataset? I have a couple of alternatives:
1. Simply use a filter (see the first sketch after this list):

filter("(country = 'USA' AND asOfDate >= '2016-01-01' AND asOfDate <= '2016-03-31') OR (....)")
2. Construct the list of partition directories manually and pass each subdirectory to the Parquet reader (sketched further below):

spark.read.parquet("/sales/country=USA/asOfDate=2016-01-01", "/sales/country=USA/asOfDate=2016-01-02", ...)
Option 2 is very tedious, but I'm not sure if option 1 will cause Spark to scan all files in all directories.
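For option 2, the path list could at least be generated instead of typed by hand. A sketch, again assuming the selections Seq above and that every date in each range actually exists as a partition directory (a missing path would make the read fail); the basePath option keeps country and asOfDate available as columns even though the reader is pointed at leaf directories:

import java.time.LocalDate

// Expand each (country, startDate, endDate) selection into its partition directories.
val paths = selections.flatMap { s =>
  val start = LocalDate.parse(s.startDate)
  val end   = LocalDate.parse(s.endDate)
  Iterator.iterate(start)(_.plusDays(1))
    .takeWhile(!_.isAfter(end))
    .map(d => s"/sales/country=${s.country}/asOfDate=$d")
    .toList
}

val manualDf = spark.read
  .option("basePath", "/sales")
  .parquet(paths: _*)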
Update: this is not a duplicate; the other question is about partition pruning in general, while this one is about how best to read partitioned Parquet files through the Spark API.