
I have parquet data files partitioned by country and as of date.

sales
  country=USA
    asOfDate=2016-01-01
    asOfDate=2016-01-02
  country=FR
....

I need to process the data, and the user can choose which countries to process and, for each country, which as-of date range.

Country, Start Date, End Date
USA, 2016-01-01, 2016-03-31
FR, 2016-02-01, 2016-08-31
...

What is the optimal way to read this data with Spark 2.x so that Spark does not scan the whole dataset? I have a couple of alternatives:

  1. Simply use filter:

    filter("(country = "USA" AND asOfDate >= "2016-01-01" AND asOfDate <= "2016-03-31") OR (....)")
    
  2. Construct the directory manually and pass each subdirectory to the parquet read:

    spark.read.parquet("/sales/country=USA/asOfDate=2016-01-01", ""/sales/country=USA/asOfDate=2016-01-02",...)
    

Option 2 is very tedious, but I'm not sure whether option 1 will cause Spark to scan all files in all directories.
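For reference, here is a minimal Scala sketch of option 1, assuming the requested ranges are held in a local Seq; the `ranges` values and the `/sales` path simply mirror the example above:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("sales-read").getOrCreate()

    // (country, start date, end date) requests from the table above.
    val ranges = Seq(
      ("USA", "2016-01-01", "2016-03-31"),
      ("FR",  "2016-02-01", "2016-08-31")
    )

    // One disjunctive predicate over the partition columns. The dates are
    // ISO-formatted strings, so string comparison orders them correctly.
    val predicate = ranges
      .map { case (country, from, to) =>
        col("country") === country && col("asOfDate").between(from, to)
      }
      .reduce(_ || _)

    val sales = spark.read.parquet("/sales").filter(predicate)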

Update: This is not a duplicate, as the other question is about pruning, while this one is about how best to read a partitioned Parquet file through the Spark API.

Shaido
suriyanto
  • Were you able to solve this? I really want to know how you are reading in the files. I was trying `spark.read.parquet(basepath).filter($"exp_start_date".geq("2018-02-08"))` but it doesn't work for me. :( The "exp_start_date" is my partition name – jimseeve Feb 11 '18 at 22:36
  • @jimseeve Did you resolve this by the way? – thebluephantom Aug 15 '18 at 10:14
  • @jimseeve, option 1 works for me. I can see in the info log that Spark picks only the files from the correct directories. – suriyanto Nov 13 '18 at 08:38

2 Answers


Definitely option 1.

You can check for yourself by calling .explain(extended = true) on your dataset's query (or directly in the Spark UI's SQL page) to see what is happening with your reads. Look for the pushed-down predicates in the physical plan: pushdown means the predicate is evaluated at the storage layer, so only the required data is read.

More details: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-Optimizer-PushDownPredicate.html
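A minimal sketch of that check, reusing the layout from the question; the plan markers PartitionFilters and PushedFilters are what to look for:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("pruning-check").getOrCreate()

    val df = spark.read.parquet("/sales")
      .filter(col("country") === "USA" &&
              col("asOfDate").between("2016-01-01", "2016-03-31"))

    // extended = true prints the parsed, analyzed, optimized and physical plans.
    df.explain(extended = true)
    // In the FileScan node of the physical plan, predicates on partition
    // columns appear under PartitionFilters (partition pruning), while
    // predicates on ordinary data columns appear under PushedFilters
    // (Parquet filter pushdown).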

Paul Leclercq

You can store the data in a Hive table that is partitioned by country and as-of date.

The files will still be stored in separate folders, but the Hive metastore will manage them for you, so only the partitions matching the query are listed and read.
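A minimal sketch of this approach, assuming Hive support is enabled and the files already sit under /sales in the layout from the question; `amount` stands in for the real data columns:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("sales-hive")
      .enableHiveSupport() // needs a Hive metastore available
      .getOrCreate()

    // Point an external table at the existing directory layout; the
    // partition columns mirror the country=/asOfDate= directories.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS sales (amount DOUBLE)
      PARTITIONED BY (country STRING, asOfDate STRING)
      STORED AS PARQUET
      LOCATION '/sales'
    """)

    // Register the partition directories that already exist on disk.
    spark.sql("MSCK REPAIR TABLE sales")

    // The metastore resolves the WHERE clause to a partition list before
    // any files are listed, so only the matching directories are read.
    val usaQ1 = spark.sql("""
      SELECT * FROM sales
      WHERE country = 'USA' AND asOfDate BETWEEN '2016-01-01' AND '2016-03-31'
    """)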

lev