
Let's say I have a script which writes a parquet file every week into a FOLDER, partitioned by two columns: DAY and COUNTRY.
SOLUTION 1:

   df.write.parquet(FOLDER, mode='overwrite',
                     partitionBy=['DAY', 'COUNTRY'])

The problem with this is that if you later want to rerun the script for just a specific country and day (e.g. due to corrupted data in that partition), it deletes the whole folder's contents and writes only the data for that specific day/country. APPEND doesn't solve it either; it would just append the correct data alongside the wrong one.
What would be ideal is if the above command ONLY overwrote the DAY/COUNTRY combinations that the df actually contains.

SOLUTION 2:
Make a loop:

    for country in countries:
        for day in days:
            # write only the rows for this day/country into its own subfolder
            df.filter((df.DAY == day) & (df.COUNTRY == country)) \
              .write.parquet(f"{FOLDER}/{day}/{country}", mode='overwrite')

This works: if I rerun the script, it only overwrites the files in the specific FOLDER/day/country, but it feels so wrong. Is there a better alternative?


1 Answer


If you are using Spark 2.3 or above, you can create a partitioned table and set the spark.sql.sources.partitionOverwriteMode setting to dynamic:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df.write.mode("overwrite").insertInto("yourtable")
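If you would rather keep writing to a path instead of a table (as in the question), the same setting should also make a path-based overwrite replace only the partitions present in the DataFrame. A minimal sketch, assuming the same FOLDER, df, DAY and COUNTRY from the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # With dynamic partition overwrite (Spark 2.3+), only the DAY/COUNTRY
    # partitions present in df are replaced; other subfolders under FOLDER
    # are left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df.write.parquet(FOLDER, mode="overwrite",
                     partitionBy=["DAY", "COUNTRY"])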
