
Let's say I have a script which writes a parquet file every week into a FOLDER, partitioned by two columns: DAY and COUNTRY.
SOLUTION 1:

   df.write.parquet(FOLDER, mode='overwrite',
                     partitionBy=['DAY', 'COUNTRY'])

The problem with this is that if you later want to rerun the script for just a specific country and day (e.g. due to corrupted data in that partition), it deletes the whole folder's contents and writes only the data for that specific day/country. APPEND doesn't solve it either; it would just append the correct data alongside the wrong one.
What would be ideal is if the above command ONLY overwrote the DAY/COUNTRY combinations that the df actually contains.

SOLUTION 2:
Make a loop:

    for country in countries:
        for day in days:
            # write only the rows for this day/country into its own subfolder
            df.filter((df.DAY == day) & (df.COUNTRY == country)) \
              .write.parquet(f"{FOLDER}/{day}/{country}", mode='overwrite')

This works: if I rerun the script, it only overwrites the files in the specific FOLDER/day/country, but it feels so wrong. Is there a better alternative?


1 Answer


If you are using Spark 2.3 or above, you can create a partitioned table and set the spark.sql.sources.partitionOverwriteMode setting to dynamic:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df.write.mode("overwrite").insertInto("yourtable")
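If you would rather keep writing to a path instead of a table (as in the question), the same setting should also make a path-based overwrite replace only the partitions present in the DataFrame. A minimal sketch, assuming the same FOLDER, df, DAY and COUNTRY from the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # With dynamic partition overwrite (Spark 2.3+), only the DAY/COUNTRY
    # partitions present in df are replaced; other subfolders under FOLDER
    # are left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df.write.parquet(FOLDER, mode="overwrite",
                     partitionBy=["DAY", "COUNTRY"])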
