Let's say I have a script which writes a parquet dataset every week into a FOLDER, partitioned by two columns: DAY and COUNTRY.
SOLUTION 1:
df.write.parquet(FOLDER, mode='overwrite',
                 partitionBy=['DAY', 'COUNTRY'])
The problem is that if you later want to rerun the script for just one country and date, because the data in that partition is corrupted, this deletes the entire contents of the folder and writes data only for that specific day/country.
APPEND doesn't solve it either: it would just append the correct data next to the bad data.
Ideally, the above command would overwrite ONLY the DAY/COUNTRY combinations present in the df.
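To make the three behaviours concrete, here is a toy model in plain Python with no Spark involved. The "folder" is just a dict keyed by (DAY, COUNTRY), and all function names and sample values are made up for illustration; it only mirrors the semantics described above, not how Spark implements them.

```python
# Toy model: the "folder" is a dict mapping (DAY, COUNTRY) -> list of rows.
# All names and values here are illustrative assumptions, not Spark APIs.

def group_by_partition(rows):
    """Group rows (dicts with DAY and COUNTRY keys) by partition key."""
    grouped = {}
    for row in rows:
        grouped.setdefault((row["DAY"], row["COUNTRY"]), []).append(row)
    return grouped

def overwrite_all(folder, rows):
    """Overwrite as described in SOLUTION 1: the whole folder is replaced."""
    return group_by_partition(rows)

def append(folder, rows):
    """Append: new rows land next to the existing (bad) ones."""
    result = {key: list(value) for key, value in folder.items()}
    for key, new_rows in group_by_partition(rows).items():
        result.setdefault(key, []).extend(new_rows)
    return result

def overwrite_present_partitions(folder, rows):
    """Desired behaviour: replace only the partitions that appear in rows."""
    result = {key: list(value) for key, value in folder.items()}
    result.update(group_by_partition(rows))
    return result

folder = {
    ("2024-01-01", "US"): [{"DAY": "2024-01-01", "COUNTRY": "US", "value": "corrupt"}],
    ("2024-01-01", "DE"): [{"DAY": "2024-01-01", "COUNTRY": "DE", "value": "ok"}],
}
fix = [{"DAY": "2024-01-01", "COUNTRY": "US", "value": "fixed"}]

print(sorted(overwrite_all(folder, fix)))                 # only the US partition is left
print(len(append(folder, fix)[("2024-01-01", "US")]))     # 2: corrupt row plus fixed row
print(sorted(overwrite_present_partitions(folder, fix)))  # DE untouched, US replaced
```

Only the last function leaves the DE partition alone while replacing the corrupt US rows, which is the behaviour being asked for.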
SOLUTION 2:
Make a loop:
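for country in countries:
    for day in days:
        # Keep only this partition's rows; otherwise every folder would get the full df.
        subset = df.filter((df['DAY'] == day) & (df['COUNTRY'] == country))
        subset.write.parquet(f'{FOLDER}/{day}/{country}', mode='overwrite')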
This works: when I run the script, it only overwrites the files in the specific FOLDER/day/country directory. It just feels so wrong. Is there a better alternative?
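For reference, one candidate worth checking, assuming Spark 2.3 or later: the `spark.sql.sources.partitionOverwriteMode` setting. With the value `dynamic`, an overwrite with `partitionBy` should replace only the partitions for which the df contains rows. The sketch below is a minimal, untested-on-a-cluster illustration; the helper name `rewrite_partitions` is hypothetical, and `spark`, `df` and `folder` are assumed to be an existing SparkSession, DataFrame and output path.

```python
def rewrite_partitions(spark, df, folder):
    """Overwrite only the DAY/COUNTRY partitions that appear in df.

    Assumes Spark 2.3+; `spark` is a SparkSession and `df` a DataFrame
    containing the DAY and COUNTRY columns. The helper name is hypothetical.
    """
    # The default mode is "static", which clears the target before writing.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Same call as in SOLUTION 1; only the session setting differs.
    df.write.parquet(folder, mode="overwrite", partitionBy=["DAY", "COUNTRY"])

# Usage (assumes an active session and a df holding only the rows to rewrite):
# rewrite_partitions(spark, df, FOLDER)
```

Note that this writes Hive-style directories (FOLDER/DAY=.../COUNTRY=...), so it matches the layout of SOLUTION 1 rather than the FOLDER/day/country paths of the loop.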