Spark deletes all existing partitions when writing an empty dataframe with overwrite.
I have the code below, which runs daily and writes data to S3.
It dynamically adds new partitions whenever there are new partition values, but when the dataframe is empty it deletes all of the existing partitions.
I have to use 'overwrite' here to avoid duplicate data (append is not an option).
df.write
  .partitionBy("year", "month", "day")
  .mode("overwrite")
  .parquet("s3://..../table1/")
I can check whether the dataframe is empty before writing, but that adds overhead.
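The check I mean would look roughly like this (a sketch; head(1), or df.isEmpty on Spark 2.4+, avoids a full count() but still triggers a job to evaluate the dataframe before the actual write):

// Sketch of the workaround I'd like to avoid: only write when there is data.
if (df.head(1).nonEmpty) {
  df.write
    .partitionBy("year", "month", "day")
    .mode("overwrite")
    .parquet("s3://..../table1/")
}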
I find this behavior odd because, with dynamic partition overwrite, 'overwrite' should replace only the matching partitions.
What is the right way to handle this scenario?
Note: This question is different from the question linked below, as mine is specific to an empty dataframe and to writing directly to S3 (not to a table), and the solution from that question didn't work because the property "spark.sql.sources.partitionOverwriteMode" is already set to "dynamic".
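For reference, that property is already applied in the job, roughly like this (a sketch of the relevant configuration, not the full job code):

// The dynamic overwrite setting is applied on the session before the write.
// It can equivalently be passed at build time via
// .config("spark.sql.sources.partitionOverwriteMode", "dynamic").
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")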