
Spark deletes all the existing partitions while writing an empty dataframe with overwrite.

I have the code below, which runs daily, to write data to S3.

This dynamically adds new partitions when there are new partition values, but when the dataframe is empty, it deletes all the existing partitions.

I have to use 'overwrite' here to avoid duplicate data (append is not an option).

df.write
      .partitionBy("year", "month", "day")
      .mode("overwrite")
      .parquet("s3://..../table1/")

I can check whether the dataframe is empty before writing but it adds overhead.
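
For reference, the guard would look roughly like this (just a sketch of what I mean; `head(1)` fetches at most one row, so it's cheaper than a full `count`, but it still launches a job):

    // Sketch of the pre-write guard; head(1) pulls at most one row,
    // so it is cheaper than df.count but still triggers a Spark job.
    val dfIsEmpty = df.head(1).isEmpty   // Dataset.isEmpty also works on Spark 2.4+

    if (!dfIsEmpty) {
      df.write
        .partitionBy("year", "month", "day")
        .mode("overwrite")
        .parquet("s3://..../table1/")
    }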

I feel this behavior is odd because overwrite should overwrite only the matching partitions.

What is the right way to handle this scenario?

Note: This question is different from the question linked below, as mine is specific to an empty dataframe being written to S3 (not to a table), and the solution from that question didn't work because the property "spark.sql.sources.partitionOverwriteMode" is already set to "dynamic".

overwrite hive partitions using spark
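
For completeness, this is roughly how dynamic overwrite mode is already configured in my session (a sketch; the rest of the job is unchanged):

    // Already set per the note above; with dynamic mode, "overwrite" is
    // supposed to replace only the partitions present in the dataframe.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")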

Munesh
  • @mazaneicha This question is not a duplicate; I added the explanation to the question. – Munesh Oct 30 '19 at 20:54
  • When there are no partitions (empty DF) it writes to the root level (table1), and at that point it overwrites the remaining objects further down. The best solution is to use a simple if statement: `if(df.count!=0) {df.write.partitionBy("year", "month", "day").mode("overwrite").parquet("s3://.../table1/")}` – afeldman Oct 30 '19 at 21:21
  • As a workaround, I'm checking whether the dataframe is empty using the code below, but it doesn't seem like the right way: if(df.head(1).size == 0) – Munesh Oct 30 '19 at 21:26

0 Answers