I am using PySpark to overwrite specific Parquet partitions in an S3 bucket. This is what my partitioned folder structure looks like:
parent_folder
-> year=2019
-->month=1
---->date=2019-01-01
---->date=2019-01-02
-->month=2
........
-> year=2020
-->month=1
---->date=2020-01-01
---->date=2020-01-02
-->month=2
........
Now, when I run a Spark script that needs to overwrite only specific partitions (say the partitions for year=2020, month=1 and the dates 2020-01-01 and 2020-01-02) using the line below:
df_final.write.partitionBy("year", "month", "date").mode("overwrite").format("parquet").save(output_dir_path)
The above line deletes all the other partitions and writes back only the data that is present in the final dataframe, df_final. I have also set the overwrite mode to dynamic using the line below, but it doesn't seem to work:
conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
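For completeness, here is a minimal sketch of how the setting and the write fit together in my script (the session name spark and the appName are placeholders; df_final and output_dir_path come from the rest of the job):

from pyspark.sql import SparkSession

# Placeholder session; in the real job the session already exists.
spark = SparkSession.builder.appName("partition-overwrite").getOrCreate()

# With "dynamic", an overwrite is supposed to replace only the partitions
# present in the DataFrame being written, not the whole output directory.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df_final.write
    .partitionBy("year", "month", "date")
    .mode("overwrite")
    .format("parquet")
    .save(output_dir_path))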
My question is: is there a way to overwrite only specific partitions (more than one)? Any help will be much appreciated. Thanks in advance.