df.write
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2")
  .partitionBy("XXXXXXXX")
  .mode(SaveMode.Append)
  .format(fileFormat)
  .save(path)
When Spark appends data to an existing dataset, it uses FileOutputCommitter to manage the staging output files and final output files. The behavior of FileOutputCommitter has a direct impact on the performance of jobs that write data.
A FileOutputCommitter has two methods, commitTask and commitJob. Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work. In Hadoop 2, the default value of mapreduce.fileoutputcommitter.algorithm.version is 1. With this version, commitTask moves data generated by a task from the task's temporary directory to the job's temporary directory, and when all tasks complete, commitJob moves the data from the job's temporary directory to the final destination.
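If you want algorithm version 2 for every write, rather than passing it per write with .option() as in the snippet above, the property can also be set once on the cluster's Hadoop configuration. A minimal sketch, assuming an existing SparkSession named spark:

// Set the committer algorithm version once for all subsequent writes,
// instead of passing it per write with .option().
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// Equivalent launch-time setting:
//   spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 ...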
Because the driver does the work of commitJob, this operation can take a long time when the destination is cloud storage, and you may often think that your cell is "hanging". However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves the data generated by a task directly to the final destination, and commitJob is basically a no-op.
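If a write seems stuck at the commit stage, one way to check which algorithm version is actually in effect is to read it back from the Hadoop configuration. A minimal sketch, again assuming a SparkSession named spark:

// Returns the configured version; Hadoop 2 defaults to "1" when unset.
val committerVersion = spark.sparkContext.hadoopConfiguration
  .get("mapreduce.fileoutputcommitter.algorithm.version", "1")
println(s"FileOutputCommitter algorithm version: $committerVersion")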