
When two or more Spark jobs write to the same output directory, they end up deleting each other's temporary files.

I'm writing a DataFrame in append mode with Spark 2.4.4, and I want to add a timestamp to Spark's temporary directory to avoid these deletions.

Example:

my Spark job writes to hdfs:/outputFile/0/tmp/file1.parquet

the same Spark job, called with other data, writes to hdfs:/outputFile/0/tmp/file2.parquet

I want jobSpark1 to write to hdfs:/outputFile/0/tmp+(timeStamp)/file1.parquet, the other job to write to hdfs:/outputFile/0/tmp+(timeStamp)/file2.parquet, and then move the parquet files to hdfs:/outputFile/
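One way to get a per-run staging directory like the one described above is to suffix `tmp` with a timestamp when building the output path. This is only a sketch; `stagingDir` is a hypothetical helper, not a Spark API, and the base path matches the example above.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper: build a unique staging directory by suffixing
// "tmp" with a timestamp, so concurrent jobs never share a temp dir.
def stagingDir(base: String): String = {
  val ts = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS"))
  s"$base/tmp_$ts"
}

val path = stagingDir("hdfs:/outputFile/0")
// e.g. hdfs:/outputFile/0/tmp_20200319161300123 -- pass this to
// df.write.save(path), then move the finished parquet files to
// hdfs:/outputFile/ afterwards.
```

Note that two jobs launched in the same millisecond could still collide; appending the application ID or a random suffix would make the directory name strictly unique.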

  • Does this answer your question? [Multiple spark jobs appending parquet data to same base path with partitioning](https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning) – mazaneicha Mar 19 '20 at 16:13

1 Answer

```scala
df.write
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2")
  .partitionBy("XXXXXXXX")
  .mode(SaveMode.Append)
  .format(fileFormat)
  .save(path)
```

When Spark appends data to an existing dataset, it uses FileOutputCommitter to manage staging output files and final output files. The behavior of FileOutputCommitter has a direct impact on the performance of jobs that write data.

A FileOutputCommitter has two methods, commitTask and commitJob. Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work. In Hadoop 2, the default value of mapreduce.fileoutputcommitter.algorithm.version is 1. With this version, commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves data from the job temporary directory to the final destination.

Because the driver does the work of commitJob, this operation can take a long time on cloud storage, and you may think your job is "hanging". However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves data generated by a task directly to the final destination, and commitJob is basically a no-op.
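The difference between the two algorithm versions can be illustrated with a toy sketch (this is not Spark's actual committer code; it uses local directories in place of HDFS paths, and the directory names are made up):

```scala
import java.nio.file.{Files, StandardCopyOption}

// Simulated layout: a task temp dir, a job temp dir, and the final destination.
val root    = Files.createTempDirectory("commit-demo")
val taskTmp = Files.createDirectories(root.resolve("_temporary/task_0"))
val jobTmp  = Files.createDirectories(root.resolve("_temporary/job"))
val dest    = Files.createDirectories(root.resolve("final"))

// A task writes its output into its own temporary directory.
val part = Files.write(taskTmp.resolve("part-0000.parquet"), "data".getBytes)

// Version 1: commitTask moves task output into the job temp dir ...
Files.move(part, jobTmp.resolve("part-0000.parquet"), StandardCopyOption.ATOMIC_MOVE)

// ... and commitJob later moves everything from the job temp dir to the
// final destination -- a second round of renames done by the driver.
Files.move(jobTmp.resolve("part-0000.parquet"), dest.resolve("part-0000.parquet"))

// Version 2 would instead have commitTask move part-0000.parquet straight
// into `dest`, so commitJob has (almost) nothing left to do.
```

On object stores, where a "rename" is a copy plus a delete, skipping the second round of moves is what makes version 2 noticeably faster.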