
When two or more Spark jobs write to the same output directory, they end up deleting each other's temporary files.

I'm writing a DataFrame in append mode with Spark 2.4.4, and I want to add a timestamp to Spark's temporary directory to avoid these deletions.

Example:

my Spark job writes to hdfs:/outputFile/0/tmp/file1.parquet

the same Spark job, called with other data, writes to hdfs:/outputFile/0/tmp/file2.parquet

I want jobSpark1 to write to hdfs:/outputFile/0/tmp+(timeStamp)/file1.parquet, the other job to write to hdfs:/outputFile/0/tmp+(timeStamp)/file2.parquet, and then move the parquet files to hdfs:/outputFile/
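One way to get a per-run staging directory like the one described above is to suffix `tmp` with a timestamp when building the output path. This is only a sketch; `stagingDir` is a hypothetical helper, not a Spark API, and the base path matches the example above.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper: build a unique staging directory by suffixing
// "tmp" with a timestamp, so concurrent jobs never share a temp dir.
def stagingDir(base: String): String = {
  val ts = LocalDateTime.now.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS"))
  s"$base/tmp_$ts"
}

val path = stagingDir("hdfs:/outputFile/0")
// e.g. hdfs:/outputFile/0/tmp_20200319161300123 -- pass this to
// df.write.save(path), then move the finished parquet files to
// hdfs:/outputFile/ afterwards.
```

Note that two jobs launched in the same millisecond could still collide; appending the application ID or a random suffix would make the directory name strictly unique.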

  • Does this answer your question? [Multiple spark jobs appending parquet data to same base path with partitioning](https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning) – mazaneicha Mar 19 '20 at 16:13

1 Answer

```scala
df.write
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2")
  .partitionBy("XXXXXXXX")
  .mode(SaveMode.Append)
  .format(fileFormat)
  .save(path)
```

When Spark appends data to an existing dataset, it uses FileOutputCommitter to manage staging output files and final output files. The behavior of FileOutputCommitter has a direct impact on the performance of jobs that write data.

A FileOutputCommitter has two methods, commitTask and commitJob. Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work. In Hadoop 2, the default value of mapreduce.fileoutputcommitter.algorithm.version is 1. With this version, commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves data from the job temporary directory to the final destination.

Because the driver does the work of commitJob, this operation can take a long time on cloud storage, and you may think your job is "hanging". However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves data generated by a task directly to the final destination, and commitJob is basically a no-op.
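The difference between the two algorithm versions can be illustrated with a toy sketch (this is not Spark's actual committer code; it uses local directories in place of HDFS paths, and the directory names are made up):

```scala
import java.nio.file.{Files, StandardCopyOption}

// Simulated layout: a task temp dir, a job temp dir, and the final destination.
val root    = Files.createTempDirectory("commit-demo")
val taskTmp = Files.createDirectories(root.resolve("_temporary/task_0"))
val jobTmp  = Files.createDirectories(root.resolve("_temporary/job"))
val dest    = Files.createDirectories(root.resolve("final"))

// A task writes its output into its own temporary directory.
val part = Files.write(taskTmp.resolve("part-0000.parquet"), "data".getBytes)

// Version 1: commitTask moves task output into the job temp dir ...
Files.move(part, jobTmp.resolve("part-0000.parquet"), StandardCopyOption.ATOMIC_MOVE)

// ... and commitJob later moves everything from the job temp dir to the
// final destination -- a second round of renames done by the driver.
Files.move(jobTmp.resolve("part-0000.parquet"), dest.resolve("part-0000.parquet"))

// Version 2 would instead have commitTask move part-0000.parquet straight
// into `dest`, so commitJob has (almost) nothing left to do.
```

On object stores, where a "rename" is a copy plus a delete, skipping the second round of moves is what makes version 2 noticeably faster.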