I am running a Spark job on a cluster with 2 worker nodes. I am using the code below (Spark Java) to save the computed DataFrame as CSV to the worker nodes:
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how spark writes multiple part files on each worker node.
Run 1) worker1 has part files and _SUCCESS; worker2 has _temporary/task*/part* (each task directory has its own part files).
Run 2) worker1 has part files and also a _temporary directory; worker2 has multiple part files.
Can anyone help me understand why this behavior occurs?

1) Should I consider the records in outputDir/_temporary as part of the output, along with the part files in outputDir?
2) Is the _temporary dir supposed to be deleted after the job runs, with its part files moved into outputDir?
3) Why can't it create the part files directly under outputDir?
coalesce(1) and repartition(1) are not options, since the output itself will be around 500 GB.
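For context, what I would like to end up with is the content of the committed part files only, without anything from _temporary. A minimal sketch of how I currently collect them after a run, assuming a POSIX shell and a local output directory (the directory name and file contents below are illustrative, not my real data; real Spark part files have names like part-00000-&lt;uuid&gt;.csv):

```shell
# Simulate a Spark CSV output directory (illustrative names/contents).
outputDir=./outputDir_demo
mkdir -p "$outputDir"
printf 'a,1\n' > "$outputDir/part-00000"
printf 'b,2\n' > "$outputDir/part-00001"
: > "$outputDir/_SUCCESS"

# Merge only the committed part files into a single CSV.
# The part-* glob skips _SUCCESS and any leftover _temporary directory,
# which is exactly why I need to know whether _temporary holds real output.
cat "$outputDir"/part-* > merged.csv
cat merged.csv
```

This only works as intended if the records under _temporary are duplicates of (or superseded by) the committed part files, which is what questions 1) and 2) are about.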
Spark 2.0.2 and 2.1.3, Java 8, no HDFS.