
I know this question has been asked before, but I am unable to get a clear working answer.

result.saveAsTextFile(path);
  1. When using Spark's saveAsTextFile, the output files are named "part-00000", "part-00001", and so on. Is it possible to change this to a custom name?

  2. Is it possible for saveAsTextFile to append to an existing file rather than overwrite it?

I am using Java 7 for coding, and the output file system will be cloud storage (Azure, AWS).

  • It is by design that the files are split. You can always merge them into a single file: http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase – Aravind Aug 11 '16 at 07:26
  • You can do it with a custom OutputFormat, but it will be quite a bit of effort. The file names come from deep in the file-writing system. I would suggest just accepting the file names as they are. To access the files you can do sc.textFile(filepath); that will work. To merge the split partitions into a single file you can use coalesce (see the sketch after these comments). – anshul_cached Aug 11 '16 at 07:29
  • Thanks. Any comments on appending to a file? – duck Aug 11 '16 at 07:38
  • @duck, for appending to a file, please refer to the answer at http://stackoverflow.com/questions/9162943/how-does-hdfs-with-append-works. HDInsight on Azure is based on the Hortonworks distribution, so you can also refer to https://community.hortonworks.com/questions/16990/append-in-hdfs.html. – Peter Pan Aug 12 '16 at 09:22
  • @MFST the links you provided are not helping; they are just theory explaining how append works. What I need is how to append using Spark; a code snippet would be helpful. – duck Aug 12 '16 at 16:54
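A quick sketch of the coalesce approach from the comment above, assuming an existing JavaSparkContext sc; the paths are placeholders:

// Repartition to a single partition so the save produces exactly one part file.
// Note: coalesce(1) funnels all data through one task, so only do this for small outputs.
JavaRDD<String> result = sc.textFile("outputPath");    // read the split part files back
result.coalesce(1).saveAsTextFile("mergedOutputPath"); // writes a single part-00000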

1 Answer


1) There is no direct support in the saveAsTextFile method for controlling the output file name. You can try saveAsHadoopDataset to control the output file basename.

e.g.: instead of part-00000 you can get customName-00000.

Keep in mind that you cannot control the suffix 00000 using this method. Spark assigns it automatically while writing so that each partition writes to a unique file.

In order to control that too, as mentioned in the comments above, you have to write your own custom OutputFormat (a sketch follows the snippet below).

SparkConf conf = new SparkConf().setMaster("local").setAppName("customOutputName");
JavaSparkContext sc = new JavaSparkContext(conf);

// saveAsHadoopDataset requires the output format, key/value classes
// and output directory to be set on the JobConf.
JobConf jobConf = new JobConf();
jobConf.setOutputKeyClass(NullWritable.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);        // org.apache.hadoop.mapred.TextOutputFormat
jobConf.set("mapreduce.output.basename", "customName"); // replaces the "part" prefix
jobConf.set("mapred.output.dir", "outputPath");

// saveAsHadoopDataset is defined on pair RDDs, so wrap each line in a (key, value) pair
JavaRDD<String> input = sc.textFile("inputDir");
input.mapToPair(new PairFunction<String, NullWritable, Text>() {
    public Tuple2<NullWritable, Text> call(String line) {
        return new Tuple2<NullWritable, Text>(NullWritable.get(), new Text(line));
    }
}).saveAsHadoopDataset(jobConf);
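If you need full control over the suffix as well, here is a minimal, untested sketch of the custom OutputFormat route, using the new (mapreduce) Hadoop API together with saveAsNewAPIHadoopFile. CustomNameOutputFormat and the myFile- prefix are just placeholder names:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// The new-API TextOutputFormat builds its file name in getDefaultWorkFile,
// so overriding that one method controls both the prefix and the suffix.
public class CustomNameOutputFormat extends TextOutputFormat<NullWritable, Text> {
    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
        FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
        int partition = context.getTaskAttemptID().getTaskID().getId();
        // Produces myFile-0, myFile-1, ... instead of part-r-00000, part-r-00001, ...
        return new Path(committer.getWorkPath(), "myFile-" + partition + extension);
    }
}

You would then save a pair RDD with saveAsNewAPIHadoopFile("outputPath", NullWritable.class, Text.class, CustomNameOutputFormat.class, new Configuration()).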

2) A workaround would be to write the output as-is to your output location and then use the Hadoop FileUtil.copyMerge function to form a single merged file, as sketched below.
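Note that copyMerge exists in Hadoop 2.x but was removed in Hadoop 3; the paths here are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Merge every part file under the job's output directory into one file.
Configuration hadoopConf = new Configuration();
FileSystem fs = FileSystem.get(hadoopConf);
FileUtil.copyMerge(fs, new Path("outputPath"),        // source dir containing part-* files
                   fs, new Path("merged/output.txt"), // single destination file
                   false,                             // deleteSource: keep the part files
                   hadoopConf,
                   null);                             // addString: nothing inserted between files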

sujit