
I know this question has been asked before, but I am unable to get a clear working answer.

result.saveAsTextFile(path);
  1. When using Spark's saveAsTextFile, the output files are named "part-00000", "part-00001", and so on. Is it possible to change this to a custom name?

  2. Is it possible for saveAsTextFile to append to an existing file rather than overwrite it?

I am using Java 7 for coding, and the output file system will be cloud storage (Azure, AWS).

  • It is by design that the files are split. You can always merge them into a single file: http://stackoverflow.com/questions/5700068/merge-output-files-after-reduce-phase – Aravind Aug 11 '16 at 07:26
  • You can do it with a custom OutputFormat, but it will be quite a bit of effort. The file names come from deep in the file-writing system. I would suggest just accepting the file names as they are. To access the files you can do sc.textFile(filepath); that will work. To merge the split partitions into a single file you can use coalesce (see the sketch after these comments). – anshul_cached Aug 11 '16 at 07:29
  • Thanks. Any comments on appending to a file? – duck Aug 11 '16 at 07:38
  • @duck, for appending to a file, please refer to the answer at http://stackoverflow.com/questions/9162943/how-does-hdfs-with-append-works. HDInsight on Azure is based on the Hortonworks distribution, so you can also refer to https://community.hortonworks.com/questions/16990/append-in-hdfs.html. – Peter Pan Aug 12 '16 at 09:22
  • @MFST the links you provided are not helping; they are just theory explaining how append works. What I need is how to append using Spark; a code snippet would be helpful. – duck Aug 12 '16 at 16:54
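A quick sketch of the coalesce approach from the comment above, assuming an existing JavaSparkContext sc; the paths are placeholders:

// Repartition to a single partition so the save produces exactly one part file.
// Note: coalesce(1) funnels all data through one task, so only do this for small outputs.
JavaRDD<String> result = sc.textFile("outputPath");    // read the split part files back
result.coalesce(1).saveAsTextFile("mergedOutputPath"); // writes a single part-00000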

1 Answer


1) There is no direct support in the saveAsTextFile method for controlling the output file name. You can try saveAsHadoopDataset to control the output file basename.

e.g.: instead of part-00000 you can get customName-00000.

Keep in mind that you cannot control the suffix 00000 using this method. Spark assigns it automatically while writing so that each partition writes to a unique file.

In order to control that too, as mentioned in the comments above, you have to write your own custom OutputFormat (a sketch follows the snippet below).

SparkConf conf = new SparkConf().setMaster("local").setAppName("customOutputName");
JavaSparkContext sc = new JavaSparkContext(conf);

// saveAsHadoopDataset requires the output format, key/value classes
// and output directory to be set on the JobConf.
JobConf jobConf = new JobConf();
jobConf.setOutputKeyClass(NullWritable.class);
jobConf.setOutputValueClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);        // org.apache.hadoop.mapred.TextOutputFormat
jobConf.set("mapreduce.output.basename", "customName"); // replaces the "part" prefix
jobConf.set("mapred.output.dir", "outputPath");

// saveAsHadoopDataset is defined on pair RDDs, so wrap each line in a (key, value) pair
JavaRDD<String> input = sc.textFile("inputDir");
input.mapToPair(new PairFunction<String, NullWritable, Text>() {
    public Tuple2<NullWritable, Text> call(String line) {
        return new Tuple2<NullWritable, Text>(NullWritable.get(), new Text(line));
    }
}).saveAsHadoopDataset(jobConf);
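If you need full control over the suffix as well, here is a minimal, untested sketch of the custom OutputFormat route, using the new (mapreduce) Hadoop API together with saveAsNewAPIHadoopFile. CustomNameOutputFormat and the myFile- prefix are just placeholder names:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// The new-API TextOutputFormat builds its file name in getDefaultWorkFile,
// so overriding that one method controls both the prefix and the suffix.
public class CustomNameOutputFormat extends TextOutputFormat<NullWritable, Text> {
    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
        FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
        int partition = context.getTaskAttemptID().getTaskID().getId();
        // Produces myFile-0, myFile-1, ... instead of part-r-00000, part-r-00001, ...
        return new Path(committer.getWorkPath(), "myFile-" + partition + extension);
    }
}

You would then save a pair RDD with saveAsNewAPIHadoopFile("outputPath", NullWritable.class, Text.class, CustomNameOutputFormat.class, new Configuration()).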

2) A workaround would be to write the output as-is to your output location and then use the Hadoop FileUtil.copyMerge function to form a single merged file, as sketched below.
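Note that copyMerge exists in Hadoop 2.x but was removed in Hadoop 3; the paths here are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Merge every part file under the job's output directory into one file.
Configuration hadoopConf = new Configuration();
FileSystem fs = FileSystem.get(hadoopConf);
FileUtil.copyMerge(fs, new Path("outputPath"),        // source dir containing part-* files
                   fs, new Path("merged/output.txt"), // single destination file
                   false,                             // deleteSource: keep the part files
                   hadoopConf,
                   null);                             // addString: nothing inserted between files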

sujit