4

I am running a Spark job which performs extremely well as far as the logic goes. However, the names of my output files are in the format part-00000, part-00001, etc. when I use saveAsTextFile to save the files in an S3 bucket. Is there a way to change the output filenames?

Thank you.

Bharath
  • Possible duplicate of [Renaming Part Files in Hadoop Map Reduce](http://stackoverflow.com/questions/14555313/renaming-part-files-in-hadoop-map-reduce) – Binary Nerd Jun 22 '16 at 15:31
  • It is best to do this using the shell instead of in Spark. For example, you could potentially collect everything into one file with `coalesce`, but that puts strain on memory - also, HDFS works slightly differently from a regular file system, and Spark always creates a separate destination/folder for each output. – GameOfThrows Jun 22 '16 at 15:41
  • Isn't it something like this... https://gist.github.com/mlehman/df9546f6be2e362bbad2 – Ram Ghadiyaram Jun 22 '16 at 17:57

2 Answers

5

In Spark, you can use saveAsNewAPIHadoopFile and set the mapreduce.output.basename parameter in the Hadoop configuration to change the prefix (just the "part" prefix):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

val hadoopConf = new Configuration()
hadoopConf.set("mapreduce.output.basename", "yourPrefix")

// TextOutputFormat skips null keys, so only the value's toString ends up on each output line
yourRDD.map(str => (null, str))
        .saveAsNewAPIHadoopFile(s"$outputPath/$dirName", classOf[NullWritable], classOf[String],
          classOf[TextOutputFormat[NullWritable, String]], hadoopConf)

Your files will be named like: yourPrefix-r-00000, yourPrefix-r-00001, and so on.

In Hadoop and Spark you can have more than one file in the output, since you can have more than one reducer (Hadoop) or more than one partition (Spark). You therefore need to guarantee unique names for each of them, which is why it is not possible to override the sequence number at the end of the filename.

But if you want more control over your filenames, you can extend TextOutputFormat or FileOutputFormat and override the getUniqueFile method.
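
For illustration only, here is a minimal sketch of that idea. In the new mapreduce API, getUniqueFile is a static method, so a subclass typically overrides getDefaultWorkFile (which delegates to getUniqueFile); the class name and the "myData" prefix below are made up:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, FileOutputFormat, TextOutputFormat}

// Hypothetical output format: files come out as myData-r-00000.txt, myData-r-00001.txt, ...
class MyDataTextOutputFormat extends TextOutputFormat[NullWritable, String] {
  override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
    val committer = getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    // getUniqueFile still appends the task type and partition number, keeping names unique
    new Path(committer.getWorkPath, FileOutputFormat.getUniqueFile(context, "myData", ".txt"))
  }
}

// Usage: the same save call as above, just with the custom format class
// yourRDD.map(str => (null, str))
//   .saveAsNewAPIHadoopFile(s"$outputPath/$dirName", classOf[NullWritable], classOf[String],
//     classOf[MyDataTextOutputFormat], hadoopConf)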

RojoSam
  • Thank you for the comment. Can we save the files to an S3 bucket instead of HDFS? – Bharath Jun 22 '16 at 19:30
  • Yes, you can; Amazon S3 is compatible with the Hadoop filesystem API. – Paweł Jurczenko Jun 22 '16 at 20:36
  • Hadoop FileSystem implements several protocols/file systems, including S3: https://wiki.apache.org/hadoop/AmazonS3. You can use any of them almost transparently (you just need to specify the connection-specific parameters for each one; see the sketch after these comments). If you think my answer helped you with your original question, please accept it. – RojoSam Jun 22 '16 at 21:00
  • can you please look at this question https://stackoverflow.com/questions/46703623/how-to-rename-spark-data-frame-output-file-in-aws-in-spark-scala – Sudarshan kumar Jan 02 '18 at 08:44
  • I think it's `getUniqueName` now. Is there a way I can use a field from the row as a filename, e.g. id? – Chris May 08 '21 at 00:56
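
To illustrate the S3 discussion in the comments above, here is a hedged sketch that reuses yourRDD and hadoopConf from the answer; the bucket name and credential values are placeholders. With the hadoop-aws module on the classpath, you point the output path at an s3a:// URI and put the connection parameters in the Hadoop configuration:

// Hypothetical S3 settings for the s3a connector; the values are placeholders
hadoopConf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// Same save call as in the answer, but the destination is an S3 bucket instead of HDFS
yourRDD.map(str => (null, str))
  .saveAsNewAPIHadoopFile("s3a://your-bucket/some/output/dir", classOf[NullWritable], classOf[String],
    classOf[TextOutputFormat[NullWritable, String]], hadoopConf)
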
1

[Solution in Java]

Let's say you have:

JavaRDD<Text> rows;

And you want to write it to files like customPrefix-r-00000.

Configuration hadoopConf = new Configuration();
hadoopConf.set("mapreduce.output.basename", "customPrefix");

// TextOutputFormat skips null keys, so each output line contains only the row text
rows.mapToPair(row -> new Tuple2(null, row))
    .saveAsNewAPIHadoopFile(outputPath, NullWritable.class, Text.class,
        TextOutputFormat.class, hadoopConf);

Tada!!

  • Hi, I tried setting basename and it doesn't work: job.getConfiguration().set("mapreduce.output.basename", inputStrategyName); javaPairRDD.saveAsNewAPIHadoopFile(outputPath, new AvroKey().getClass(), NullWritable.class, AvroKeyOutputFormat.class, job.getConfiguration()); – Kans Jul 26 '18 at 21:48
  • What's the class of the job object? Just make a new configuration object of this class: org.apache.hadoop.conf.Configuration – chandan kharbanda Aug 02 '18 at 11:07