
I have done quite a lot of research on this question and nowhere found a satisfying answer. I need to rename the output files that Spark produces.

Currently I write my Spark DataFrame to S3, then read it back, rename the files, and copy them again. The issue is that my Spark job takes 16 minutes to complete, but reading from S3, renaming, and writing back to S3 takes another 15 minutes.

Is there any way I can rename my output files directly? I am OK with keeping the part-00000 part of the name.

This is how I save my DataFrame:

dfMainOutputFinalWithoutNull.repartition(50)
  .write.partitionBy("DataPartition", "PartitionYear")
  .format("csv")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("nullValue", "")
  .option("delimiter", "\t")
  .option("quote", "\u0000")
  .option("header", "true")
  .option("codec", "bzip2")
  .save(outputFileURL)

Any ideas on how to use the Hadoop FileSystem API in this case?

Currently I am doing the rename like below:

val finalFileName = finalPrefix + DataPartitionName + "." + YearPartition + "." +
  intFileCounter + "." + fileVersion + currentTime + fileExtention
val dest = new Path(mainFileURL + "/" + finalFileName)
fs.rename(urlStatus.getPath, dest)

The problem is that my output is about 50 GB and is split across a very large number of files, and renaming that many files takes a very long time.
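For context, the surrounding loop looks roughly like this. This is only a sketch: `fs` is obtained from the output URI, and the final file name here is simplified to a counter (in my real job it is built from the partition values as shown above).

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object RenameParts {
  def renameParts(mainFileURL: String): Unit = {
    val conf = new Configuration()
    // FileSystem for the scheme of the output URL (s3a://..., hdfs://..., file://...)
    val fs = FileSystem.get(new URI(mainFileURL), conf)

    var intFileCounter = 0
    // Recursively list every file under the output directory
    val it = fs.listFiles(new Path(mainFileURL), true)
    while (it.hasNext) {
      val urlStatus = it.next()
      if (urlStatus.getPath.getName.startsWith("part-")) {
        // Illustrative name; my real job builds it from partition values
        val finalFileName = s"output.$intFileCounter.csv.bz2"
        val dest = new Path(urlStatus.getPath.getParent, finalFileName)
        fs.rename(urlStatus.getPath, dest)
        intFileCounter += 1
      }
    }
  }
}
```

Note that S3 has no native rename: the S3A connector implements `rename` as a server-side copy followed by a delete, so "renaming" N files effectively moves the data a second time, which is presumably why this step is so slow for me.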

Cost-wise it is also expensive, because my EMR cluster runs longer and copying the data again costs extra.
