I have done quite a lot of research on this question and nowhere found a satisfying answer. I have to rename the output files coming out of Spark.
Currently I write my Spark DataFrame to S3, then read it back, rename the files, and copy them again. The issue is that my Spark job itself takes 16 minutes to complete, but reading from S3, renaming, and writing back to S3 takes another 15 minutes.
Is there any way I can rename my output files as part of the write itself? The default part-00000 names do not work for me.
This is how I save my DataFrame:
dfMainOutputFinalWithoutNull.repartition(50)
  .write.partitionBy("DataPartition", "PartitionYear")
  .format("csv")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("nullValue", "")
  .option("delimiter", "\t")
  .option("quote", "\u0000")
  .option("header", "true")
  .option("codec", "bzip2")
  .save(outputFileURL)
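
For reference, this is how I inspect what the write produces (a minimal sketch; spark is my SparkSession and outputFileURL is the same path passed to .save above):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(new URI(outputFileURL), conf)

// partitionBy("DataPartition", "PartitionYear") produces paths roughly of the form
//   <outputFileURL>/DataPartition=<value>/PartitionYear=<value>/part-00000-<uuid>.csv.bz2
val partFiles = fs.globStatus(new Path(outputFileURL + "/*/*/part-*"))
partFiles.foreach(status => println(status.getPath))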
Any ideas on how to use a Hadoop file format in this case?
Currently I am doing it like below:
// build the target name and rename each part file in place
val finalFileName = finalPrefix + DataPartitionName + "." + YearPartition + "." + intFileCounter + "." + fileVersion + currentTime + fileExtention
val dest = new Path(mainFileURL + "/" + finalFileName)
fs.rename(urlStatus.getPath, dest)
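
That snippet runs inside a loop over the part files. A rough, self-contained sketch of the whole rename pass (assuming finalPrefix, fileVersion, currentTime, and fileExtention are defined earlier in my job; the partition values are parsed from the directory names):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI(mainFileURL), spark.sparkContext.hadoopConfiguration)
var intFileCounter = 1

for (urlStatus <- fs.globStatus(new Path(mainFileURL + "/*/*/part-*"))) {
  // directory layout: .../DataPartition=<name>/PartitionYear=<year>/part-...
  val yearDir = urlStatus.getPath.getParent
  val YearPartition = yearDir.getName.split("=")(1)
  val DataPartitionName = yearDir.getParent.getName.split("=")(1)

  val finalFileName = finalPrefix + DataPartitionName + "." + YearPartition + "." +
    intFileCounter + "." + fileVersion + currentTime + fileExtention
  val dest = new Path(mainFileURL + "/" + finalFileName)

  // on S3 a "rename" is really a server-side copy followed by a delete,
  // which is why this pass is slow for a large number of files
  fs.rename(urlStatus.getPath, dest)
  intFileCounter += 1
}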
The problem is that I have 50 GB of output data, which produces a very large number of files, and renaming that many files takes a very long time.
It is also expensive: my EMR cluster runs longer, and copying the data again costs extra.
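
One idea I am experimenting with to cut the file count (just a sketch, not a confirmed fix): repartition by the same columns used in partitionBy, so all rows for one (DataPartition, PartitionYear) combination land in a single task and each output directory typically gets a single part file:

import org.apache.spark.sql.functions.col

// clustering the rows by the partition columns before the write means each
// (DataPartition, PartitionYear) directory usually receives one part file,
// so there are far fewer files to rename afterwards
dfMainOutputFinalWithoutNull
  .repartition(col("DataPartition"), col("PartitionYear"))
  .write.partitionBy("DataPartition", "PartitionYear")
  .format("csv")
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .option("nullValue", "")
  .option("delimiter", "\t")
  .option("quote", "\u0000")
  .option("header", "true")
  .option("codec", "bzip2")
  .save(outputFileURL)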