I have the following piece of code that writes an RDD out to S3:
```scala
rdd
  // Drop the header (the first line of the first partition).
  .mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  // Assign the key for partitionBy. If the key column doesn't exist
  // (colIndex == -1), use null so all the data goes to the part-00000 file.
  .map(line => if (colIndex == -1) (null, line) else (line.split(TILDE)(colIndex), line))
  .partitionBy(customPartitioner)
  .map { case (_, line) => line }
  // Add empty columns, change the order, and return the modified string.
  .map(line => addEmptyColumns(line, schemaIndexArray))
  .saveAsTextFile(s"s3a://$bucketName/$serviceName/$folderPath")
```
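For context, here is a minimal sketch of what a partitioner like `customPartitioner` could look like (hypothetical; my actual implementation may differ, but per the comment above, a null key has to map to partition 0 so that everything lands in part-00000):

```scala
import org.apache.spark.Partitioner

// Hypothetical sketch of customPartitioner: hash-partition rows on the key
// column; a null key (the colIndex == -1 case) always goes to partition 0.
class ColumnKeyPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts

  override def getPartition(key: Any): Int = key match {
    case null => 0
    case k =>
      // Non-negative modulo, since hashCode can be negative.
      val raw = k.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
  }
}
```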
When the output path is on HDFS instead of S3, the same code takes about one fifth of the time. Are there any other approaches to fixing this? I am setting the Hadoop configuration in Spark.
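To be concrete, the Hadoop configuration is applied roughly like this (a sketch with placeholder values; the `fs.s3a.*` keys are standard S3A options and my real credentials and settings are elided):

```scala
val hadoopConf = sparkContext.hadoopConfiguration

// Standard S3A settings (placeholder values, not my actual configuration).
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hadoopConf.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
hadoopConf.set("fs.s3a.connection.maximum", "100")
```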