I am trying to write large JSON RDDs to AWS S3 with Scala and Spark on Zeppelin. When I write with the saveAsTextFile() method, it creates _temporary/0/.. folders and writes each RDD partition into them. That step is quite fast, but afterwards it appears to copy each part file into the target folder, and this copying is very slow:
rdd.saveAsTextFile(outputPath + filename)
I tried converting the RDD to a Spark DataFrame and writing it with the following code, but the behavior was the same:
// try a direct output committer so the parts go straight to the target
sc.hadoopConfiguration.set("spark.hadoop.mapred.output.committer.class", "com.appsflyer.spark.DirectOutputCommitter")
val df = rdd.toDF()
df.repartition(50).write.json(outputPath + filename)
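As an aside, I am not certain the spark.hadoop. prefix belongs there when the key is set directly on sc.hadoopConfiguration; as far as I understand, that prefix is only recognized on SparkConf properties and gets stripped when Spark copies them into the Hadoop configuration, so the direct form would presumably use the plain Hadoop key:

// Assumption: when setting on sc.hadoopConfiguration directly, the plain
// Hadoop key is used; the "spark.hadoop." prefix applies only to SparkConf
// properties that Spark later copies into the Hadoop configuration.
sc.hadoopConfiguration.set("mapred.output.committer.class",
  "com.appsflyer.spark.DirectOutputCommitter")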
I found a way to write it as a Parquet file:
// DirectParquetOutputCommitter writes straight to the destination, no _temporary step
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
val dataDF = rdd.toDF()
dataDF.repartition(10).write.parquet(outputPath + filename)
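One detail in case it matters: as far as I know, direct committers like these are only safe with speculative execution disabled, since a speculative duplicate attempt would also write straight to the destination. A minimal sketch of disabling it at context creation, assuming the standard spark.speculation property (in practice I would set it in spark-defaults.conf or the Zeppelin interpreter settings):

import org.apache.spark.{SparkConf, SparkContext}

// Assumption: spark.speculation is the relevant property and must be set
// before the SparkContext exists; shown here via SparkConf for illustration.
val conf = new SparkConf()
  .setAppName("json-to-s3")          // hypothetical app name
  .set("spark.speculation", "false") // direct committers are unsafe with speculation
val sparkContext = new SparkContext(conf)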
This works, and it is fast because it does not create a temporary folder; it writes directly to the target. But I need to write this RDD as a JSON text file. Is there any way to write JSON files to AWS S3 without creating the temporary folder? Or is there a committer for JSON files like org.apache.spark.sql.parquet.DirectParquetOutputCommitter?
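One workaround I have been considering, but have not been able to confirm for S3, is the newer FileOutputCommitter algorithm (version 2), which as I understand it promotes task output to the final destination at task commit instead of copying everything again at job commit. Assuming my Hadoop version (2.7+) supports it, the sketch would be:

// Assumption: Hadoop 2.7+ where FileOutputCommitter algorithm version 2 exists.
// Version 2 moves task output into the final location at task commit,
// so the slow second copy at job commit is avoided.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

val df = rdd.toDF()
df.repartition(50).write.json(outputPath + filename)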