I am trying to write large JSON RDDs to AWS S3 with Scala and Spark on Zeppelin. When I write with the saveAsTextFile() method, it creates _temporary/0/.. folders and writes each RDD partition into them. That step is quite fast, but afterwards it appears to copy each part file into the target folder, and this copying is very slow:
rdd.saveAsTextFile(outputPath + filename)
I tried converting the RDD to a Spark DataFrame and writing it with the following code, but the behavior was the same:
// try a direct output committer so the parts go straight to the target
sc.hadoopConfiguration.set("spark.hadoop.mapred.output.committer.class", "com.appsflyer.spark.DirectOutputCommitter")
val df = rdd.toDF()
df.repartition(50).write.json(outputPath + filename)
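As an aside, I am not certain the spark.hadoop. prefix belongs there when the key is set directly on sc.hadoopConfiguration; as far as I understand, that prefix is only recognized on SparkConf properties and gets stripped when Spark copies them into the Hadoop configuration, so the direct form would presumably use the plain Hadoop key:

// Assumption: when setting on sc.hadoopConfiguration directly, the plain
// Hadoop key is used; the "spark.hadoop." prefix applies only to SparkConf
// properties that Spark later copies into the Hadoop configuration.
sc.hadoopConfiguration.set("mapred.output.committer.class",
  "com.appsflyer.spark.DirectOutputCommitter")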
I found a way to write it as a Parquet file:
// DirectParquetOutputCommitter writes straight to the destination, no _temporary step
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
val dataDF = rdd.toDF()
dataDF.repartition(10).write.parquet(outputPath + filename)
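One detail in case it matters: as far as I know, direct committers like these are only safe with speculative execution disabled, since a speculative duplicate attempt would also write straight to the destination. A minimal sketch of disabling it at context creation, assuming the standard spark.speculation property (in practice I would set it in spark-defaults.conf or the Zeppelin interpreter settings):

import org.apache.spark.{SparkConf, SparkContext}

// Assumption: spark.speculation is the relevant property and must be set
// before the SparkContext exists; shown here via SparkConf for illustration.
val conf = new SparkConf()
  .setAppName("json-to-s3")          // hypothetical app name
  .set("spark.speculation", "false") // direct committers are unsafe with speculation
val sparkContext = new SparkContext(conf)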
This works, and it is fast because it does not create a temporary folder; it writes directly to the target. But I need to write this RDD as a JSON text file. Is there any way to write JSON files to AWS S3 without creating the temporary folder? Or is there a committer for JSON files like org.apache.spark.sql.parquet.DirectParquetOutputCommitter?
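One workaround I have been considering, but have not been able to confirm for S3, is the newer FileOutputCommitter algorithm (version 2), which as I understand it promotes task output to the final destination at task commit instead of copying everything again at job commit. Assuming my Hadoop version (2.7+) supports it, the sketch would be:

// Assumption: Hadoop 2.7+ where FileOutputCommitter algorithm version 2 exists.
// Version 2 moves task output into the final location at task commit,
// so the slow second copy at job commit is avoided.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

val df = rdd.toDF()
df.repartition(50).write.json(outputPath + filename)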