
I'm trying to save a Spark DataFrame (more than 20 GB) to a single JSON file in Amazon S3. My code to save the dataframe looks like this:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum size Amazon allows for a single upload is 5 GB.

Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?

By the way, I need the data in a single file because another user is going to download it afterwards.

*I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.

Thanks a lot

JG

  • I just saw that using s3a instead of s3n could solve my problem (http://wiki.apache.org/hadoop/AmazonS3), but the Hadoop version I'm using (Hadoop 2.0.0-cdh4.2.0) does not support s3a. Any ideas? Thanks again. – jegordon Apr 28 '15 at 03:07
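For reference, if a Spark build on Hadoop 2.6+ (where the s3a filesystem ships with hadoop-aws) can be used, the switch might look roughly like this in PySpark. This is only a sketch: the credentials and paths are placeholders, `sc` and `dataframe` are assumed to already exist, and it goes through the internal `sc._jsc.hadoopConfiguration()` handle for brevity. Because s3a uploads large objects in parts, a single output file can exceed S3's 5 GB single-PUT limit.

hadoop_conf = sc._jsc.hadoopConfiguration()              # Hadoop Configuration of the running SparkContext
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder credentials
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

dataframe.repartition(1).save("s3a://mybucket/testfile", "json")   # same Spark 1.3-style call, s3a scheme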

3 Answers


I would try separating the large dataframe into a series of smaller dataframes that you then append to the same target path:

df.write.mode('append').json(yourtargetpath)
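As a rough sketch of that idea (assuming Spark 1.4+, where `DataFrame.randomSplit` and the `DataFrameWriter` API shown above are available; the chunk count and path are placeholders), you could split the dataframe into roughly equal pieces and append each one in turn:

num_chunks = 8                                           # placeholder: choose so each piece stays well under 5 GB
chunks = df.randomSplit([1.0] * num_chunks, seed=42)     # split into ~equal random pieces
for chunk in chunks:
    chunk.write.mode("append").json("s3n://mybucket/testfile")   # each pass adds more part files under the target

Note that append mode adds extra part files under the target path rather than growing a single file, which is what keeps each individual upload below S3's 5 GB single-object limit.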
  • @TheRandomSuite: By any chance, do you know if it is possible to avoid the hadoopish format and store the data to a file under an s3 key name of my choice, instead of the directory with `_SUCCESS` and `part-*` files? – lisak May 19 '16 at 20:37

Try this

import org.apache.spark.sql.SaveMode   // needed for SaveMode.Append

dataframe.write.format("org.apache.spark.sql.json").mode(SaveMode.Append).save("hdfs://localhost:9000/sampletext.txt")

s3a is not a production-ready filesystem in Spark yet, I think. I would also say the design is not sound: repartition(1) is going to be terrible, because you are telling Spark to merge all partitions into a single one. I would suggest convincing the downstream consumer to download the contents of a folder rather than a single file.
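What that could look like (sketch only, using the Spark 1.3-style API from the question; the path is a placeholder): write without `repartition(1)` so that each partition becomes its own part file under a common prefix, and let the consumer download the whole prefix, for example with `aws s3 sync` or `aws s3 cp --recursive`.

dataframe.save("s3n://mybucket/testfile_parts", "json")   # one part file per partition, each far smaller than the full 20 GB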
