
I have an output RDD in my Spark code, written in Python. I want to save it to Amazon S3 as a gzipped file. I have tried the following approaches. The call below correctly saves the output RDD to S3, but not in gzipped format:

output_rdd.saveAsTextFile("s3://<name-of-bucket>/")

The call below returns the error TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given):

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/", 
                        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
                       )

Please guide me on the correct way to do this.

1 Answer


You need to specify the output format as well: the confusing TypeError means that the required outputFormatClass argument is missing.

Try this:

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/", "org.apache.hadoop.mapred.TextOutputFormat", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

You can use any of the Hadoop-supported compression codecs (a runnable sketch follows the list):

  • gzip: org.apache.hadoop.io.compress.GzipCodec
  • bzip2: org.apache.hadoop.io.compress.BZip2Codec
  • LZO: com.hadoop.compression.lzo.LzopCodec
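
A minimal end-to-end sketch of both write paths, assuming the cluster already has S3 credentials configured; the bucket name and output prefixes are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="gzip-to-s3")

# Plain RDD of strings: pass the codec straight to saveAsTextFile.
lines = sc.parallelize(["record one", "record two", "record three"])
lines.saveAsTextFile("s3://<name-of-bucket>/text-gz/",
                     compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

# Key-value pairs: saveAsHadoopFile expects an RDD of (key, value) records.
pairs = lines.map(lambda line: (line, 1))
pairs.saveAsHadoopFile("s3://<name-of-bucket>/pairs-bz2/",
                       "org.apache.hadoop.mapred.TextOutputFormat",
                       compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec")  # any codec from the list

Each partition of the RDD is written as a separate compressed part file (part-00000.gz, part-00001.gz, ...), not as a single archive.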
– Avihoo Mamka
  • Not directly related to the question, but you may also want to consider using s3a or s3n for faster, high-volume writes. [reference: http://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3] – Pramit Oct 01 '16 at 01:55
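
For example, switching connectors is just a scheme change in the output URI (a sketch; it assumes the matching S3 connector, e.g. hadoop-aws for s3a, and credentials are configured on the cluster):

output_rdd.saveAsTextFile("s3a://<name-of-bucket>/",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")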