
I have an output RDD in my Spark code, written in Python. I want to save it to Amazon S3 as a gzipped file. I have tried the following approaches. The call below correctly saves the output RDD to S3, but not in gzipped format:

output_rdd.saveAsTextFile("s3://<name-of-bucket>/")

The call below returns the error TypeError: saveAsHadoopFile() takes at least 3 arguments (3 given):

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/", 
                        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
                       )

Please guide me on the correct way to do this.

1 Answer


You need to specify the output format as well: the confusing TypeError means that the required outputFormatClass argument is missing.

Try this:

output_rdd.saveAsHadoopFile("s3://<name-of-bucket>/", "org.apache.hadoop.mapred.TextOutputFormat", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

You can use any of the Hadoop-supported compression codecs (a runnable sketch follows the list):

  • gzip: org.apache.hadoop.io.compress.GzipCodec
  • bzip2: org.apache.hadoop.io.compress.BZip2Codec
  • LZO: com.hadoop.compression.lzo.LzopCodec
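
A minimal end-to-end sketch of both write paths, assuming the cluster already has S3 credentials configured; the bucket name and output prefixes are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="gzip-to-s3")

# Plain RDD of strings: pass the codec straight to saveAsTextFile.
lines = sc.parallelize(["record one", "record two", "record three"])
lines.saveAsTextFile("s3://<name-of-bucket>/text-gz/",
                     compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

# Key-value pairs: saveAsHadoopFile expects an RDD of (key, value) records.
pairs = lines.map(lambda line: (line, 1))
pairs.saveAsHadoopFile("s3://<name-of-bucket>/pairs-bz2/",
                       "org.apache.hadoop.mapred.TextOutputFormat",
                       compressionCodecClass="org.apache.hadoop.io.compress.BZip2Codec")  # any codec from the list

Each partition of the RDD is written as a separate compressed part file (part-00000.gz, part-00001.gz, ...), not as a single archive.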
– Avihoo Mamka
  • Not directly related to the question, but you may also want to consider using s3a or s3n for faster, high-volume writes. [reference: http://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3] – Pramit Oct 01 '16 at 01:55
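
For example, switching connectors is just a scheme change in the output URI (a sketch; it assumes the matching S3 connector, e.g. hadoop-aws for s3a, and credentials are configured on the cluster):

output_rdd.saveAsTextFile("s3a://<name-of-bucket>/",
                          compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")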