I am trying to save an RDD to AWS S3 using PySpark, but I get a "directory already exists" error.
The statement below works fine when the "content1" folder does not exist yet, but if I try to save additional files to the same folder, it fails with the error above.
rddFilteredData.repartition(5).saveAsTextFile("s3a://partners/research/content1", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
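The only workaround I have found so far is to write each run into a fresh, uniquely named subfolder, so the existing-directory check never triggers. A rough sketch of what I mean is below (the timestamped "run-..." path is just something I made up), but it scatters the output across many folders:

    from datetime import datetime

    # Workaround sketch: write each run into its own subfolder so the
    # "output directory already exists" check never fires.
    run_id = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    output_path = "s3a://partners/research/content1/run-" + run_id

    rddFilteredData.repartition(5).saveAsTextFile(
        output_path,
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
    )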
Also, when the above command works, it creates part files named like part-00000x.gz, which is fine, but:
- How do I give them proper names such as research-results-00000x.gz? (My current post-processing idea is sketched after this list.)
- Does this mean that whenever I want to save additional files to the "content1" folder, I first need to remove/move the already-existing files because of name conflicts?
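For the renaming question, the only thing I can think of is to rewrite the part files after the job finishes using boto3, roughly like the sketch below (the bucket/prefix values are just placeholders, and since S3 has no real rename, it copies each object and deletes the original). This feels clunky, so I would prefer a way to control the names from Spark itself:

    import boto3

    # Post-processing sketch: rename part files produced by saveAsTextFile.
    # Bucket and prefix are placeholders, not my real values.
    s3 = boto3.client("s3")
    bucket = "partners"
    prefix = "research/content1/"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            name = key.rsplit("/", 1)[-1]
            # Only touch the part-* files, not markers like _SUCCESS.
            if name.startswith("part-"):
                new_key = prefix + name.replace("part-", "research-results-", 1)
                # S3 has no rename, so copy to the new key, then delete the old one.
                s3.copy_object(
                    Bucket=bucket,
                    Key=new_key,
                    CopySource={"Bucket": bucket, "Key": key},
                )
                s3.delete_object(Bucket=bucket, Key=key)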
So what is the right way to save an RDD to an existing bucket/folder that handles the above scenarios? Thanks in advance.