I am trying to save an RDD to AWS S3 using PySpark, but I get a "directory already exists" error.
The statement below works fine when the "content1" folder does not exist yet, but if I try to save additional files to the same folder, it fails with the error above.
rddFilteredData.repartition(5).saveAsTextFile("s3a://partners/research/content1", compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
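The only workaround I have found so far is to write each run into a fresh, uniquely named subfolder, so the existing-directory check never triggers. A rough sketch of what I mean is below (the timestamped "run-..." path is just something I made up), but it scatters the output across many folders:

    from datetime import datetime

    # Workaround sketch: write each run into its own subfolder so the
    # "output directory already exists" check never fires.
    run_id = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    output_path = "s3a://partners/research/content1/run-" + run_id

    rddFilteredData.repartition(5).saveAsTextFile(
        output_path,
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
    )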
Also, when the above command works, it creates part files named like part-00000x.gz, which is fine, but:
- How do I give them proper names such as research-results-00000x.gz? (My current post-processing idea is sketched after this list.)
- Does this mean that whenever I want to save additional files to the "content1" folder, I first need to remove/move the already-existing files because of name conflicts?
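For the renaming question, the only thing I can think of is to rewrite the part files after the job finishes using boto3, roughly like the sketch below (the bucket/prefix values are just placeholders, and since S3 has no real rename, it copies each object and deletes the original). This feels clunky, so I would prefer a way to control the names from Spark itself:

    import boto3

    # Post-processing sketch: rename part files produced by saveAsTextFile.
    # Bucket and prefix are placeholders, not my real values.
    s3 = boto3.client("s3")
    bucket = "partners"
    prefix = "research/content1/"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            name = key.rsplit("/", 1)[-1]
            # Only touch the part-* files, not markers like _SUCCESS.
            if name.startswith("part-"):
                new_key = prefix + name.replace("part-", "research-results-", 1)
                # S3 has no rename, so copy to the new key, then delete the old one.
                s3.copy_object(
                    Bucket=bucket,
                    Key=new_key,
                    CopySource={"Bucket": bucket, "Key": key},
                )
                s3.delete_object(Bucket=bucket, Key=key)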
So what is the right way to save an RDD to an existing bucket/folder that handles the above scenarios? Thanks in advance.