
Writing files via rdd.saveAsPickleFile(output_path) fails if the directory already exists. While that is a good safeguard against accidental file deletion, I was wondering if there is an option to explicitly overwrite the folder or its files, similar to the DataFrame API:

df.write.mode('overwrite').format('json').save(output_path)

Note: two earlier questions (here and here) have asked this before but did not receive explicit answers.
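
One workaround is to delete the output path through the Hadoop FileSystem API before saving again. A minimal, untested sketch, assuming a live SparkContext named sc and an RDD named rdd (both names are placeholders):

# Delete the output directory via the JVM's Hadoop FileSystem API,
# so this works on HDFS as well as the local filesystem.
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
path = sc._jvm.org.apache.hadoop.fs.Path(output_path)
if fs.exists(path):
    fs.delete(path, True)  # True = delete recursively
rdd.saveAsPickleFile(output_path)

Note that this is a manual delete-then-write, not a true overwrite mode.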

Georges Kohnen
  • And if you're still in doubt [How to set hadoop configuration values from pyspark](https://stackoverflow.com/q/27033823/8371915) – Alper t. Turker Apr 10 '18 at 12:33
  • In that question, it is suggested to convert to DataFrames. Instead, I would like to know whether it is possible to overwrite the output directly from rdd.saveAsPickleFile(output_path) – Georges Kohnen Apr 10 '18 at 12:37
  • The accepted answer specifically describes how to overwrite with the legacy API. You won't get a more explicit answer than that. – Alper t. Turker Apr 10 '18 at 13:04

1 Answer


If you would like to explicitly delete the folder where your pickle file is created each time you run your script, you could delete the directory at output_path, as described in https://stackoverflow.com/a/10840586/5671433, before you call

rdd.saveAsPickleFile(output_path)
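
For a local filesystem, the deletion step from that linked answer boils down to shutil.rmtree. A minimal sketch (note that this only works for local paths, not HDFS or S3):

import os
import shutil

# Remove any existing output directory before writing.
if os.path.isdir(output_path):
    shutil.rmtree(output_path)
rdd.saveAsPickleFile(output_path)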
Moiz Mansur