
Writing files via rdd.saveAsPickleFile(output_path) fails if the directory already exists. While that is a good safeguard against accidental file deletion, I was wondering if there is an option to explicitly overwrite the folder or its files, similar to the DataFrame API:

df.write.mode('overwrite').format('json').save(output_path)

Note: two earlier questions (here and here) have asked this before but did not receive explicit answers.
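
One workaround is to delete the output path through the Hadoop FileSystem API before saving again. A minimal, untested sketch, assuming a live SparkContext named sc and an RDD named rdd (both names are placeholders):

# Delete the output directory via the JVM's Hadoop FileSystem API,
# so this works on HDFS as well as the local filesystem.
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
path = sc._jvm.org.apache.hadoop.fs.Path(output_path)
if fs.exists(path):
    fs.delete(path, True)  # True = delete recursively
rdd.saveAsPickleFile(output_path)

Note that this is a manual delete-then-write, not a true overwrite mode.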

Georges Kohnen
  • And if you're still in doubt [How to set hadoop configuration values from pyspark](https://stackoverflow.com/q/27033823/8371915) – Alper t. Turker Apr 10 '18 at 12:33
  • In that question, it is suggested to convert to DataFrames. Instead, I would like to know whether it is possible to overwrite the output directly from rdd.saveAsPickleFile(output_path) – Georges Kohnen Apr 10 '18 at 12:37
  • The accepted answer specifically describes how to overwrite with the legacy API. You won't get a more explicit answer than that. – Alper t. Turker Apr 10 '18 at 13:04

1 Answer


If you would like to explicitly delete the folder where your pickle file is created each time you run your script, you could delete the directory at output_path, as described in https://stackoverflow.com/a/10840586/5671433, before you call

rdd.saveAsPickleFile(output_path)
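
For a local filesystem, the deletion step from that linked answer boils down to shutil.rmtree. A minimal sketch (note that this only works for local paths, not HDFS or S3):

import os
import shutil

# Remove any existing output directory before writing.
if os.path.isdir(output_path):
    shutil.rmtree(output_path)
rdd.saveAsPickleFile(output_path)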
Moiz Mansur