I am trying to save a filtered DataFrame back to the same source file.
I wrote the code below to load the content of each file in a directory into a separate DataFrame, filter it, and save it back to the same file:
rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
# collect the (path, content) pairs of the RDD to a list on the driver
list_elements = rdd.collect()
for path, data in list_elements:
    # parse the file's content into a DataFrame, drop rows where d == 721,
    # and write the result back to the original path
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    df = df.filter('d != 721')
    df.write.save(path, format="json", mode="overwrite")
I was expecting it to overwrite each file with the updated data, but instead it creates a folder named after the file, containing part files in a structure like the one below:
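The layout looks roughly like this (the exact part-file name varies per run):

<original_file_name>.txt/
    _SUCCESS
    part-00000-<uuid>-c000.json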
How can I save each updated DataFrame back to its original source file (.txt)? Thanks in advance.
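For reference, the only workaround I've come up with is to skip df.write entirely and write the filtered rows back from the driver with plain Python file I/O. A minimal sketch, assuming each file fits in driver memory (the path and the column d come from my code above; the file: prefix handling is my assumption about what wholeTextFiles returns for local files):

rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
for path, data in rdd.collect():
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    df = df.filter('d != 721')
    # toJSON() yields one JSON string per row; collect them on the driver
    rows = df.toJSON().collect()
    # wholeTextFiles seems to return paths like file:/content/... for local
    # files (my assumption), so strip the scheme before using plain open()
    local_path = path[len("file:"):] if path.startswith("file:") else path
    with open(local_path, "w") as f:
        f.write("\n".join(rows) + "\n")

This writes each file back in place as JSON lines, but it funnels everything through the driver, so I'd prefer a Spark-native approach if one exists.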