I am trying to save a filtered DataFrame back to the same source file.
I wrote the code below to load the content of each file in a directory into a separate DataFrame, filter it, and save it back to the same file:
rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
# collect the (path, content) pairs of the RDD to a list on the driver
list_elements = rdd.collect()
for path, data in list_elements:
    # parse the file's content into a DataFrame, drop rows where d == 721,
    # and write the result back to the original path
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    df = df.filter('d != 721')
    df.write.save(path, format="json", mode="overwrite")
I was expecting it to overwrite each file with the updated data, but instead it creates a folder named after the file, containing part files in a structure like the one below:
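The layout looks roughly like this (the exact part-file name varies per run):

<original_file_name>.txt/
    _SUCCESS
    part-00000-<uuid>-c000.json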
How can I save each updated DataFrame back to its original source file (.txt)? Thanks in advance.
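For reference, the only workaround I've come up with is to skip df.write entirely and write the filtered rows back from the driver with plain Python file I/O. A minimal sketch, assuming each file fits in driver memory (the path and the column d come from my code above; the file: prefix handling is my assumption about what wholeTextFiles returns for local files):

rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
for path, data in rdd.collect():
    df = spark.read.json(spark.sparkContext.parallelize([data]))
    df = df.filter('d != 721')
    # toJSON() yields one JSON string per row; collect them on the driver
    rows = df.toJSON().collect()
    # wholeTextFiles seems to return paths like file:/content/... for local
    # files (my assumption), so strip the scheme before using plain open()
    local_path = path[len("file:"):] if path.startswith("file:") else path
    with open(local_path, "w") as f:
        f.write("\n".join(rows) + "\n")

This writes each file back in place as JSON lines, but it funnels everything through the driver, so I'd prefer a Spark-native approach if one exists.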