
I am quite new to PySpark. I am trying to read and then save a CSV file using Azure Databricks.

After saving the file I see many other files such as "_committed", "_started" and "_SUCCESS", and finally the CSV file itself with a totally different name.

I have already tried repartition(1) and coalesce(1) on the DataFrame, but that only addresses the case where Spark splits the CSV into multiple part files; it does not get rid of the extra files or give the output the name I want. Is there anything that can be done using PySpark?
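For context, here is roughly what I am doing (a minimal sketch; the mount paths are just placeholders, and `spark` is the session that Databricks provides):

```python
# Read a CSV into a Spark DataFrame and write it back out to DBFS.
df = spark.read.csv("/mnt/input/data.csv", header=True, inferSchema=True)

# The output path becomes a directory containing _started_*, _committed_*,
# _SUCCESS and a part-00000-<uuid>.csv file, rather than a single CSV
# with the name I chose.
df.write.mode("overwrite").option("header", True).csv("/mnt/output/result")
```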

Alex Ott
Prakazz
  • Does this answer your question? [How do you write a CSV back to Azure Blob Storage using Databricks?](https://stackoverflow.com/questions/63851044/how-do-you-write-a-csv-back-to-azure-blob-storage-using-databricks) – Axel R. Jul 01 '21 at 15:51

2 Answers


You can do the following:

df.toPandas().to_csv("path/to/file.csv")

It will create a single CSV file, as you expect.
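A slightly fuller sketch of this approach (the paths here are only placeholders; note that pandas writes through the local file API, so on Databricks a DBFS location is usually addressed with the /dbfs/ prefix):

```python
# Read with Spark, then collect to the driver as a pandas DataFrame
# and write a single CSV with exactly the name you want.
df = spark.read.csv("/mnt/input/data.csv", header=True, inferSchema=True)

# toPandas() pulls every row onto the driver, so this is only practical
# when the data fits comfortably in driver memory.
df.toPandas().to_csv("/dbfs/mnt/output/result.csv", index=False)
```

Keep in mind that toPandas() collects the entire DataFrame to the driver, so this only works for data small enough to fit in driver memory.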

Axel R.

Those are metadata/marker files that Databricks creates by default when saving from PySpark; you can't eliminate them through the DataFrame writer. Using coalesce(1) you can at least save the data as a single file instead of multiple partitions.
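A minimal sketch of that (the output path is a placeholder; Spark still writes a directory, but it will contain only one part file plus the marker files):

```python
# Coalesce to one partition so Spark writes a single part-00000-*.csv
# file. The _SUCCESS, _committed_* and _started_* markers are still
# created by the commit protocol.
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", True)
   .csv("/mnt/output/result"))
```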

Robinhood