All the methods mentioned for doing this seem to produce a folder, not a single .csv file that can be read into a pandas dataframe.
Is your Spark dataframe large? – Corralien Feb 21 '23 at 09:09
2 Answers
If your PySpark dataframe is not that large, you can simply convert it to a Pandas dataframe and then save it to disk:
# type(df) -> pyspark.sql.dataframe.DataFrame
df.toPandas().to_csv('out.csv')
However, if it is large, you won't be able to load it into a Pandas dataframe. You can instead use write.csv to write your data to multiple files and then use a loop to read all the chunks with Pandas:
import pathlib
import pandas as pd

# Write the Spark dataframe as multiple CSV part files
df.write.option('header', True).csv('bigdata')

# Read each part file with Pandas and concatenate into one dataframe
data = []
for csvfile in pathlib.Path('bigdata').glob('*.csv'):
    data.append(pd.read_csv(csvfile))
dfPD = pd.concat(data)

You can use one of the following options:
If the file fits on a single node and you have enough free disk space on that node, you can use the following line.
df.toPandas().to_csv("fileName.csv", header=True)
You can try hdfs getmerge to merge all partitions into a single file as follows. You can use subprocess to call the command below from PySpark (see the sketch after the command).
hdfs dfs -getmerge /path_to_hdfs/*.* /local_path/fileName.csv
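For example, here is a minimal sketch of calling getmerge from PySpark with subprocess. It assumes the part files have already been written with df.write.csv and that the hdfs CLI is on the PATH; the HDFS and local paths are placeholders.

import subprocess

# Merge the HDFS part files into one local CSV (paths are placeholders).
# Note: if each part file was written with a header row, the merged file
# will contain repeated header lines that you may need to drop afterwards.
subprocess.run(
    ['hdfs', 'dfs', '-getmerge', '/path_to_hdfs/bigdata', '/local_path/fileName.csv'],
    check=True,
)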
If the file must stay on HDFS, you can use the answer on Spark dataframe save in single file on hdfs location.
Hope this helps.
