
All the methods I have found for doing this seem to produce a folder of part files, not a single .csv file that can be read into a pandas DataFrame.
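For context, a minimal sketch of the behaviour being described (the directory name `out_dir` and the toy data are only illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The DataFrame writer produces a *directory* of part files,
# e.g. out_dir/part-00000-....csv plus a _SUCCESS marker,
# not a single out_dir.csv file that pandas can read directly.
df.write.option("header", True).csv("out_dir")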

2 Answers


If your PySpark DataFrame is not that large, you can simply convert it to a Pandas DataFrame and then save it to disk:

# type(df) -> pyspark.sql.dataframe.DataFrame
df.toPandas().to_csv('out.csv')

However, if it is large, you won't be able to load it into a Pandas DataFrame, since the whole dataset has to fit into the driver's memory.

You can also use write.csv to write your data to multiple files and then use a loop to read all the chunks back with Pandas:

import pathlib

import pandas as pd

# Write the PySpark DataFrame as a folder of CSV part files
df.write.option('header', True).csv('bigdata')

# Read each part file and concatenate them into a single Pandas DataFrame
data = []
for csvfile in pathlib.Path('bigdata').glob('*.csv'):
    data.append(pd.read_csv(csvfile))
dfPD = pd.concat(data, ignore_index=True)
Corralien

You can use one of the following options:

  1. If the file fits on a single node and you have enough free disk space on that node, you can use the following line:

    df.toPandas().to_csv("fileName.csv", header=True)

  2. You can try hdfs dfs -getmerge to merge all partitions into a single file, as shown below. You can use subprocess to call this command from PySpark (see the sketch after this list).

    hdfs dfs -getmerge /path_to_hdfs/*.* /local_path/fileName.csv

  3. If the file must stay on HDFS, you can use the answer on Spark dataframe save in single file on hdfs location.
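A minimal sketch of option 2 invoked from Python via subprocess (the HDFS and local paths are the placeholders from the command above; adjust them to your environment):

import subprocess

# Merge all part files from the HDFS output directory into one local CSV.
# '/path_to_hdfs/*.*' and '/local_path/fileName.csv' are placeholders;
# the HDFS shell expands the glob on the HDFS side itself.
result = subprocess.run(
    ["hdfs", "dfs", "-getmerge", "/path_to_hdfs/*.*", "/local_path/fileName.csv"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    raise RuntimeError(f"getmerge failed: {result.stderr}")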

Hope this helps.

ozlemg