
All the methods I have found for doing this seem to produce a folder of part files, not a single .csv file that can be read into a pandas DataFrame.
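For context, a minimal sketch of the behaviour being described (the directory name `out_dir` and the toy data are only illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The DataFrame writer produces a *directory* of part files,
# e.g. out_dir/part-00000-....csv plus a _SUCCESS marker,
# not a single out_dir.csv file that pandas can read directly.
df.write.option("header", True).csv("out_dir")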

2 Answers


If your PySpark DataFrame is not that large, you can simply convert it to a Pandas DataFrame and then save it to disk:

# type(df) -> pyspark.sql.dataframe.DataFrame
df.toPandas().to_csv('out.csv')

However, if it is large, you won't be able to load it into a Pandas DataFrame, since the whole dataset has to fit into the driver's memory.

You can also use write.csv to write your data to multiple files and then use a loop to read all the chunks back with Pandas:

import pathlib

import pandas as pd

# Write the PySpark DataFrame as a folder of CSV part files
df.write.option('header', True).csv('bigdata')

# Read each part file and concatenate them into a single Pandas DataFrame
data = []
for csvfile in pathlib.Path('bigdata').glob('*.csv'):
    data.append(pd.read_csv(csvfile))
dfPD = pd.concat(data, ignore_index=True)
Corralien

You can use one of the following options:

  1. If the file fits on a single node and you have enough free disk space on that node, you can use the following line:

    df.toPandas().to_csv("fileName.csv", header=True)

  2. You can try hdfs dfs -getmerge to merge all partitions into a single file, as shown below. You can use subprocess to call this command from PySpark (see the sketch after this list).

    hdfs dfs -getmerge /path_to_hdfs/*.* /local_path/fileName.csv

  3. If the file must stay on HDFS, you can use the answer on Spark dataframe save in single file on hdfs location.
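A minimal sketch of option 2 invoked from Python via subprocess (the HDFS and local paths are the placeholders from the command above; adjust them to your environment):

import subprocess

# Merge all part files from the HDFS output directory into one local CSV.
# '/path_to_hdfs/*.*' and '/local_path/fileName.csv' are placeholders;
# the HDFS shell expands the glob on the HDFS side itself.
result = subprocess.run(
    ["hdfs", "dfs", "-getmerge", "/path_to_hdfs/*.*", "/local_path/fileName.csv"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    raise RuntimeError(f"getmerge failed: {result.stderr}")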

Hope this helps.

ozlemg