I'm using Spark 2.3 and I need to save a Spark DataFrame to a csv file, and I'm looking for a better way to do it. Looking over related/similar questions I found this one, but I need something more specific:

1. If the DataFrame is too big, how can I avoid using Pandas? I used the toCSV() function (code below) and it produced an Out Of Memory error (could not allocate memory). Is directly writing to a csv with file I/O a better way? Can it preserve the separators?

2. Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file, and when the files are merged it will have headers in the middle. Am I wrong?

3. Is using Spark write and then hadoop getmerge better than using coalesce in terms of performance?

(Rough sketches of the two alternatives I have in mind are at the end of the post, after the toCSV() code.)
import os

import pandas as pd

# wfu (random_filename) and HDFSUtil (getmerge / rmdir) are our own helper modules, not shown here.


def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """Write spark_df to hadoop, merge the part files and save them as a local csv.

    Parameters
    ----------
    spark_df: incoming Spark DataFrame
    n: number of rows to keep (all rows if None)
    save_csv: filename for the exported csv (a temporary name is used if None)
    csv_sep: field separator
    csv_quote: quote character

    Returns
    -------
    pd_df: pandas DataFrame holding the merged rows
    """
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)
    # write spark_df to hadoop, keeping only n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # merge the part files from hadoop into a single local file, then drop the temp folder
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # read the merged file into a pandas df (this is where the OOM happens), remove the tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)
    # re-write the csv file, this time with the header
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)
    return pd_df
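For reference, here is roughly what I have in mind for the coalesce(1) approach in question 2. It is an untested sketch: the save_single_csv name, the part-* glob pattern and the shutil moves are my own assumptions, and it only works if the output directory is on the local filesystem (on HDFS I would still need getmerge or copyToLocal).

import glob
import os
import shutil


def save_single_csv(spark_df, out_path, tmp_dir, csv_sep=',', csv_quote='"'):
    # coalesce(1) forces a single partition, so Spark writes exactly one part file
    # and therefore exactly one header line -- at the cost of funnelling all the
    # data through a single task
    (spark_df.coalesce(1)
             .write
             .option("header", "true")
             .csv(tmp_dir, sep=csv_sep, quote=csv_quote))
    # Spark writes a directory, not a file, so pick up the lone part file,
    # move it to the requested path and drop the temp directory
    part_file = glob.glob(os.path.join(tmp_dir, "part-*.csv"))[0]
    shutil.move(part_file, out_path)
    shutil.rmtree(tmp_dir)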
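And this is the kind of Pandas-free rewrite I was thinking of for questions 1 and 3: the same wfu / HDFSUtil helpers as above, but instead of the Pandas round trip the header is prepended with plain file I/O, streaming the merged file line by line so the full data never has to fit in memory. Again just a sketch (toCSV_no_pandas is a hypothetical name, and it assumes the column names don't themselves need quoting).

def toCSV_no_pandas(spark_df, save_csv, csv_sep=',', csv_quote='"'):
    # write the part files to hadoop without headers, exactly as in toCSV above
    tmpfoldername = wfu.random_filename()
    spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # merge the part files into one local file and drop the temp folder
    tmpfilename = wfu.random_filename() + '.csv'
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # prepend the header, then copy the merged data line by line
    with open(save_csv, 'w') as out_f, open(tmpfilename) as in_f:
        out_f.write(csv_sep.join(spark_df.columns) + '\n')
        for line in in_f:
            out_f.write(line)
    os.remove(tmpfilename)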