118

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?

Thanks!

PyRsquared

9 Answers

261

If the data frame fits in driver memory and you want to save to the local file system, you can convert the Spark DataFrame to a local pandas DataFrame using the toPandas method and then simply use to_csv:

df.toPandas().to_csv('mycsv.csv')

Otherwise you can use spark-csv:

  • Spark 1.3

    df.save('mycsv.csv', 'com.databricks.spark.csv')
    
  • Spark 1.4+

    df.write.format('com.databricks.spark.csv').save('mycsv.csv')
    

In Spark 2.0+ you can use the csv data source directly:

df.write.csv('mycsv.csv')
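
For delimiters other than a comma (the pipe-delimited case asked about in the comments below), both approaches accept a separator option; a minimal sketch:

df.toPandas().to_csv('mycsv.csv', sep='|', index=False)

df.write.csv('mycsv.csv', sep='|', header=True)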
zero323
  • Super answer. For the first option, if I want to write to a pipe-delimited file rather than comma-delimited CSV, is this possible? – Brian Waters Jul 19 '17 at 11:16
  • 10
    If you have spark dataframes you can use `df.write.csv('/tmp/lookatme/')` and that will drop a set of csv files in `/tmp/lookatme` Using spark is significantly faster than serializing it in pandas. The only drawback is that you'll end up with a set of csvs instead of a single one and if the destination tool doesn't know how to concatenate them you will need to do it yourself. – Txangel Jul 19 '17 at 16:39
  • 1
    What a big deal it is to get a CSV out of Spark. Something interesting about that first solution is that `to_csv` works without needing to import pandas. `.toPandas` is part of Spark; maybe it imports pandas implicitly. – cardamom Sep 14 '17 at 01:12
  • 45
    You should be able to use `df.coalesce(1).write.csv('mycsv.csv')` if you insist on having a single output file – MichaelChirico Feb 21 '18 at 03:31
  • 1
    @Txangel thanks for your answer. However, when I use that it runs without any error but I can't find *any* csv created in the target location. Any thoughts? – Rotail Jun 13 '19 at 17:14
  • @Rotail [That's because it is not how this works in general](https://stackoverflow.com/a/51603898/10938362). – user10938362 Jun 14 '19 at 22:37
  • This is for an RDD dataframe right? As opposed to an SQL dataframe? I think the author wanted something for the latter, which I do not see in this solution. – spacedustpi Dec 09 '19 at 18:14
  • 4
    Using ```df.write.csv('mycsv.csv')``` exports the CSV to the HDFS environment. How can I get it in my local environment? – Tracy Jun 10 '20 at 14:57
  • @Tracy exact same question that I've been having. Have you found a solution to this? – lagrangian_headache May 12 '21 at 15:16
  • 1
    @etjk, you can copy a directory from hdfs to a local dir with: hdfs dfs -copyToLocal – Chingiz K. Sep 03 '21 at 02:37
  • Note that `toPandas` may actually change your schema, e.g. an int column containing null values will end up a float column in pandas (with NaNs). The actual integers, e.g. "100", will be written out as "100.0". – deepNdope Mar 11 '22 at 14:00
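
A minimal workaround sketch for the int-to-float issue mentioned in the last comment: cast the affected columns to pandas' nullable Int64 dtype after toPandas() (the column name "id" below is hypothetical):

pdf = df.toPandas()
pdf['id'] = pdf['id'].astype('Int64')  # keeps integers as integers; nulls become <NA>
pdf.to_csv('mycsv.csv', index=False)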
47

For Apache Spark 2+, to save the dataframe into a single CSV file, use the following command:

query.repartition(1).write.csv("cc_out.csv", sep='|')

Here 1 indicates that I need only one partition for the CSV; you can change it according to your requirements.

Hafiz Muhammad Shafiq
  • 8
    As indicated here: https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD.repartition it is recommended to use coalesce() instead of repartition() to increase performance ("If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.") – Seastar Nov 28 '18 at 15:34
  • 3
    @Seastar: While coalescing might have advantages in several use cases, your comment does not apply in this special case. If you want to have a .csv in your hdfs (or whatever), you will usually want one file and not dozens of files spread across your cluster (the whole point of doing `repartition(1)`). You need to shuffle the data for this either way, so coalescing will not help at all in the bigger picture. – Markus Jun 24 '20 at 17:17
21

If you cannot use spark-csv, you can do the following:

df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")

If you need to handle strings with linebreaks or commas, that will not work. Use this instead:

import csv
import cStringIO

def row2csv(row):
    buffer = cStringIO.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s).encode("utf-8") for s in row])
    buffer.seek(0)
    return buffer.read().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
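
cStringIO exists only in Python 2; a rough Python 3 equivalent of the same idea uses io.StringIO:

import csv
import io

def row2csv(row):
    # serialize one Row to a single CSV-quoted line
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s) for s in row])
    return buffer.getvalue().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")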
jbochi
12

You need to repartition the DataFrame into a single partition and then define the format, path, and other parameters for the file in Unix file-system format, and here you go:

df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')

Read more about the repartition function. Read more about the save function.

However, repartition is a costly function and toPandas() is the worst. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance.

Read more on repartition vs coalesce functions.
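
For example, the same write with coalesce (same hypothetical path as above):

df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')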

Gazal Patel
8

Using PySpark

The easiest way to write to CSV in Spark 3.0+:

sdf.write.csv("/path/to/csv/data.csv")

This can generate multiple files, based on the number of partitions in the dataframe. If you want it in a single file, use repartition:

sdf.repartition(1).write.csv("/path/to/csv/data.csv")

Using Pandas

If your data is not too large and can be held in local Python memory, then you can make use of pandas too:

sdf.toPandas().to_csv("/path/to/csv/data.csv", index=False)

Using Koalas

sdf.to_koalas().to_csv("/path/to/csv/data.csv", index=False)
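
Koalas was later merged into PySpark itself; on Spark 3.2+ an equivalent sketch (assuming the pandas-on-Spark API is available in your version) would be:

sdf.pandas_api().to_csv("/path/to/csv/data.csv")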
s510
4

How about this (in case you don't want a one-liner)?

for row in df.collect():
    d = row.asDict()
    s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
    f.write(s)

f is an opened file descriptor. Also the separator is a TAB char, but it's easy to change to whatever you want.
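
A minimal usage sketch of the same idea, with the file opened on the driver (the column names are hypothetical):

with open("table.tsv", "w") as f:
    for row in df.collect():
        d = row.asDict()
        f.write("%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["other_column"]))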

Matei Florescu
2
I am late to the party, but this will let me rename the file, move it to a desired directory, and delete the unwanted additional directory Spark made:

import glob
import os
import shutil

path = 'test_write'

# write a single csv; Spark creates a directory named `path` containing one part file
students.repartition(1).write.csv(path)

# rename the part file and relocate it next to the working directory as `<path>.csv`
part_file = glob.glob(os.path.join(os.getcwd(), path, '*.csv'))[0]
shutil.move(part_file, os.path.join(os.getcwd(), path + '.csv'))

# remove the additional directory Spark made
shutil.rmtree(os.path.join(os.getcwd(), path))
Aku
1

I used the method with pandas and it gave me horrible performance. In the end it took so long that I stopped it and looked for another method.

If you are looking for a way to write to one CSV instead of multiple CSVs, this is what you are looking for:

df.coalesce(1).write.csv("train_dataset_processed", header=True)

It reduced processing my dataset from 2+ hours to 2 minutes.

dylanvanw
0

Try display(df) (available in Databricks notebooks) and use the download option in the results. Please note: only 1 million rows can be downloaded with this option, but it's really quick.

Megha Jaiswal