15

I use this method to write a CSV file, but it generates a directory with multiple part files. That is not what I want; I need everything in one file. I also found another post that uses Scala to force everything to be computed on one partition, which then produces a single file.

First question: how can I achieve this in Python?

The second post also says a Hadoop function could merge the multiple files into one.

Second question: is it possible to merge the files in Spark?

sydridgm

5 Answers

34

You can use,

df.coalesce(1).write.csv('result.csv')

Note: when you use the coalesce function you lose parallelism, because all of the data is brought into a single partition.
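For example, a minimal sketch (assuming df is an existing DataFrame and you also want a header row; note the output path is still a directory containing one part file):

# Minimal sketch, assuming `df` already exists.
# coalesce(1) brings all rows into one partition, so a single part file is
# written, but 'result.csv' is still a directory holding that part file.
df.coalesce(1) \
    .write \
    .option('header', True) \
    .mode('overwrite') \
    .csv('result.csv')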

Mohamed Thasin ah
6

You can do this using the cat command-line utility as below. It concatenates all of the part files into one CSV. There is no need to repartition down to one partition.

import os
test.write.csv('output/test')  # Spark writes a directory of part files under output/test
os.system("cat output/test/p* > output/test.csv")  # concatenate the parts into one CSV
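If the CSV was written with a header, each part file carries its own header line. A Python-only variant (illustrative, assuming the same paths as above) skips the repeated headers while merging:

import glob

# Illustrative sketch: merge the part files in Python, keeping the header
# line from the first part only. Assumes the data was written with
# .option('header', True).
with open('output/test.csv', 'w') as merged:
    for i, part in enumerate(sorted(glob.glob('output/test/part-*'))):
        with open(part) as f:
            lines = f.readlines()
        merged.writelines(lines if i == 0 else lines[1:])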
David
  • I think this will not help when the job runs in cluster mode; there will still be different files on each executor. – RockStar Oct 31 '18 at 17:15
  • This will not work on cloud blobs such as AWS S3. https://stackoverflow.com/a/32467122/1001015 – SantiagoRodriguez May 05 '20 at 11:30
  • 1
    Watch out that if you choose to save the header this saves it for all the parts, so when you concatenate them together you will have headers that are now part of the data. – Nic Scozzaro Mar 19 '21 at 03:19
1

The requirement is to save an RDD to a single CSV file by bringing the whole RDD to one executor, which means the RDD partitions spread across executors are shuffled onto a single executor. We can use coalesce(1) or repartition(1) for this. In addition, we can add a column header to the resulting CSV file. First, define a utility function to make the data CSV-compatible.

def toCSVLine(data):
    return ','.join(str(d) for d in data)

Suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. I create a header RDD and union it with MyRDD as below, which usually keeps the header at the top of the CSV file.

unionHeaderRDD = sc.parallelize( [( 'ID','DT_KEY','Grade','Score','TRF_Age' )])\
   .union( MyRDD )

unionHeaderRDD.coalesce( 1 ).map( toCSVLine ).saveAsTextFile("MyFileLocation" )

The saveAsPickleFile RDD API method can be used to serialize the saved data in order to save space. Use sc.pickleFile to read the pickled file back.
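For instance, a small sketch of the round trip (the path name is illustrative):

unionHeaderRDD.coalesce(1).saveAsPickleFile("MyPickledLocation")  # serialize instead of plain text
restoredRDD = sc.pickleFile("MyPickledLocation")                  # read the pickled RDD back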

AK Gangoni
1

I needed my CSV output in a single file, with headers, saved to an S3 bucket with the filename I provided. The currently accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one CSV file (due to coalesce(1)) with a random name and no headers.

I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected.

my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)
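Keep in mind that toPandas() collects the whole DataFrame into driver memory, so this approach only suits data small enough to fit on the driver.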
Chappy Hickens
0

I found this solution:

df.coalesce(1).write.mode('overwrite').csv('test.csv')


from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

# Get the filesystem that the cluster's Hadoop configuration points at
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# Locate the single part file, rename it, then drop the now-empty directory
file = fs.globStatus(spark._jvm.Path('test.csv/part*'))[0].getPath().getName()
fs.rename(spark._jvm.Path('test.csv/' + file), spark._jvm.Path('test2.csv'))
fs.delete(spark._jvm.Path('test.csv'), True)
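Because this goes through the Hadoop FileSystem API, the rename happens on whatever filesystem the Spark session is configured for (HDFS, local, etc.) without pulling the data back to the driver.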