2

I had a hard time last week getting data out of Spark; in the end, I had to simply go with

df.toPandas().to_csv('mycsv.csv')

from this answer.

I had tested the more native

df.write.csv('mycsv.csv')

for Spark 2.0+, but as per the comment underneath, it drops a set of CSV files instead of one, which then need to be concatenated, whatever that means in this context. It also dropped an empty file into the directory called something like 'success'. The directory was named /mycsv/, but the CSV itself had an unintelligible name made up of a long string of characters.

This was the first I had heard of such a thing. Well, Excel has multiple tabs, which must somehow be reflected in an .xls file, and NumPy arrays can be multidimensional, but I thought a CSV file was just a header row followed by values separated into columns by commas, one row per line.

Another answer suggested:

query.repartition(1).write.csv("cc_out.csv", sep='|')

So this drops just one file plus the blank 'success' file, but the file still does not have the name you want; the directory does.

Does anyone know why Spark is doing this, why it will not simply output a single CSV, how it names the CSV, what that success file is supposed to contain, and whether concatenating CSV files here means joining them vertically, head to tail?

cardamom

3 Answers

9

There are a few reasons why Spark outputs multiple CSVs:
- Spark runs on a distributed cluster. For large datasets, all the data may not fit on a single machine, but it can fit across a cluster of machines. To write one CSV, all the data would have to be moved to and written out by a single machine, which that machine may not be able to handle.
- Spark is designed for speed. If the data lives in 5 partitions across 5 executors, it makes sense to write 5 CSVs in parallel rather than move all the data to a single executor and have it write the entire dataset. A short sketch of this partition-to-file mapping follows.
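
Here is a minimal PySpark sketch of that mapping (the toy DataFrame and output paths are made up for illustration); the number of part files in the output directory matches the number of partitions at write time:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('csv-parts-demo').getOrCreate()

# A toy DataFrame; any DataFrame will do.
df = spark.range(1000)

# Each partition is written by its own task, producing one part file.
print(df.rdd.getNumPartitions())  # e.g. 8 on an 8-core local machine

df.write.csv('parts_demo.csv')  # directory containing ~8 part-* files

# Changing the partition count changes the number of output files.
df.repartition(3).write.csv('three_parts.csv')  # 3 part-* files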

If you need one CSV, my presumption is that your dataset is not super large. My recommendation is to download all the CSV files into a directory, and run cat *.csv > output.csv in the relevant directory. This will join your CSV files head-to-tail. You may need to do more work to strip headers from each part file if you're writing with headers.
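
If you would rather do the concatenation in Python than with cat, here is a minimal sketch (the file names and the header assumption are illustrative) that joins the part files head to tail and keeps only the first header:

import glob

# Collect the part files Spark wrote into its output directory.
part_files = sorted(glob.glob('mycsv.csv/part-*'))

with open('output.csv', 'w') as out:
    for i, path in enumerate(part_files):
        with open(path) as f:
            lines = f.readlines()
        # With header=True, every part file repeats the header;
        # keep it only from the first file.
        out.writelines(lines if i == 0 else lines[1:])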

chris.mclennon
    I read every answer here several times, all helpful, but your two points really address the **why** of it best, hence accepted. Will have to brush up on all these terms - master node, executor, partitions, machines, Hadoop file system for next time I have to use it. – cardamom Sep 18 '17 at 20:20
6

Does anyone know why Spark is doing this, why it will not simply output a single CSV,

Because it is designed for distributed computing, where each chunk of data (a.k.a. partition) is written independently of the others.

how it names the CSV

The name depends on the partition number.
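
For example, in Spark 2.x a two-partition write produces a layout like the following, where <uuid> stands in for the long random job identifier mentioned in the question:

mycsv.csv/
    _SUCCESS
    part-00000-<uuid>-c000.csv
    part-00001-<uuid>-c000.csv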

what that success file is supposed to contain

Nothing. The _SUCCESS marker file is empty; its existence just indicates that the job completed successfully.

6

This basically happens because Spark dumps files based on the number of partitions between which the data is divided. Each partition simply dumps its own file separately. You can use the coalesce option to save them into a single file (see the sketch below). Check this link for more info.
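
A minimal sketch of the coalesce approach (the output path and header option are illustrative):

# Collapse the DataFrame to a single partition before writing,
# so Spark emits exactly one part file inside the output directory.
df.coalesce(1).write.csv('mycsv_single', header=True)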

However, this method has the disadvantage that it needs to collect all the data onto a single node, which must therefore have enough memory to hold it. A workaround for this can be seen in this answer.

This link also sheds some more light on this behavior of Spark:

Spark is like Hadoop - uses Hadoop, in fact - for performing actions like outputting data to HDFS. You'll know what I mean the first time you try to save "all-the-data.csv" and are surprised to find a directory named all-the-data.csv/ containing a 0 byte _SUCCESS file and then several part-0000n files for each partition that took part in the job.

Gambit1614