
I use the following code to save data to the local disk:

receiptR.write.format('com.databricks.spark.csv').save('file:/mnt/dump/gp')

But I get the following directory structure:

[hadoop@ip-172-31-16-209 ~]$ cd /mnt/dump
[hadoop@ip-172-31-16-209 dump]$ ls -R
.:
gp

./gp:
_temporary

./gp/_temporary:
0

./gp/_temporary/0:
task_201610061116_0000_m_000000  _temporary

./gp/_temporary/0/task_201610061116_0000_m_000000:
part-00000

How can I save the data in the following structure instead?

/mnt/dump/gp/
part-00000

1 Answer


Spark writes one output file per partition. So if you were to view your data on its own, you'd see this:

rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9], 4)  # create an RDD with 4 partitions
rdd.collect()
--> [1, 2, 3, 4, 5, 6, 7, 8, 9]

and if you view it with partitions visible:

rdd.glom().collect() 
--> [[1, 2], [3, 4], [5, 6], [7, 8, 9]]

So when you save it, the output will be written in 4 separate pieces, one file per partition.
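For example (a minimal sketch; the output path here is hypothetical), saving this 4-partition RDD produces one part file per partition:

rdd.saveAsTextFile('file:/mnt/dump/example')  # hypothetical output path
# /mnt/dump/example/ now contains:
# _SUCCESS  part-00000  part-00001  part-00002  part-00003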

As others have suggested in similar questions (e.g. how to make saveAsTextFile NOT split output into multiple file?), you can coalesce the dataset down to a single partition and then save:

rdd.coalesce(1, True).saveAsTextFile("s3://myBucket/path/to/file.txt")

However, a warning: the reason Spark spreads data across multiple partitions in the first place is that, for very large datasets, each node only has to handle a smaller chunk of the data. When you coalesce down to 1 partition, you force the entire dataset onto a single node. If you don't have the memory available for that, you'll get into trouble. Source: NullPointerException in Spark RDD map when submitted as a spark job
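Applied to the DataFrame write from the question, the same idea looks roughly like this (a sketch, assuming receiptR is the DataFrame being saved):

# Coalesce to a single partition before writing, so the output
# directory ends up with just one part-00000 file.
receiptR.coalesce(1) \
    .write.format('com.databricks.spark.csv') \
    .save('file:/mnt/dump/gp')

Once the job commits successfully, the _temporary directory is cleaned up and the part file is moved up into /mnt/dump/gp/.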

Kristian