
I want to save a dataframe to S3, but when I save the file to S3, it creates an empty file named ${folder_name} alongside the folder in which I want to save the file.

Syntax used to save the dataframe:

df.write.parquet("s3n://bucket-name/shri/test")

It saves the data in the test folder, but it also creates an empty $test file under shri.

Is there a way I can save it without creating that extra folder?

Shrikant
  • In order to write one file, you need to use one executor and one reducer, which defeats the purpose of Spark's distributed nature – OneCricketeer Aug 24 '17 at 20:09
  • @cricket_007's comment is sort of right. In order to write one file, you need one partition. You can use Spark's distributed nature and then, right before exporting to CSV, use df.coalesce(1) to return to one partition. To your point, if you use one partition to write out, one executor would be used, which may hinder performance if the amount of data is large. – Tanner Clark Jan 06 '20 at 18:10
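
A minimal sketch of the coalesce(1) approach described in the comment above, assuming a SparkSession named spark and placeholder bucket/path names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-file-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# coalesce(1) collapses the dataframe to one partition, so Spark writes
# a single part file instead of one per partition; all data then flows
# through a single task, which can be slow for large inputs.
df.coalesce(1).write.parquet("s3a://bucket-name/shri/test", mode="overwrite")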

2 Answers


I was able to do it by using the code below:

df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")
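
Note that this writes via the s3a:// connector rather than s3n://, and overwrite is passed as a string. An equivalent formulation uses the mode() setter (other save modes Spark accepts are "append", "ignore", and "error"):

# Equivalent to passing mode="overwrite" as a keyword argument.
df.write.mode("overwrite").parquet("s3a://bucket-name/shri/test.parquet")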
Usman Azhar
  • Thanks Usman for the response. Is there any module which needs to be imported? Because when I tried the same, I am getting an error: Traceback (most recent call last): File "", line 1, in NameError: name 'overwrite' is not defined – Shrikant Aug 28 '17 at 14:57
  • Give the overwrite value in quotes, i.e. mode='overwrite' – Usman Azhar Aug 28 '17 at 22:00
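
For illustration, the NameError in the comment above comes from passing overwrite as a bare Python name instead of a string (placeholder path):

df.write.parquet("s3a://bucket-name/shri/test.parquet", mode=overwrite)    # NameError: name 'overwrite' is not defined
df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")  # correct: mode expects a string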

As far as I know, there is no way to control the naming of the actual parquet files. When you write a dataframe to parquet, you specify the directory name, and Spark creates the appropriate parquet files under that directory.
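
For illustration, a sketch of the resulting layout; the exact part-file names vary by Spark version, and the bucket/path are placeholders:

# You choose only the directory; Spark names the files under it.
df.write.parquet("s3a://bucket-name/shri/test")
# The output directory then contains objects along the lines of:
#   shri/test/_SUCCESS
#   shri/test/part-00000-<uuid>.snappy.parquet
#   shri/test/part-00001-<uuid>.snappy.parquet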

Bob Swain