
I want to save a dataframe to S3, but when I save the file to S3, it creates an empty file named ${folder_name} alongside the folder in which I want to save the file.

Syntax used to save the dataframe:

df.write.parquet("s3n://bucket-name/shri/test")

It saves the data in the test folder, but it also creates an empty $test file under shri.

Is there a way I can save it without creating that extra folder?

Shrikant
  • In order to write one file, you need to use one executor and one reducer, which defeats the purpose of Spark's distributed nature – OneCricketeer Aug 24 '17 at 20:09
  • @cricket_007's comment is sort of right. In order to write one file, you need one partition. You can use Spark's distributed nature and then, right before exporting to CSV, use df.coalesce(1) to return to one partition. To your point, if you use one partition to write out, one executor would be used, which may hinder performance if the amount of data is large. – Tanner Clark Jan 06 '20 at 18:10
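
A minimal sketch of the coalesce(1) approach described in the comment above, assuming a SparkSession named spark and placeholder bucket/path names:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-file-write").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# coalesce(1) collapses the dataframe to one partition, so Spark writes
# a single part file instead of one per partition; all data then flows
# through a single task, which can be slow for large inputs.
df.coalesce(1).write.parquet("s3a://bucket-name/shri/test", mode="overwrite")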

2 Answers


I was able to do it by using the code below:

df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")
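
Note that this writes via the s3a:// connector rather than s3n://, and overwrite is passed as a string. An equivalent formulation uses the mode() setter (other save modes Spark accepts are "append", "ignore", and "error"):

# Equivalent to passing mode="overwrite" as a keyword argument.
df.write.mode("overwrite").parquet("s3a://bucket-name/shri/test.parquet")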
Usman Azhar
  • Thanks Usman for the response. Is there any module which needs to be imported? Because when I tried the same, I am getting an error: Traceback (most recent call last): File "", line 1, in NameError: name 'overwrite' is not defined – Shrikant Aug 28 '17 at 14:57
  • Give the overwrite value in quotes, i.e. mode='overwrite' – Usman Azhar Aug 28 '17 at 22:00
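
For illustration, the NameError in the comment above comes from passing overwrite as a bare Python name instead of a string (placeholder path):

df.write.parquet("s3a://bucket-name/shri/test.parquet", mode=overwrite)    # NameError: name 'overwrite' is not defined
df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")  # correct: mode expects a string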

As far as I know, there is no way to control the naming of the actual parquet files. When you write a dataframe to parquet, you specify the directory name, and Spark creates the appropriate parquet files under that directory.
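
For illustration, a sketch of the resulting layout; the exact part-file names vary by Spark version, and the bucket/path are placeholders:

# You choose only the directory; Spark names the files under it.
df.write.parquet("s3a://bucket-name/shri/test")
# The output directory then contains objects along the lines of:
#   shri/test/_SUCCESS
#   shri/test/part-00000-<uuid>.snappy.parquet
#   shri/test/part-00001-<uuid>.snappy.parquet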

Bob Swain