1

I am writing data into an S3 bucket and creating Parquet files using PySpark. My bucket structure looks like below:

s3a://rootfolder/subfolder/table/

The two folders subfolder and table should be created at run time if they do not exist; if they already exist, the Parquet files should go inside the table folder.

When I run the PySpark program from my local machine it creates an extra folder with the suffix _$folder$ (like table_$folder$), but if the same program is run from EMR it creates a _SUCCESS file instead.

Writing into S3 (PySpark program):
 data.write.parquet("s3a://rootfolder/sub_folder/table/", mode="overwrite")

Is there a way to create the folders in S3 only if they do not exist, without creating entries like table_$folder$ or _SUCCESS?

Jay
  • It will throw an error if data already exists. I am asking: if the folder exists, let it be; if it does not exist, it should be created in the S3 bucket. That is happening, but one extra folder with the suffix _$folder$ (like foldername_$folder$) appears if I run the job from my local machine, and if I run it in AWS EMR it gets created with _SUCCESS instead. – Jay Dec 03 '20 at 12:36
  • Note that S3 does not have folders. A "folder" is just a part of the key between two slashes. You cannot create folders in S3. – luk2302 Dec 03 '20 at 12:37
  • @Jay, how have you created the folder? Can you please post the code? – Xi12 Mar 09 '22 at 20:40

2 Answers

1

The s3a connector (org.apache.hadoop.fs.s3a.S3AFileSystem) doesn't create $folder$ files. It generates directory markers as path + /. For example, mkdir s3a://bucket/a/b creates a zero-byte marker object /a/b/. The trailing slash differentiates it from a file, which would have the path /a/b.
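For illustration, a minimal PySpark sketch of creating such a marker through the S3A connector, reaching the Hadoop FileSystem API via Spark's JVM gateway (the bucket name is hypothetical):

    # Sketch: create a directory marker with the S3A connector (hypothetical bucket).
    jvm = spark.sparkContext._jvm
    hconf = spark.sparkContext._jsc.hadoopConfiguration()

    path = jvm.org.apache.hadoop.fs.Path("s3a://some-bucket/a/b")
    fs = path.getFileSystem(hconf)
    fs.mkdirs(path)  # stores a zero-byte object with key "a/b/", not "a/b_$folder$"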

  1. If, locally, you are using an s3n: URL: stop. Use the s3a connector.
  2. If you have been setting the fs.s3a.impl option: stop. Hadoop knows what to use, and it uses the S3AFileSystem class (see the config sketch after this list).
  3. If you are still seeing the markers and you are running on EMR, that's EMR's connector. Closed source, out of scope.
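A minimal sketch of a session builder that follows those rules (the app name is a placeholder). Note there is no fs.s3a.impl override; the hadoop-aws JAR on the classpath must match your Hadoop version, with its matching AWS SDK, to avoid NoSuchMethodError:

    from pyspark.sql import SparkSession

    # Sketch: no fs.s3a.impl override; Hadoop resolves s3a:// to S3AFileSystem.
    spark = (SparkSession.builder
             .appName("s3a-write-example")  # placeholder app name
             .getOrCreate())

    data = spark.range(10)  # stand-in for the asker's DataFrame
    # The subfolder/table/ "folders" are created implicitly by the write.
    data.write.parquet("s3a://rootfolder/subfolder/table/", mode="overwrite")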
stevel
  • I am using the below config with PySpark: return SparkSession \ .builder \ .appName(app_name) \ .config('spark.cassandra.connection.host', 'localhost') \ .config('spark.cassandra.connection.port', '9042') \ .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \ .config("spark.executor.memory", "3g") \ .config("spark.driver.memory", "3g") \ .config("spark.executor.cores","2")\ .config("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")\ – Jay Dec 04 '20 at 11:53
  • If I do not use config("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") it throws the error "java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V" – Jay Dec 04 '20 at 12:05
  • If you want to make $folder$ go away, you get to sort out your classpaths. Read the Hadoop s3a documentation. Changing the s3a impl to NativeS3FileSystem means "I want to use the s3n connector with s3a URLs". – stevel Dec 08 '20 at 12:20
0

Generally, as mentioned in the comments, on S3 everything is either a bucket or an object. The folder structure is a visual representation, not an actual hierarchy like in a traditional filesystem:
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html
For this reason you only have to create the bucket and don't need to create the folders. The write will only fail if the bucket+key combination already exists (and the save mode does not allow overwriting).
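A minimal boto3 sketch of this model (the bucket and key names are hypothetical): putting an object whose key contains slashes is all it takes for the "folders" to show up in the console; there is no separate mkdir call:

    import boto3

    s3 = boto3.client("s3")  # credentials taken from the environment

    # The "subfolder/table/" part is just a key prefix, not a directory.
    s3.put_object(
        Bucket="rootfolder",                       # hypothetical bucket name
        Key="subfolder/table/part-00000.parquet",  # hypothetical object key
        Body=b"...",
    )

    # Listing by prefix is how "folder contents" are emulated.
    resp = s3.list_objects_v2(Bucket="rootfolder", Prefix="subfolder/table/")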

About the _$folder$ files I'm not sure; I haven't seen those. They seem to be created by Hadoop: https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/
See also:
  • Junk Spark output file on S3 with dollar signs
  • How can I configure spark so that it creates "_$folder$" entries in S3?

About the _SUCCESS file: this simply indicates that your job completed successfully. You can disable it with:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
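That snippet is Scala; a PySpark equivalent (a sketch, assuming a SparkSession named spark) sets the same Hadoop property:

    # Set it on the running context's Hadoop configuration:
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    # Or pass it when building the session; spark.hadoop.* keys are
    # copied into the Hadoop configuration:
    # .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")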
Adam Dukkon