I have a simple Glue ETL job that is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?
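For reference, the job is roughly the following (a simplified sketch rather than the exact script; the database, table and bucket names below are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler created (placeholder database/table names)
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Drop duplicate rows and write the result back to S3 (placeholder output path)
deduped = dyf.toDF().dropDuplicates()
deduped.write.mode("overwrite").parquet("s3://my-bucket/deduplicated/")

job.commit()
```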
- According to [this](https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/), it's caused by Hadoop. I guess you can use a Lambda function to delete `$folder$` on the S3 object-creation event (see the sketch after these comments). – Achyut Vyas Jan 12 '21 at 06:56
- Thanks for your comment @AchyutVyas. I would prefer to avoid manual deletion... The strange thing is that those `$folder$`s are not always created... I suspect that when I trigger the Glue job manually it does not create the folder, but when I use the workflow those folders are created. Not sure though! Have to test again. Will update the question shortly... – Lina Jan 12 '21 at 07:35
- Hey @Lina, using Lambda to delete `$folder$` is not manual deletion. Will you please also test whether, after deleting `$folder$`, it gets created again using the same method of job trigger? – Achyut Vyas Jan 12 '21 at 14:26
- Thanks @AchyutVyas. By "manual" I mean doing extra actions to hide/delete the folder. I would prefer to configure Spark in a way that it will not generate the folder at all. I was testing this locally and found an interesting thing: if I use the [AWS Glue lib](https://learning.tusharsarde.com/2019/11/run-aws-glue-job-in-pycharm-community-edition.html) to run Glue jobs locally, it does not create `$folder$` in the cloud S3 bucket. I tried to use the same Glue version and disable the bookmark in the cloud job, but the cloud job still creates those folders. Still testing to see what the difference is. – Lina Jan 13 '21 at 07:54
- @AchyutVyas Found the answer. Please see my answer below. – Lina Jan 15 '21 at 11:44
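For completeness, the Lambda-based cleanup suggested in the first comment could look roughly like the sketch below. This is an assumption, not code from the thread: it presumes the function is subscribed to the bucket's `ObjectCreated` events and that the markers end with the usual `_$folder$` suffix.

```python
# Hypothetical cleanup Lambda: delete Hadoop's "_$folder$" marker objects
# as soon as they are created (triggered by S3 ObjectCreated notifications).
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.endswith("_$folder$"):
            # Remove the empty marker object that Hadoop left behind
            s3.delete_object(Bucket=bucket, Key=key)
```

In practice you would also put a suffix filter (`_$folder$`) on the S3 event notification so the function only runs for the marker objects.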
1 Answer
Ok, finally, after a few days of testing I found the solution. Before pasting the code, let me summarize what I have found:
- Those `$folder$` objects are created by Hadoop: Apache Hadoop creates these files when it creates a folder in an S3 bucket (Source 1). They are actually directory markers, i.e. path + `/` (Source 2).
- To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and this.
- Read about S3, S3A and S3N here and here.
- Thanks to @stevel's comment here.

Now the solution is to set the following configuration in the Spark context's Hadoop configuration:
from pyspark.context import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Map the s3:// scheme to the S3A filesystem implementation
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
To avoid the creation of the `_SUCCESS` files, you need to set the following configuration as well:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use the s3:// URI when writing to the S3 bucket, for example:
myDF.write.mode("overwrite").parquet("s3://XXX/YY", partitionBy=["DDD"])
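For reference, here is roughly where those settings sit inside a Glue script (a sketch that assumes the usual GlueContext boilerplate; the point is simply that the Hadoop configuration is applied before anything is written to S3):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()

# Apply the Hadoop S3 settings before any write to S3 happens
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

glueContext = GlueContext(sc)
spark = glueContext.spark_session
```

With the first setting, `s3://` paths are handled by the S3A connector instead of the default implementation, so the `_$folder$` markers are no longer written when directories are created.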

– Lina
- Thanks a bunch!! This is also great to set if you're looking to give your Glue jobs least-privilege roles, i.e. access to a particular path in S3 rather than full bucket access! I was getting permission denied and it was all to do with trying to write these `_$folder$` files! – Jack Apr 09 '22 at 00:56