
I have a simple Spark job that reads a file from S3, takes the first five lines, and writes them back to S3. What I see is that there is always an additional file in S3, next to my output "directory", called output_$folder$.

What is it? How can I prevent Spark from creating it? Here is some code to show what I am doing...

# read the input file from S3, take the first five lines,
# and write them back to S3 as a single output file
x = spark.sparkContext.textFile("s3n://.../0000_part_00")
five = x.take(5)
five = spark.sparkContext.parallelize(five)
five.repartition(1).saveAsTextFile("s3n://prod.casumo.stu/dimensions/output/")

After the job I have an S3 "directory" called output which contains the results, and another S3 object called output_$folder$, whose purpose I don't know.

ezamur

3 Answers


Changing S3 paths in the application from s3:// to s3a:// seems to have done the trick for me. The $folder$ files are no longer getting created since I started using s3a://.
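For reference, a minimal sketch of the job from the question rewritten with s3a:// paths (assuming the hadoop-aws and AWS SDK jars are on the classpath and credentials are provided, e.g. via environment variables or an instance profile):

from pyspark.sql import SparkSession

# Rough sketch: same job as in the question, but with s3a:// instead of s3n://
spark = SparkSession.builder.appName("take-five-s3a").getOrCreate()

x = spark.sparkContext.textFile("s3a://.../0000_part_00")
five = spark.sparkContext.parallelize(x.take(5))
five.repartition(1).saveAsTextFile("s3a://prod.casumo.stu/dimensions/output/")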

k0L1081

OK, it seems I found out what it is. It is some kind of marker file, probably used to determine whether the S3 directory object exists or not. How did I reach this conclusion? First, I found this link that shows the source of the

org.apache.hadoop.fs.s3native.NativeS3FileSystem#mkdir

method: http://apache-spark-user-list.1001560.n3.nabble.com/S3-Extra-folder-files-for-every-directory-node-td15078.html

Then I searched other source repositories to see whether I would find a different version of the method. I didn't.

In the end, I did an experiment: I reran the same Spark job after removing the S3 output directory object but leaving the output_$folder$ file. The job failed, saying that the output directory already exists.

My conclusion: this is Hadoop's way of knowing whether a directory with the given name exists in S3, and I will have to live with that.
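If you want to reproduce the experiment, here is a rough sketch using boto3 (my own addition, not part of the original job) that deletes the data objects under the output prefix while leaving the output_$folder$ marker in place:

import boto3

# Delete the output "directory" objects but keep the output_$folder$ marker,
# then rerun the Spark job -- it should fail with "output directory already exists".
s3 = boto3.client("s3")
bucket = "prod.casumo.stu"      # bucket from the question's output path
prefix = "dimensions/output/"   # the output "directory"

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get("Contents", []):
    s3.delete_object(Bucket=bucket, Key=obj["Key"])
# note: the key dimensions/output_$folder$ does not match the prefix above, so it survives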

All of the above happens when I run the job from my local dev machine (i.e. my laptop). If I run the same job from an AWS Data Pipeline, output_$folder$ does not get created.

ezamur

s3n:// and s3a:// don't generate a marker directory like <output>_$folder$.

If you are using Hadoop with AWS EMR, I found moving from s3 to s3n straightforward, since they both use the same file system implementation, whereas s3a involves AWS credential-related code changes.

('fs.s3.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3n.impl', 'com.amazon.ws.emr.hadoop.fs.EmrFileSystem')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
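
To check which implementations are in effect in your own environment, here is a quick sketch (assumes a running PySpark session; the _jsc accessor is internal and may differ between versions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fs-impl-check").getOrCreate()

# Print the effective filesystem implementation for each S3 scheme
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3.impl", "fs.s3n.impl", "fs.s3a.impl"):
    print(key, "=", hadoop_conf.get(key))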
Sairam Krish
    s3n is obsolete and removed from recent hadoop releases. S3A does still create markers, but with a trailing /. It deletes them after adding files underneath; latest builds let you skip that delete (performance), at the cost of incompatibility. – stevel Aug 11 '21 at 11:01
  • @stevel I agree with you. Is there a way to avoid this markers in `s3` ? – Sairam Krish Aug 11 '21 at 12:06
  • s3 is EMR's closed-source FS. If you are a customer of AWS, talk to them directly. – stevel Aug 20 '21 at 13:25