
When I write my dataframe to S3 using

df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "gen", "client")
  .option("compression", "gzip")
  .save("s3://xxxx/yyyy")

I get the following in S3

year=2018
year=2019

but I would like to have this instead:

year=2018
year=2018_$folder$
year=2019
year=2019_$folder$

The scripts that read from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure Spark/Hadoop to generate them.

Any idea which Hadoop or Spark configuration setting controls the generation of *_$folder$ files?

RubenLaguna
  • I have the opposite scenario, where the Spark context in Glue generates those empty folders, and I am trying to figure out how to disable the creation process. Why do you want Spark to generate those folders? Are they useful? For later analytics? Performance? Should I keep them, or is it OK to remove them with a Lambda function? [This](https://stackoverflow.com/questions/65667996/how-to-configure-spark-glue-to-avoid-creation-of-empty-folder-after-glue-j?noredirect=1#comment116124035_65667996) is the issue that I am trying to solve. – Lina Jan 12 '21 at 10:20

1 Answer


Those markers are a legacy feature; I don't think anything creates them any more... though they are often ignored when actually listing directories. (That is, even if they are there, they get stripped from listings and replaced with directory entries.)

stevel
  • I'm using EMR-6.4.0 (Spark 3.1.2) and I'm still suffering from them. To make things even worse, it's not consistent: not all S3 folders have the `*_$folder$` marker, only about 80% of them do – Zach Jul 07 '22 at 14:32
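
In case a concrete illustration helps: below is a minimal workaround sketch, not part of the answer above and not an existing Spark/Hadoop configuration setting. It assumes the downstream scripts only need zero-byte "<dirname>_$folder$" objects to exist next to each partition directory; the helper name addFolderMarkers is purely hypothetical.

import org.apache.hadoop.fs.Path

// Hypothetical helper: after df.write...save("s3://xxxx/yyyy") finishes, walk the
// output tree and create an empty "<dirname>_$folder$" object beside every
// partition directory, mimicking the legacy markers.
def addFolderMarkers(spark: org.apache.spark.sql.SparkSession, root: String): Unit = {
  val rootPath = new Path(root)
  val fs = rootPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

  def walk(dir: Path): Unit = {
    fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
      val marker = new Path(status.getPath.getParent, status.getPath.getName + "_$folder$")
      if (!fs.exists(marker)) fs.create(marker, true).close() // zero-byte marker object
      walk(status.getPath)
    }
  }
  walk(rootPath)
}

// Usage, after the write shown in the question:
// addFolderMarkers(spark, "s3://xxxx/yyyy")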