
When I write my dataframe to S3 using

df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "gen", "client")
  .option("compression", "gzip")
  .save("s3://xxxx/yyyy")

I get the following in S3

year=2018
year=2019

but I would like to have this instead:

year=2018
year=2018_$folder$
year=2019
year=2019_$folder$

The scripts that read from that S3 location depend on the *_$folder$ entries, but I haven't found a way to configure Spark/Hadoop to generate them.

Any idea which Hadoop or Spark configuration setting controls the generation of *_$folder$ files?

RubenLaguna
  • I have the opposite scenario, where the Spark context in Glue generates those empty folders, and I am trying to figure out how to disable the creation process. Why do you want Spark to generate those folders? Are they useful? For later analytics? Performance? Should I keep them, or is it OK to remove them with a Lambda function? [This](https://stackoverflow.com/questions/65667996/how-to-configure-spark-glue-to-avoid-creation-of-empty-folder-after-glue-j?noredirect=1#comment116124035_65667996) is the issue that I am trying to solve. – Lina Jan 12 '21 at 10:20

1 Answer


Those markers are a legacy feature; I don't think anything creates them any more... though they are often ignored when actually listing directories. (That is, even if they are there, they get stripped from listings and replaced with directory entries.)

stevel
  • I'm using EMR-6.4.0 (Spark 3.1.2) and I'm still suffering from them. To make things even worse, it's not consistent: not all S3 folders have the `*_$folder$` marker, only about 80% of them do – Zach Jul 07 '22 at 14:32
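
In case a concrete illustration helps: below is a minimal workaround sketch, not part of the answer above and not an existing Spark/Hadoop configuration setting. It assumes the downstream scripts only need zero-byte "<dirname>_$folder$" objects to exist next to each partition directory; the helper name addFolderMarkers is purely hypothetical.

import org.apache.hadoop.fs.Path

// Hypothetical helper: after df.write...save("s3://xxxx/yyyy") finishes, walk the
// output tree and create an empty "<dirname>_$folder$" object beside every
// partition directory, mimicking the legacy markers.
def addFolderMarkers(spark: org.apache.spark.sql.SparkSession, root: String): Unit = {
  val rootPath = new Path(root)
  val fs = rootPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

  def walk(dir: Path): Unit = {
    fs.listStatus(dir).filter(_.isDirectory).foreach { status =>
      val marker = new Path(status.getPath.getParent, status.getPath.getName + "_$folder$")
      if (!fs.exists(marker)) fs.create(marker, true).close() // zero-byte marker object
      walk(status.getPath)
    }
  }
  walk(rootPath)
}

// Usage, after the write shown in the question:
// addFolderMarkers(spark, "s3://xxxx/yyyy")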