Imagine that we have a directory structure/partitioning of the data as:
/foo/day=1/lots/of/other/stuff/
/foo/day=2/lots/of/other/stuff/
/foo/day=3/lots/of/other/stuff/
...
/foo/day=25/lots/of/other/stuff/
I want to read only the data for the highest increment of day, here /foo/day=25/lots/of/other/stuff/.
If day is a column in the data, we can do something like:

import org.apache.spark.sql.functions.{col, max}

spark.read.parquet("s3a://foo/day=*/")
  .withColumn("latestDay", max(col("day")).over())
  .filter(col("day") === col("latestDay"))
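(As far as I can tell, the empty over() defines a window over the whole dataset, so Spark warns about moving all rows to a single partition, and every day= subtree still gets scanned before the filter kicks in.)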
Can you propose something smarter, assuming that day is not a column? The data wasn't written using write.partitionBy("day") or similar, and in my case the schemas in the subpaths aren't even necessarily mutually coherent.
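For reference, the manual route would be to list the day= directories with the Hadoop FileSystem API and read only the newest one. A rough sketch (the s3a://foo/ base matches the layout above; the variable names and the use of listStatus are just my illustration):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// List /foo once, pick the highest day=N, and read only that subtree.
val base = "s3a://foo/"
val fs = FileSystem.get(new URI(base), spark.sparkContext.hadoopConfiguration)
val latestDay = fs.listStatus(new Path(base))
  .map(_.getPath.getName)                       // e.g. "day=25"
  .collect { case n if n.startsWith("day=") => n.stripPrefix("day=").toInt }
  .max
val latest = spark.read.parquet(s"${base}day=$latestDay/")

Since this never opens the other subtrees, the incoherent schemas wouldn't matter, but it feels like reimplementing partition pruning by hand.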
Maybe there's a path glob pattern that can do this, or something similar? Or is it performance-wise equivalent to define the day column and hope for predicate pushdown or similar optimisations?
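By "define the day column" I mean deriving it from the file path rather than from the data, roughly as below (input_file_name and regexp_extract are standard Spark SQL functions; this also assumes the subpaths can be read under one schema at all, which in my case they may not):

import org.apache.spark.sql.functions.{col, input_file_name, max, regexp_extract}

// Derive day from each row's file path instead of from the data.
val df = spark.read.parquet("s3a://foo/day=*/")
  .withColumn("day", regexp_extract(input_file_name(), "/day=(\\d+)/", 1).cast("int"))

// Two passes: find the latest day, then keep only its rows.
val latestDay = df.agg(max(col("day"))).first().getInt(0)
val latest = df.filter(col("day") === latestDay)

Even if that works, day isn't a real partition column here, so I'd expect Spark to list and scan every day= subtree to compute the max before it can filter, i.e. no pruning.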