I am aware that there have been questions about wildcards in pySpark's .load() function, like here or here.
However, none of the questions/answers I found deals with my variation of it.
Context
In pySpark I want to load files directly from HDFS because I have to use the Databricks Avro library for Spark 2.3.x. I do it like this:
partition_stamp = "202104"

df = spark.read.format("com.databricks.spark.avro") \
    .load(f"/path/partition={partition_stamp}*") \
    .select("...")
As you can see, the partitions derive from timestamps in the format yyyyMMdd.
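For illustration, with this layout the wildcard above would match daily partition directories such as these (the stamps are hypothetical, not my actual data):

    /path/partition=20210401/
    /path/partition=20210402/
    ...
    /path/partition=20210430/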
Question
Currently I only get the partitions for April 2021 (partition_stamp = "202104").
However, I need all partitions starting from April 2021.
Written in pseudo-code, I'd need a solution like this:
.load(f"/path/partition >= {partition_stamp}*")
Since several hundred partitions actually exist, any approach that requires hard-coding them is out of the question.
So my question is: Is there a function for conditional file-loading?
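For what it's worth, here is a sketch of what I imagine could work, assuming Spark's partition discovery picks up partition as a column when the parent directory is loaded (the full-date stamp below is an assumption for illustration, not my actual setup):

    from pyspark.sql.functions import col

    partition_stamp = "20210401"  # hypothetical: earliest partition to keep (yyyyMMdd)

    # Load the parent directory so that Spark discovers `partition` as a
    # column from the directory names, then filter on that column.
    # Partition pruning should then skip directories that fail the predicate.
    df = spark.read.format("com.databricks.spark.avro") \
        .load("/path") \
        .where(col("partition") >= int(partition_stamp)) \
        .select("...")

Alternatively, since .load() also accepts a list of paths, the matching partition directories could be listed programmatically (for example via the Hadoop FileSystem API) and passed in as a list. I don't know whether either of these is the idiomatic way, hence the question above.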