I have some Parquet files in an HDFS directory /dir1/dir2/. The file names contain timestamps, but those are fairly random. For example, one file path is /dir1/dir2/2022-06-16-03-12-36-086.snappy.parquet, where 2022, 06, and 16 are the year, month, and day respectively, and 03, 12, 36, and 086 are the hours, minutes, seconds, and milliseconds respectively.
Now I try to read all the files whose timestamps fall between 2022-06-16-04-15-00-000 and 2022-06-16-05-15-00-000 using the following code:
import pandas as pd

# Build one glob pattern per minute in the window.
paths = [
    f'/dir1/dir2/{tm.date()}-{tm.hour:02d}-{tm.minute:02d}-*'
    for tm in pd.date_range('2022-06-16 04:15:00',
                            '2022-06-16 05:15:00', freq='min')
]
df = spark.read.parquet(*paths)
Please note that files do not exist for every minute (or second or millisecond) in the window. With this code, I get the following error:
AnalysisException: Path does not exist: /dir1/dir2/2022-06-16-04-23-*.snappy.parquet
To handle this error, I tried adding the following to my Spark configuration:
("spark.sql.files.ignoreMissingFiles", "true")
But the error persists. How can I read only the paths that do exist, without getting an error for the paths that don't?
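One common workaround (a sketch, not the asker's code) is to filter the glob patterns before calling spark.read.parquet, keeping only the patterns that match at least one existing file. The snippet below illustrates the filtering logic with Python's fnmatch against a hypothetical in-memory file listing; on a real cluster the same idea would use a listing from the Hadoop FileSystem API instead.

```python
from fnmatch import fnmatch

import pandas as pd


def matching_patterns(patterns, existing_files):
    """Return only the glob patterns that match at least one existing file."""
    return [p for p in patterns
            if any(fnmatch(f, p) for f in existing_files)]


# Glob patterns for every minute in the window, as in the question.
patterns = [
    f'/dir1/dir2/{tm.date()}-{tm.hour:02d}-{tm.minute:02d}-*'
    for tm in pd.date_range('2022-06-16 04:15:00',
                            '2022-06-16 05:15:00', freq='min')
]

# Hypothetical listing of the files that actually exist in /dir1/dir2/.
existing = [
    '/dir1/dir2/2022-06-16-04-20-11-042.snappy.parquet',
    '/dir1/dir2/2022-06-16-05-01-59-300.snappy.parquet',
]

# Only the patterns covering 04:20 and 05:01 survive, so a subsequent
# spark.read.parquet(*valid) would no longer hit a missing path.
valid = matching_patterns(patterns, existing)
```

The filtered list can then be passed to spark.read.parquet(*valid), which only ever sees paths that resolve to real files.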