
I have some parquet files in my HDFS directory /dir1/dir2/. The file names contain timestamps, but those are pretty random. For example, one file path is /dir1/dir2/2022-06-16-03-12-36-086.snappy.parquet, where 2022, 06, and 16 are the year, month, and day, and 03, 12, 36, and 086 are the hours, minutes, seconds, and milliseconds respectively.

Now I try to read all the files whose timestamps fall between 2022-06-16-04-15-00-000 and 2022-06-16-05-15-00-000 using the following code:

import pandas as pd

# one glob pattern per minute in the requested window
paths = [f'/dir1/dir2/{tm.date()}-{tm.hour:02d}-{tm.minute:02d}-*'
         for tm in pd.date_range('2022-06-16 04:15:00', '2022-06-16 05:15:00', freq='min')]

df = spark.read.parquet(*paths)

Note that a file does not exist for every minute (or second, or millisecond), so some of these glob patterns match nothing. Doing this, I get the following error:

AnalysisException: Path does not exist: /dir1/dir2/2022-06-16-04-23-*.snappy.parquet

To handle this error, I tried adding the following to my Spark configuration:

("spark.sql.files.ignoreMissingFiles", "true")

But the error still persists. How can I read only the paths that do exist, without getting an error for the paths that don't?

  • Can you check if the file exists before reading it? -- https://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-without-exceptions – samkart Jun 16 '22 at 08:39
  • `spark.sql.files.ignoreMissingFiles` works only after the dataframe has been constructed; try following this answer [here](https://stackoverflow.com/a/31784292/9477843) for your case (see the sketch below) – AdibP Jun 17 '22 at 01:34
  • @samkart that answer would check files in the local dir, not in HDFS. – aishik roy chaudhury Jun 17 '22 at 05:19
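
Building on the comments above, one way to do this is to test each glob pattern against HDFS through the Hadoop FileSystem API, which PySpark exposes via its JVM gateway (this is the approach in the answer AdibP links to; note that spark._jvm and spark._jsc are internal PySpark handles, not public API). A minimal sketch, assuming an active SparkSession named spark and the paths list from the question:

from py4j.java_gateway import java_import

# make org.apache.hadoop.fs.Path available on the JVM view
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

# FileSystem built from the session's Hadoop configuration, so it talks to HDFS
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# keep only the patterns that match at least one file; globStatus returns
# an empty (or null) result when a pattern matches nothing
existing = [p for p in paths if fs.globStatus(spark._jvm.Path(p))]

df = spark.read.parquet(*existing)

If the whole window matches nothing, existing is empty and spark.read.parquet(*existing) itself fails, so guarding against an empty list may be worthwhile.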
