
I have some parquet files in my HDFS directory /dir1/dir2/. The file names contain timestamps, but those are pretty random. For example, one file path is /dir1/dir2/2022-06-16-03-12-36-086.snappy.parquet, where 2022, 06, and 16 are the year, month, and day, and 03, 12, 36, and 086 are the hours, minutes, seconds, and milliseconds respectively.

Now I try to read all the files whose timestamps fall between 2022-06-16-04-15-00-000 and 2022-06-16-05-15-00-000 using the following code:

import pandas as pd

# one glob pattern per minute in the requested window
paths = [f'/dir1/dir2/{tm.date()}-{tm.hour:02d}-{tm.minute:02d}-*'
         for tm in pd.date_range('2022-06-16 04:15:00', '2022-06-16 05:15:00', freq='min')]

df = spark.read.parquet(*paths)

Note that a file does not exist for every minute (or second, or millisecond), so some of these glob patterns match nothing. Doing this, I get the following error:

AnalysisException: Path does not exist: /dir1/dir2/2022-06-16-04-23-*.snappy.parquet

To handle this error, I tried adding the following to my Spark configuration:

("spark.sql.files.ignoreMissingFiles", "true")

But the error still persists. How can I read only the paths that do exist, without getting an error for the paths that don't?

  • Can you check if the file exists before reading it? -- https://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-without-exceptions – samkart Jun 16 '22 at 08:39
  • `spark.sql.files.ignoreMissingFiles` works only after the dataframe has been constructed; try following this answer [here](https://stackoverflow.com/a/31784292/9477843) for your case (see the sketch below) – AdibP Jun 17 '22 at 01:34
  • @samkart that answer would check files in the local dir, not in HDFS. – aishik roy chaudhury Jun 17 '22 at 05:19
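
Building on the comments above, one way to do this is to test each glob pattern against HDFS through the Hadoop FileSystem API, which PySpark exposes via its JVM gateway (this is the approach in the answer AdibP links to; note that spark._jvm and spark._jsc are internal PySpark handles, not public API). A minimal sketch, assuming an active SparkSession named spark and the paths list from the question:

from py4j.java_gateway import java_import

# make org.apache.hadoop.fs.Path available on the JVM view
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

# FileSystem built from the session's Hadoop configuration, so it talks to HDFS
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# keep only the patterns that match at least one file; globStatus returns
# an empty (or null) result when a pattern matches nothing
existing = [p for p in paths if fs.globStatus(spark._jvm.Path(p))]

df = spark.read.parquet(*existing)

If the whole window matches nothing, existing is empty and spark.read.parquet(*existing) itself fails, so guarding against an empty list may be worthwhile.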
