
My paths are of the format s3://my_bucket/timestamp=yyyy-mm-dd HH:MM:SS/.

E.g. s3://my-bucket/timestamp=2021-12-12 12:19:27/. However, the MM:SS part is not predictable, and I am interested in reading the data for a given hour. I tried the following:

  1. df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:*:*/")
  2. df = spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12:[00,01-59]:[00,01-59]/")

but they give the error pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException.

Silpara

2 Answers


The problem is that your path contains colons (:). Unfortunately, colons in paths are still not supported by the Hadoop filesystem layer Spark reads through; there are long-standing related tickets and mailing-list threads about it.

I think the only way is to rename these files, as sketched below...
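A minimal sketch of such a rename, assuming boto3 with credentials already configured; the bucket name, the prefix, and the choice of replacing : with - are illustrative, not taken from the question:

    # Copy every object under the prefix to a colon-free key, then delete the original.
    # Assumes boto3 credentials are configured; bucket name and the ":" -> "-"
    # convention are examples only.
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="timestamp="):
        for obj in page.get("Contents", []):
            old_key = obj["Key"]
            if ":" not in old_key:
                continue  # already colon-free
            new_key = old_key.replace(":", "-")  # e.g. timestamp=2021-12-12 12-19-27/...
            s3.copy_object(Bucket=bucket, Key=new_key,
                           CopySource={"Bucket": bucket, "Key": old_key})
            s3.delete_object(Bucket=bucket, Key=old_key)

After the rename, a glob such as spark.read.parquet("s3://my-bucket/timestamp=2021-12-12 12-*") should then match the whole hour, since the colons were what broke the URI parsing.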

chehsunliu

If you want performance...

I humbly suggest that when you re-architect this, you don't use S3 file/directory listings to accomplish it. Use a Hive table partitioned by hour instead (or write a job that migrates the data into hourly partitions made of larger files rather than many small ones).

S3 is a wonderful engine for long-term, cheap storage. It is not performant, though, and it is particularly bad at directory listing because of how listing is implemented. (Performance only gets worse when the directories contain many small files.)

To get real performance from your job, you should use a Hive table (partitioned at the hour level, so the file lookups are done in DynamoDB rather than via S3 listings) or some other groomed file structure that reduces the number of files and directory listings required.

You will see a large performance boost if you can restructure your data into bigger files without relying on file listings. A rough sketch of that layout follows.
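A minimal sketch of what that migration can look like, using a Hive-style partitioned parquet layout rather than a registered Hive table; the source path, the event_time column, and the target bucket are assumptions for illustration:

    # One-off migration: rewrite the data into Hive-style dt/hour partitions,
    # then read a single hour via a partition filter instead of path globbing.
    # Source path, column names, and target path are illustrative assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    raw = (spark.read.parquet("s3://my-bucket-renamed/")           # colon-free source
           .withColumn("event_ts", F.to_timestamp("event_time"))   # assumes a timestamp column
           .withColumn("dt", F.to_date("event_ts"))
           .withColumn("hour", F.hour("event_ts")))

    (raw.repartition("dt", "hour")          # fewer, larger files per partition
        .write.mode("overwrite")
        .partitionBy("dt", "hour")
        .parquet("s3://my-bucket-curated/events/"))

    # Reading one hour is now a partition-pruned scan, with no listing of
    # thousands of timestamped prefixes.
    hour_df = (spark.read.parquet("s3://my-bucket-curated/events/")
               .where((F.col("dt") == "2021-12-12") & (F.col("hour") == 12)))

Registering the curated path as a partitioned table in the metastore then gives you the metadata-based partition lookups mentioned above instead of S3 directory scans.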

Matt Andruff