I have an S3 folder structure like this:
bucketname/20211127123456/.parquet files
bucketname/20211127456789/.parquet files
bucketname/20211126123455/.parquet files
bucketname/20211126746352/.parquet files
bucketname/20211124123455/.parquet files
bucketname/20211124746352/.parquet files
Basically, for each day there are two folders, and inside each of them there are multiple parquet files that I want to read.
Say I want to read all files from the folders for 26th and 27th Nov.
Right now I have a boto3 function that gives me a Python list of the complete S3 paths of all parquet files that have 20211126
or 20211127
in the path, and I pass that list to spark.read
. Is there a better way to achieve this?
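For reference, a minimal sketch of my current approach (the helper name and the exact boto3 calls here are illustrative, not my exact code; `bucketname` is from the layout above):

```python
def keys_for_dates(keys, dates):
    """Keep keys whose top-level folder starts with one of the given dates.

    Folders are named <YYYYMMDD><suffix>, e.g. 20211127123456, so a prefix
    match on the first path component selects both folders for a day.
    """
    return [k for k in keys if k.split("/", 1)[0].startswith(tuple(dates))]

# In practice the key list comes from boto3 and the result goes to Spark,
# roughly like this:
#
#   import boto3
#   s3 = boto3.client("s3")
#   keys = [obj["Key"]
#           for page in s3.get_paginator("list_objects_v2").paginate(Bucket="bucketname")
#           for obj in page.get("Contents", [])]
#   paths = ["s3://bucketname/" + k
#            for k in keys_for_dates(keys, ["20211126", "20211127"])]
#   df = spark.read.parquet(*paths)

keys = [
    "20211127123456/part-0.parquet",
    "20211126746352/part-0.parquet",
    "20211124123455/part-0.parquet",
]
print(keys_for_dates(keys, ["20211126", "20211127"]))
# → ['20211127123456/part-0.parquet', '20211126746352/part-0.parquet']
```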