
I want to read all parquet files from an S3 bucket, including all those in the subdirectories (these are actually prefixes).

Using wildcards (*) in the S3 URL only works for files directly under the specified prefix. For example, this code will only read the parquet files immediately below the target/ folder.

df = spark.read.parquet("s3://bucket/target/*.parquet")
df.show()

Let's say I have a structure like this in my S3 bucket:

"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"

The above code will raise the exception:

pyspark.sql.utils.AnalysisException: 'Path does not exist: s3://mailswitch-extract-underwr-prod/target/*.parquet;'

How can I read all the parquet files from the subdirectories of my S3 bucket?

To run my code, I am using AWS Glue 2.0 with Spark 2.4 and Python 3.

Vincent Claes
  • You only need a `basePath` when you're providing a list of specific files within that path. @Surya Shekhar Chakraborty's answer is what you need. – jayrythium Oct 08 '20 at 15:03
  • Thanks. While digging into this issue, it looks more like a problem of reading S3 "subdirectories" using wildcards. I updated the original question. – Vincent Claes Oct 12 '20 at 07:08

2 Answers


If you want to read all parquet files below the target folder

"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"

you can do:

df = spark.read.parquet("bucket/target/*/*/*/*.parquet")

The downside is that you need to know the depth of your parquet files.
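
If upgrading is an option, Spark 3.0+ removes that downside with the recursiveFileLookup reader option (Glue 2.0 ships Spark 2.4, where it is not available):

df = spark.read.option("recursiveFileLookup", "true").parquet("s3://bucket/target/")

On Spark 2.4 you can instead list the files yourself through the Hadoop FileSystem API and pass the explicit paths to the reader. A minimal sketch, assuming the s3://bucket/target/ prefix from the question and going through the py4j gateway (sc._jvm / sc._jsc), which is a common but unofficial workaround:

sc = spark.sparkContext
# Build a Hadoop Path for the bucket prefix and get its FileSystem.
hadoop_path = sc._jvm.org.apache.hadoop.fs.Path("s3://bucket/target/")
fs = hadoop_path.getFileSystem(sc._jsc.hadoopConfiguration())

# listFiles(path, True) recurses through every "subdirectory" (prefix).
files = []
it = fs.listFiles(hadoop_path, True)
while it.hasNext():
    f = it.next().getPath().toString()
    if f.endswith(".parquet"):
        files.append(f)

df = spark.read.parquet(*files)

Unlike the wildcard approach, this works regardless of how deep the files are nested.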

Vincent Claes

This worked for me:

df = spark.read.parquet("s3://your/path/here/some*wildcard")
  • While investigating this, I found out it only works for files right below the here/ folder, but not for files in subdirectories. I'll update my original question. – Vincent Claes Oct 12 '20 at 07:01