
I want to read all parquet files from an S3 bucket, including all those in the subdirectories (these are actually prefixes).

Using wildcards (*) in the S3 URL only works for files directly under the specified prefix. For example, this code will only read the parquet files immediately below the target/ folder.

df = spark.read.parquet("s3://bucket/target/*.parquet")
df.show()

Let's say I have a structure like this in my S3 bucket:

"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"

The above code will raise the exception:

pyspark.sql.utils.AnalysisException: 'Path does not exist: s3://mailswitch-extract-underwr-prod/target/*.parquet;'

How can I read all the parquet files from the subdirectories of my S3 bucket?

To run my code, I am using AWS Glue 2.0 with Spark 2.4 and Python 3.

Vincent Claes
  • You only need a `basePath` when you're providing a list of specific files within that path. @Surya Shekhar Chakraborty's answer is what you need. – jayrythium Oct 08 '20 at 15:03
  • Thanks. While digging into this issue, it looks more like a problem of reading S3 "subdirectories" using wildcards. I updated the original question. – Vincent Claes Oct 12 '20 at 07:08

2 Answers


If you want to read all parquet files below the target folder

"s3://bucket/target/2020/01/01/some-file.parquet"
"s3://bucket/target/2020/01/02/some-file.parquet"

you can do:

df = spark.read.parquet("bucket/target/*/*/*/*.parquet")

The downside is that you need to know the depth of your parquet files.
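
If upgrading is an option, Spark 3.0+ removes that downside with the recursiveFileLookup reader option (Glue 2.0 ships Spark 2.4, where it is not available):

df = spark.read.option("recursiveFileLookup", "true").parquet("s3://bucket/target/")

On Spark 2.4 you can instead list the files yourself through the Hadoop FileSystem API and pass the explicit paths to the reader. A minimal sketch, assuming the s3://bucket/target/ prefix from the question and going through the py4j gateway (sc._jvm / sc._jsc), which is a common but unofficial workaround:

sc = spark.sparkContext
# Build a Hadoop Path for the bucket prefix and get its FileSystem.
hadoop_path = sc._jvm.org.apache.hadoop.fs.Path("s3://bucket/target/")
fs = hadoop_path.getFileSystem(sc._jsc.hadoopConfiguration())

# listFiles(path, True) recurses through every "subdirectory" (prefix).
files = []
it = fs.listFiles(hadoop_path, True)
while it.hasNext():
    f = it.next().getPath().toString()
    if f.endswith(".parquet"):
        files.append(f)

df = spark.read.parquet(*files)

Unlike the wildcard approach, this works regardless of how deep the files are nested.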

Vincent Claes

This worked for me:

df = spark.read.parquet("s3://your/path/here/some*wildcard")
  • While investigating this, I found out it only works for files right below the here/ folder, but not for files in subdirectories. I'll update my original question. – Vincent Claes Oct 12 '20 at 07:01