
I have many text files stored in S3, almost partitioned, but not quite. I want to read and union all of them. Keys have prefixes like:

s3_keys = ['s3a://prd-bucket//PROD/data/2021-04-16/part-2144.log',
 's3a://prd-bucket//PROD/data/2021-04-16/part-2146.log',
 's3a://prd-bucket//PROD/data/2021-04-16/part-2148.log',
 's3a://prd-bucket//PROD/data/2021-04-16/part-2150.log',
 's3a://prd-bucket//PROD/data/2021-04-16/part-2164.log',
 's3a://prd-bucket//PROD/data/2021-04-16/part-requeue-client-xxx.log'
]

I tried following this answer, but for some strange reason, spark.read.text mangles the filesystem prefix of every path after the first:

df = (
    spark.read.option('mode', 'FAILFAST')
    .text(','.join(s3_keys))
)

Py4JJavaError: An error occurred while calling o40880.text. : org.apache.spark.sql.AnalysisException: Path does not exist: s3a://prd-bucket/PROD/data/2021-04-16/part-2144.log,s3a:/prd-bucket/PROD/data/2021-04-16/part-2146.log ...

Note the s3a:/ instead of s3a://. Why is this happening? I'm also curious whether there is a limit on this sort of path munging, something like AmazonS3Exception: Request-URI Too Large.
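The collapsed scheme looks like path normalization: the comma-joined string is treated as a single path, and repeated slashes get squashed. This is a rough pure-Python analogy using posixpath.normpath, not Spark's actual code path (in the real error Hadoop parses the first URI's scheme before normalizing, which is why only the paths after the first lose a slash):

```python
import posixpath

# Two of the keys from the question, joined the way the failing call does
joined = ','.join([
    's3a://prd-bucket//PROD/data/2021-04-16/part-2144.log',
    's3a://prd-bucket//PROD/data/2021-04-16/part-2146.log',
])

# Normalizing the whole string as ONE path collapses every run of
# slashes -- including the '//' after each scheme -- so 's3a://'
# degrades to 's3a:/', matching the AnalysisException message.
normalized = posixpath.normpath(joined)
print(normalized)
```

The takeaway: once the paths are fused into one string, nothing downstream can tell where one URI ends and the next begins.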

A similar question was asked here, but I need a solution for PySpark (2.4) on S3.

Wassadamo

1 Answer


You are passing multiple paths, separated by commas, as a single string; that's why you get the error below.

Py4JJavaError: An error occurred while calling o40880.text. : org.apache.spark.sql.AnalysisException: Path does not exist: s3a://prd-bucket/PROD/data/2021-04-16/part-2144.log,s3a:/prd-bucket/PROD/data/2021-04-16/part-2146.log ...

Pass multiple paths as varargs or as a list. Try the code below.

df = (
    spark.read.option('mode', 'FAILFAST')
    .text(s3_keys)
)
Srinivas