I have many text files stored in S3, almost partitioned but not quite. I want to read them all and union the results. The keys look like:
s3_keys = ['s3a://prd-bucket//PROD/data/2021-04-16/part-2144.log',
           's3a://prd-bucket//PROD/data/2021-04-16/part-2146.log',
           's3a://prd-bucket//PROD/data/2021-04-16/part-2148.log',
           's3a://prd-bucket//PROD/data/2021-04-16/part-2150.log',
           's3a://prd-bucket//PROD/data/2021-04-16/part-2164.log',
           's3a://prd-bucket//PROD/data/2021-04-16/part-requeue-client-xxx.log']
I tried following this answer, but for some strange reason spark.read.text mangles the filesystem scheme prefixed to every path after the first:
df = (
    spark.read.option('mode', 'FAILFAST')
    .text(','.join(s3_keys))
)
Py4JJavaError: An error occurred while calling o40880.text. : org.apache.spark.sql.AnalysisException: Path does not exist: s3a://prd-bucket/PROD/data/2021-04-16/part-2144.log,s3a:/prd-bucket/PROD/data/2021-04-16/part-2146.log ...
Note the s3a:/ instead of s3a://. Why is this happening? I'm also curious whether there is a limit on this sort of path munging, something like AmazonS3Exception: Request-URI Too Large.
A similar question was asked here, but I need a solution for PySpark (2.4) on S3.
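For reference, the shape of solution I'm after is something like the sketch below. I haven't verified either variant myself; it reuses the same spark session and s3_keys list from above, and assumes (per the PySpark docs) that .text() also accepts a list of paths rather than a comma-joined string.

from functools import reduce

# Option 1 (sketch): pass the list of keys directly, avoiding the comma join.
df = spark.read.option('mode', 'FAILFAST').text(s3_keys)

# Option 2 (sketch): read each key separately and union the per-key DataFrames.
dfs = [spark.read.option('mode', 'FAILFAST').text(key) for key in s3_keys]
df = reduce(lambda left, right: left.union(right), dfs)

If either of these sidesteps the scheme mangling, I'd still like to understand why the comma-joined form breaks, and whether the per-key union approach runs into a path-length limit.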