I am new to Apache Spark and PySpark. I have a use case where I need to read multiple files from different folders in S3 and then process their contents in parallel. I have tried various approaches, but I did not understand how to initialize the S3 client inside the lambda body, and I kept hitting the same error: TypeError: can't pickle thread.lock objects. As far as I can tell, this happens because a boto3 client created on the driver holds thread locks that cannot be serialized when Spark pickles the closure and ships it to the executors. How can I read the body of each S3 object and process the files in parallel?
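For context, this is the kind of pattern that raised the error for me, with the client created once on the driver and captured by the lambda (the bucket name and keys_list are placeholders):

import boto3

s3_client = boto3.client('s3')  # created on the driver

# Fails at task serialization time: the closure captures s3_client,
# and boto3 clients contain thread locks that cannot be pickled.
data_rdd = sc.parallelize(keys_list).map(
    lambda key: s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()
)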
Here is the code snippet after editing:
import boto3

def f(key):
    # Create the client inside the function so it is constructed on the
    # executor and never pickled on the driver.
    s3_client = boto3.client('s3')
    body = s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()
    return body

data_rdd = sc.parallelize(keys_list).map(f)
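A related question: would mapPartitions be the better approach here, so that one client is created per partition instead of one per key? A rough sketch of what I have in mind (fetch_partition is just a name I made up; it assumes the executors have AWS credentials available, e.g. via an instance profile):

import boto3

def fetch_partition(keys):
    # One client per partition; it is created on the executor, so
    # nothing unpicklable crosses the driver/executor boundary.
    s3_client = boto3.client('s3')
    for key in keys:
        yield s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()

data_rdd = sc.parallelize(keys_list).mapPartitions(fetch_partition)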