I am new to Apache Spark and PySpark. I have a use case where I need to read multiple files from different folders in S3 and then process their contents in parallel. I have tried various approaches, but I did not understand how to initialize the S3 client inside the lambda body, and I kept hitting the same error: TypeError: can't pickle thread.lock objects. As far as I can tell, this happens because a boto3 client created on the driver holds thread locks that cannot be serialized when Spark pickles the closure and ships it to the executors. How can I read the body of each S3 object and process the files in parallel?
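For context, this is the kind of pattern that raised the error for me, with the client created once on the driver and captured by the lambda (the bucket name and keys_list are placeholders):

import boto3

s3_client = boto3.client('s3')  # created on the driver

# Fails at task serialization time: the closure captures s3_client,
# and boto3 clients contain thread locks that cannot be pickled.
data_rdd = sc.parallelize(keys_list).map(
    lambda key: s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()
)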
Here is the code snippet after editing:
import boto3

def f(key):
    # Create the client inside the function so it is constructed on the
    # executor and never pickled on the driver.
    s3_client = boto3.client('s3')
    body = s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()
    return body

data_rdd = sc.parallelize(keys_list).map(f)
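A related question: would mapPartitions be the better approach here, so that one client is created per partition instead of one per key? A rough sketch of what I have in mind (fetch_partition is just a name I made up; it assumes the executors have AWS credentials available, e.g. via an instance profile):

import boto3

def fetch_partition(keys):
    # One client per partition; it is created on the executor, so
    # nothing unpicklable crosses the driver/executor boundary.
    s3_client = boto3.client('s3')
    for key in keys:
        yield s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()

data_rdd = sc.parallelize(keys_list).mapPartitions(fetch_partition)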