I am new to Apache Spark and PySpark. I have a use case where I need to read multiple files from different folders in S3 and then process their contents in parallel. I have tried several approaches, and this is one of them. I did not understand how to initialize the S3 client inside the lambda body, and I keep hitting the same error: TypeError: can't pickle thread.lock objects. How can I process the S3 files in parallel and read the body of each object?

Here is the code snippet after editing.

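import boto3

def f(key):
    # Create the client inside the function so it is built on the worker, not pickled from the driver.
    s3_client = boto3.client('s3')
    body = s3_client.get_object(Bucket='bucket', Key=key)['Body'].read()
    return body

# sc is the SparkContext; keys_list is the list of S3 object keys.
data_rdd = sc.parallelize(keys_list).map(f)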
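A possible variation on the snippet above, offered as a sketch rather than a confirmed fix: the pickle error usually appears when a boto3 client created on the driver is captured by the function Spark ships to the workers, so the client is built inside the worker code instead, once per partition rather than once per key. The names `fetch_partition`, `default_client_factory`, and the `client_factory` parameter are hypothetical helpers introduced for illustration; the bucket name `'bucket'` is the placeholder from the question.

```python
from typing import Any, Callable, Iterable, Iterator, Tuple


def default_client_factory() -> Any:
    """Build the S3 client on the worker (hypothetical helper)."""
    import boto3  # imported lazily, so no client object is ever pickled
    return boto3.client("s3")


def fetch_partition(
    keys: Iterable[str],
    bucket: str = "bucket",
    client_factory: Callable[[], Any] = default_client_factory,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (key, body) pairs for one partition, reusing a single client."""
    client = client_factory()  # created once per partition, on the worker
    for key in keys:
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        yield key, body


# Usage with Spark, assuming an existing SparkContext `sc` and a list `keys_list`:
# data_rdd = sc.parallelize(keys_list).mapPartitions(fetch_partition)
```

Keeping the key next to the body makes it easier to trace a record back to its file later. The `client_factory` argument exists only so the function can be exercised without AWS credentials.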
Tonechas
ZZzzZZzz
  • @eliasah That link doesn't say much about implementation; rather, it has info about what the other person did. It did not answer my question. – ZZzzZZzz Jun 17 '17 at 19:49
  • If I may ask, what is it you are trying to read? Can you describe your use case with a small example? I have a feeling that you are trying to read something the wrong way to start with. I mean, after all, S3 objects are mainly regular files. – eliasah Jun 18 '17 at 09:49
  • @eliasah I was able to remove the pickle error. Now I cannot convert data_rdd to a DataFrame or process the records. – ZZzzZZzz Jun 18 '17 at 14:28
  • You still haven't answered my question – eliasah Jun 18 '17 at 14:40
  • I have a csv file and I want to read multiple such files. I am fetching the keys from s3 and trying to parallelize the list of keys to my function which returns the body. Once I have the rdd ready, I then want to create a data frame. – ZZzzZZzz Jun 18 '17 at 14:44
  • This sounds a little absurd. Spark parallelizes reads if you just give it the paths of the files. How are your file paths structured? – eliasah Jun 18 '17 at 15:07
  • They are csv files with line by line data. – ZZzzZZzz Jun 18 '17 at 15:07
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147000/discussion-between-eliasah-and-zzz). – eliasah Jun 18 '17 at 15:20
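The suggestion in the comments, handing Spark the file paths and letting it parallelize the reads, might look roughly like the sketch below. It assumes a SparkSession named `spark` already exists and that the cluster has an S3 connector configured for the `s3a://` scheme; neither is stated in the question. The helpers `s3_paths` and `read_csv_keys` are hypothetical names introduced for illustration.

```python
from typing import Any, List, Sequence


def s3_paths(bucket: str, keys: Sequence[str], scheme: str = "s3a") -> List[str]:
    """Turn S3 object keys into full paths that Spark can read (hypothetical helper)."""
    return [f"{scheme}://{bucket}/{key.lstrip('/')}" for key in keys]


def read_csv_keys(spark: Any, bucket: str, keys: Sequence[str]) -> Any:
    """Read all the CSV files into one DataFrame; Spark distributes the reads itself."""
    # DataFrameReader.csv accepts a list of paths as well as a single path.
    return spark.read.csv(s3_paths(bucket, keys))


# Usage, assuming `spark` is a SparkSession and `keys_list` holds the object keys:
# df = read_csv_keys(spark, "bucket", keys_list)
```

With this approach there is no boto3 client in the Spark job at all, so nothing unpicklable can be captured, and the result is already a DataFrame.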

0 Answers