What is the best way to read a bucket with multiple sub-folders in S3? How do I parallelize the reads?

Arun
  • If you don't need to download the files, you can opt for using boto (`client.get_object()`). If you know your bucket structure, you can have some Celery workers do the reads and return the `StreamingBody` to your Spark context, or you can connect directly to the S3 URL as in [this question](https://stackoverflow.com/questions/32155617/connect-to-s3-data-from-pyspark). I'm not giving you the solution, just some comments on how I would try to tackle the questions you ask (see the sketch below the comments). – May 24 '17 at 01:27
  • Did you find any answer to this? – Viv Jun 20 '17 at 10:12
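
To make the comment above concrete, here is a minimal sketch of the boto approach, assuming boto3, a hypothetical bucket name `my-bucket`, and a plain thread pool standing in for the Celery workers (the list-then-read split is the same either way):

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # hypothetical bucket name

def list_keys(bucket, prefix=""):
    """Walk every 'sub-folder' via the paginator so no keys are missed."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def read_key(key):
    """Read one object in memory without downloading it to disk."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]  # StreamingBody
    return key, body.read()

# The reads are I/O bound, so a thread pool parallelizes them well.
with ThreadPoolExecutor(max_workers=16) as pool:
    for key, data in pool.map(read_key, list_keys(BUCKET)):
        print(key, len(data))

# Alternatively, Spark can read all sub-folders directly with a glob path,
# e.g. spark.read.text("s3a://my-bucket/*/*"), as in the linked question.
```

With Spark, the glob path usually wins because Spark distributes the reads across executors itself; the boto3 approach is more useful when you want the bytes outside of a Spark job.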

0 Answers