What is the best way of reading a bucket with multiple sub-folders in S3? How to parallelize the reads?
-
If you don't need to download the files, you can use boto (`client.get_object()`). If you know your bucket structure, you can have some Celery workers do the reads and return the `StreamingBody` to your Spark context, or you can connect directly to the S3 URL as in [this question](https://stackoverflow.com/questions/32155617/connect-to-s3-data-from-pyspark). I'm not giving you the solution, just some comments on how I would try to tackle one of the 3 questions you ask (a rough sketch follows below the comments) – May 24 '17 at 01:27
-
Did you find any answer to this? – Viv Jun 20 '17 at 10:12
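A minimal sketch of the first comment's idea, with assumptions: it uses boto3 (the current boto client) and a local thread pool in place of Celery workers, and the bucket name and prefix are placeholders. It lists every key under a prefix (covering all sub-folders) and streams each object via `get_object()` without writing to disk.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET = "my-bucket"   # hypothetical bucket name
PREFIX = "data/"       # hypothetical "sub-folder" prefix

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    """Yield every object key under the prefix, across all sub-folders."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def read_object(key):
    """Read one object's bytes from its StreamingBody without downloading to disk."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
    return key, body.read()

keys = list(list_keys(BUCKET, PREFIX))

# Parallelize the reads with a small worker pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    for key, data in pool.map(read_object, keys):
        print(key, len(data))
```

The alternative the comment links to is to skip the client-side reads entirely and let Spark read the `s3a://bucket/prefix/...` paths directly, which parallelizes across the cluster instead of across local workers.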