What is the best way of reading a bucket with multiple sub-folders in S3? How to parallelize the reads?
-
If you don't need to download the files, you can use boto (`client.get_object()`). If you know your bucket structure, you can have some Celery workers do the reads and return the `StreamingBody` to your Spark context, or you can connect directly to the S3 URL as in [this question](https://stackoverflow.com/questions/32155617/connect-to-s3-data-from-pyspark). I'm not giving you the solution, just some comments on how I would try to tackle one of the 3 questions you ask (a rough sketch follows below the comments) – May 24 '17 at 01:27
-
Did you find any answer to this? – Viv Jun 20 '17 at 10:12
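A minimal sketch of the first comment's idea, with assumptions: it uses boto3 (the current boto client) and a local thread pool in place of Celery workers, and the bucket name and prefix are placeholders. It lists every key under a prefix (covering all sub-folders) and streams each object via `get_object()` without writing to disk.

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

BUCKET = "my-bucket"   # hypothetical bucket name
PREFIX = "data/"       # hypothetical "sub-folder" prefix

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    """Yield every object key under the prefix, across all sub-folders."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def read_object(key):
    """Read one object's bytes from its StreamingBody without downloading to disk."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
    return key, body.read()

keys = list(list_keys(BUCKET, PREFIX))

# Parallelize the reads with a small worker pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    for key, data in pool.map(read_object, keys):
        print(key, len(data))
```

The alternative the comment links to is to skip the client-side reads entirely and let Spark read the `s3a://bucket/prefix/...` paths directly, which parallelizes across the cluster instead of across local workers.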