
This answer got me to the point of reading a file on S3, but I have some huge files (CSV/TXT) out there on S3 that I need to randomly sample down to a manageable size for local processing. Since that approach does a full object read, it would blow me out. I would also like to be able to read sequentially, so I can select certain records by field content.

Any ideas?
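One route worth noting: S3 supports HTTP ranged GETs, which boto3 exposes through the `Range` parameter of `get_object`, so you can pull random byte windows without ever downloading the whole object. The sketch below is my own illustration (the function names, chunk sizes, and the bucket/key arguments are placeholders, not anything from an existing answer); it fetches a few random ranges and trims each chunk to a newline so the sampled text starts at a complete CSV record.

```python
import random


def trim_to_record_boundary(chunk: bytes) -> bytes:
    """Drop everything up to and including the first newline: a chunk that
    starts at a random byte offset almost certainly begins mid-record."""
    i = chunk.find(b"\n")
    return chunk[i + 1:] if i != -1 else b""


def sample_random_chunks(bucket, key, chunk_size=64 * 1024, n_chunks=10, seed=None):
    """Fetch n_chunks random byte ranges from an S3 object via ranged GETs,
    so only roughly chunk_size * n_chunks bytes cross the wire."""
    import boto3  # lazy import: assumed installed, with credentials configured

    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    rng = random.Random(seed)
    chunks = []
    for _ in range(n_chunks):
        start = rng.randrange(0, max(1, size - chunk_size))
        end = min(start + chunk_size - 1, size - 1)
        body = s3.get_object(
            Bucket=bucket, Key=key, Range=f"bytes={start}-{end}"
        )["Body"].read()
        chunks.append(trim_to_record_boundary(body))
    return chunks
```

Because each request only transfers `chunk_size` bytes, the sample cost is independent of the full object size, which is the point of avoiding a full read.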

dartdog
  • Unless S3 provides you a way to read random parts of files (you'll have to check what's available in the library you're using) you're SOL. And note that the read speeds over the internet are much lower than you'll get locally, so performance will likely be glacial. – jonrsharpe Apr 26 '16 at 20:56
  • I get that, but I believe (hope) that S3 exposes a sequential read method (not really sure, though), hence the question. For some things speed may not be too much of an issue. – dartdog Apr 26 '16 at 20:59
  • I was hoping to find some way of using something like open_file_chunk_reader(filename, start_byte, size, callback) from boto3? – dartdog Apr 27 '16 at 01:27
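On the sequential side: the body returned by boto3's `get_object` is a botocore `StreamingBody`, which can be iterated line by line with `iter_lines()`, so records can be filtered by field content as they stream in rather than after a full download. A minimal sketch, with the filtering logic split out as a pure function (all names here are my own, hypothetical ones):

```python
from typing import Callable, Iterable, Iterator


def filter_lines(lines: Iterable[str], predicate: Callable[[str], bool]) -> Iterator[str]:
    """Yield only the lines matching predicate; works over any line iterator."""
    return (line for line in lines if predicate(line))


def filter_s3_records(bucket, key, predicate, encoding="utf-8"):
    """Stream an S3 object and yield matching records without a full download.
    Assumes boto3 is installed and credentials are configured as usual."""
    import boto3  # lazy import so the pure helper above has no AWS dependency

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]  # StreamingBody
    decoded = (raw.decode(encoding) for raw in body.iter_lines())
    return filter_lines(decoded, predicate)
```

Usage would look like `filter_s3_records("my-bucket", "big.csv", lambda row: row.split(",")[2] == "NY")` — only one HTTP response is held open, and nothing is buffered beyond botocore's internal read-ahead.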

0 Answers