
I'm trying to read a very big file from s3 using...

import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/filename', chunksize=100000)

But even after giving the chunk size, it is taking forever. Does the chunksize option work when fetching a file from S3? If not, is there a better way to load big files from S3?

    Does [this](https://stackoverflow.com/questions/55396938/how-to-read-only-5-records-from-s3-bucket-and-return-it-without-getting-all-data/55397464) help? – jellycsc Feb 25 '21 at 19:47
  • yeah. thinking Dask is a good option – Xion Feb 25 '21 at 20:17
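Following up on the Dask suggestion in the comments, a minimal sketch (the bucket/file path is a placeholder, and dask plus s3fs are assumed to be installed):

import dask.dataframe as dd

# lazily reads the CSV from S3 in partitions instead of loading it all at once
ddf = dd.read_csv('s3://bucket-name/filename')
print(ddf.head())  # only computes the first partition to build a small preview
print(len(ddf))    # triggers a full pass over the data to count rows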

1 Answer


The [pandas read_csv documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) clearly says:

filepath_or_buffer : str, path object or file-like object

Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

If you want to pass in a path object, pandas accepts any os.PathLike.

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
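Since the question already imports s3fs, one way to use the file-like branch above is to open the S3 object explicitly and hand the handle to read_csv. A minimal sketch, assuming default AWS credentials and the question's placeholder path:

import pandas as pd
import s3fs

# fs.open returns a file-like object with a read() method, which read_csv accepts
fs = s3fs.S3FileSystem()
with fs.open('s3://bucket-name/filename', 'rb') as f:
    df = pd.read_csv(f, nrows=1000)  # nrows keeps this illustration small; the chunked approach below handles the full file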

When reading in chunks, pandas returns an iterator object (a TextFileReader) that you need to iterate through. Something like:

for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    process(df)  # placeholder: do your per-chunk processing here

And if you think it's because the chunksize is large, you can try just the first chunk with a small chunksize, like this:

for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
    print(df.head())
    break
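As a side note (not part of the answer above), if the end goal is a single, smaller DataFrame, a common pattern is to filter or aggregate each chunk and concatenate the results. A minimal sketch, where the column name and filter condition are placeholder assumptions:

import pandas as pd

filtered_chunks = []
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    # keep only the rows you need from each chunk; 'some_column' is hypothetical
    filtered_chunks.append(df[df['some_column'] > 0])
result = pd.concat(filtered_chunks, ignore_index=True)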
ThePyGuy
    This answer helped me. I wrote a function whereby I know the Athena results S3 location as bucket and key. Then I process the massive Athena result csv by chunks: `def process_result_s3_chunks(bucket, key, chunksize): csv_obj = s3.get_object(Bucket=bucket, Key=key) body = csv_obj['Body'] for df in pd.read_csv(body, chunksize=chunksize): process(df) ` – nom-mon-ir Apr 22 '21 at 07:20