
I'm downloading CSV files and processing their contents using Python 3.8.

I hit a memory error when downloading a large file, so I need to download a certain number of rows at a time (let's say 10k rows), process them, and then read the next 10k rows until the entire CSV has been processed. So far, I read the entire CSV and decode it, converting it into a `csv.DictReader` that preserves the headers and the values of each row:

    import csv

    # s3 is a boto3 S3 client; config.BUCKET_NAME and source_file are defined elsewhere
    data = s3.get_object(Bucket=config.BUCKET_NAME, Key=source_file)
    contents = data['Body'].read().decode("utf-8")
    csv_reader = csv.DictReader(contents.splitlines(True))

I've been reading the documentation: `download_fileobj` can download an object in chunks and accepts a callback, but the chunks are split on byte boundaries, and I need to split on row boundaries so that no row is cut in the middle.

I'd prefer not to download the entire file to disk, because I don't have much space and I'd then have to delete the file after processing, so I'm looking for a way to do this directly in RAM (a library, a method, etc.).
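
To make the goal concrete, I was thinking of something along these lines, wrapping the streaming body in a lazy text reader so `DictReader` pulls lines on demand and then taking the rows in 10k batches, but I'm not sure this is the right approach (`process_rows()` is just a placeholder for my real per-batch logic):

    import codecs
    import csv
    from itertools import islice

    # same s3 client, bucket and key as in the snippet above
    data = s3.get_object(Bucket=config.BUCKET_NAME, Key=source_file)

    # decode the streaming body lazily instead of reading it all at once
    text_stream = codecs.getreader("utf-8")(data["Body"])
    csv_reader = csv.DictReader(text_stream)

    while True:
        # take at most 10,000 complete rows per iteration, so no row is split
        batch = list(islice(csv_reader, 10_000))
        if not batch:
            break
        process_rows(batch)  # placeholder for the real per-batch processing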

Ideas?

eduardosufan
  • s3 doesn't understand rows. You will need to handle bytes in your code & re-process any rows split in the middle if you want to do a multipart download (see the ranged-GET sketch below). If you know how big each of your rows is (in bytes) this would be easier. – rdas Sep 20 '22 at 18:48
  • No, row size is variable. Yeah, with a fixed row size, I could use the download_fileobj method directly. – eduardosufan Sep 20 '22 at 18:50
  • What about this answer: https://stackoverflow.com/a/46435402? – ndclt Sep 20 '22 at 18:51
  • 2
    You might want to investigate using [smart-open · PyPI](https://pypi.org/project/smart-open/), which is a "drop-in replacement for Python’s built-in `open()` command". It takes care of all the hard stuff. So, just `import` the library, open the fil eand then read it with `DictReader()` as normal. See how you go! It can even read zip files without unzipping. – John Rotenstein Sep 20 '22 at 22:25
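
For the ranged-GET route rdas describes, the key detail is carrying the trailing partial row of each chunk over into the next request. A rough sketch of how that could look (not taken from the comments verbatim; `process_lines()` is a hypothetical handler, and note the CSV header would only appear in the first chunk):

    import boto3

    s3 = boto3.client("s3")
    CHUNK = 5 * 1024 * 1024  # ~5 MB per ranged GET

    # total object size, so the ranged requests stop at the end of the object
    size = s3.head_object(Bucket=config.BUCKET_NAME, Key=source_file)["ContentLength"]

    leftover = b""
    for start in range(0, size, CHUNK):
        end = min(start + CHUNK, size) - 1
        resp = s3.get_object(
            Bucket=config.BUCKET_NAME,
            Key=source_file,
            Range=f"bytes={start}-{end}",
        )
        buf = leftover + resp["Body"].read()
        # cut at the last newline; the partial trailing row waits for the next chunk
        cut = buf.rfind(b"\n") + 1
        complete, leftover = buf[:cut], buf[cut:]
        if complete:
            process_lines(complete.decode("utf-8"))  # hypothetical per-chunk handler
    if leftover:
        process_lines(leftover.decode("utf-8"))  # last row may have no trailing newline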
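
And a sketch of the smart-open approach from John Rotenstein's comment: `smart_open.open()` returns a file-like object that streams from S3 on demand, so it can be handed to `DictReader` directly and combined with the same 10k-row batching (again, `process_rows()` is a hypothetical placeholder):

    import csv
    from itertools import islice

    from smart_open import open as s3_open  # pip install "smart_open[s3]"

    # streams the object from S3 as it is read; nothing is written to disk
    with s3_open(f"s3://{config.BUCKET_NAME}/{source_file}", "r", encoding="utf-8") as f:
        csv_reader = csv.DictReader(f)
        while True:
            batch = list(islice(csv_reader, 10_000))
            if not batch:
                break
            process_rows(batch)  # hypothetical per-batch processing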
