
I'm downloading CSV files and processing their contents using Python 3.8.

I hit a memory error when downloading a large file, so I need to download a certain number of rows at a time (let's say 10k rows), process them, and then read the next 10k rows until the entire CSV has been processed. So far, I read the entire CSV and decode it, converting it into a `csv.DictReader` that preserves the headers and the values of each row:

    import csv

    # s3 is a boto3 S3 client; config.BUCKET_NAME and source_file are defined elsewhere
    data = s3.get_object(Bucket=config.BUCKET_NAME, Key=source_file)
    contents = data['Body'].read().decode("utf-8")
    csv_reader = csv.DictReader(contents.splitlines(True))

I've been reading the documentation: `download_fileobj` can download an object in chunks and accepts a callback, but the chunks are split on byte boundaries, and I need to split on row boundaries so that no row is cut in the middle.

I'd prefer not to download the entire file to disk, because I don't have much space and I'd then have to delete the file after processing, so I'm looking for a way to do this directly in RAM (a library, a method, etc.).
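
To make the goal concrete, I was thinking of something along these lines, wrapping the streaming body in a lazy text reader so `DictReader` pulls lines on demand and then taking the rows in 10k batches, but I'm not sure this is the right approach (`process_rows()` is just a placeholder for my real per-batch logic):

    import codecs
    import csv
    from itertools import islice

    # same s3 client, bucket and key as in the snippet above
    data = s3.get_object(Bucket=config.BUCKET_NAME, Key=source_file)

    # decode the streaming body lazily instead of reading it all at once
    text_stream = codecs.getreader("utf-8")(data["Body"])
    csv_reader = csv.DictReader(text_stream)

    while True:
        # take at most 10,000 complete rows per iteration, so no row is split
        batch = list(islice(csv_reader, 10_000))
        if not batch:
            break
        process_rows(batch)  # placeholder for the real per-batch processing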

Ideas?

eduardosufan
  • s3 doesn't understand rows. You will need to handle bytes in your code & re-process any rows split in the middle if you want to do a multipart download (see the ranged-GET sketch below). If you know how big each of your rows is (in bytes) this would be easier. – rdas Sep 20 '22 at 18:48
  • No, row size is variable. Yeah, with a fixed row size, I could use the download_fileobj method directly. – eduardosufan Sep 20 '22 at 18:50
  • What about this answer: https://stackoverflow.com/a/46435402? – ndclt Sep 20 '22 at 18:51
  • 2
    You might want to investigate using [smart-open · PyPI](https://pypi.org/project/smart-open/), which is a "drop-in replacement for Python’s built-in `open()` command". It takes care of all the hard stuff. So, just `import` the library, open the fil eand then read it with `DictReader()` as normal. See how you go! It can even read zip files without unzipping. – John Rotenstein Sep 20 '22 at 22:25
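
For the ranged-GET route rdas describes, the key detail is carrying the trailing partial row of each chunk over into the next request. A rough sketch of how that could look (not taken from the comments verbatim; `process_lines()` is a hypothetical handler, and note the CSV header would only appear in the first chunk):

    import boto3

    s3 = boto3.client("s3")
    CHUNK = 5 * 1024 * 1024  # ~5 MB per ranged GET

    # total object size, so the ranged requests stop at the end of the object
    size = s3.head_object(Bucket=config.BUCKET_NAME, Key=source_file)["ContentLength"]

    leftover = b""
    for start in range(0, size, CHUNK):
        end = min(start + CHUNK, size) - 1
        resp = s3.get_object(
            Bucket=config.BUCKET_NAME,
            Key=source_file,
            Range=f"bytes={start}-{end}",
        )
        buf = leftover + resp["Body"].read()
        # cut at the last newline; the partial trailing row waits for the next chunk
        cut = buf.rfind(b"\n") + 1
        complete, leftover = buf[:cut], buf[cut:]
        if complete:
            process_lines(complete.decode("utf-8"))  # hypothetical per-chunk handler
    if leftover:
        process_lines(leftover.decode("utf-8"))  # last row may have no trailing newline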
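
And a sketch of the smart-open approach from John Rotenstein's comment: `smart_open.open()` returns a file-like object that streams from S3 on demand, so it can be handed to `DictReader` directly and combined with the same 10k-row batching (again, `process_rows()` is a hypothetical placeholder):

    import csv
    from itertools import islice

    from smart_open import open as s3_open  # pip install "smart_open[s3]"

    # streams the object from S3 as it is read; nothing is written to disk
    with s3_open(f"s3://{config.BUCKET_NAME}/{source_file}", "r", encoding="utf-8") as f:
        csv_reader = csv.DictReader(f)
        while True:
            batch = list(islice(csv_reader, 10_000))
            if not batch:
                break
            process_rows(batch)  # hypothetical per-batch processing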
