
I have a bunch of CSV files compressed into one zip on S3. I only need to process one CSV file inside the zip using an AWS Lambda function.

import boto3
from io import BytesIO
from zipfile import ZipFile

BUCKET = 'my-bucket'
s3_rsc = boto3.resource('s3')

def zip_stream(zip_f='app.zip', bkt=BUCKET, rsc=s3_rsc):
    obj = rsc.Object(
        bucket_name=bkt,
        key=zip_f
    )

    # Downloads the entire object body into memory before handing it to ZipFile
    return ZipFile(BytesIO(obj.get()['Body'].read()))


zip_obj = zip_stream()
csv_dat = zip_obj.read('one.csv')

The above snippet works well with test zip files; however, it fails with a MemoryError once the zip file size exceeds 0.5 GB.

Error Message

{ "errorMessage": "", "errorType": "MemoryError", "stackTrace": [ " File "/var/task/lambda_function.py", line 12, in handler\n all_files = files_in_zip()\n", " File "/var/task/lambda_function.py", line 36, in files_in_zip\n zippo = zip_stream()\n", " File "/var/task/lambda_function.py", line 32, in zip_stream\n return ZipFile(BytesIO(obj.get()['Body'].read()))\n", " File "/var/runtime/botocore/response.py", line 77, in read\n chunk = self._raw_stream.read(amt)\n", " File "/var/runtime/urllib3/response.py", line 515, in read\n data = self._fp.read() if not fp_closed else b""\n", " File "/var/lang/lib/python3.8/http/client.py", line 468, in read\n s = self._safe_read(self.length)\n", " File "/var/lang/lib/python3.8/http/client.py", line 609, in _safe_read\n data = self.fp.read(amt)\n" ] }

Is there an option to stream/lazy-load the zip file to mitigate the memory issue?

Note - I also referred to an older post (How can I use boto to stream a file out of Amazon S3 to Rackspace Cloudfiles?) which discusses streaming a file from S3, but not a zip archive.

N Raghu
  • Also consider simply configuring the Lambda function with more RAM. – jarmod Jul 14 '21 at 17:28
  • You might consider using [smart-open](https://pypi.org/project/smart-open/) to wrap the work of streaming data from S3 as needed. – Anon Coward Jul 14 '21 at 18:24
  • @AnonCoward smart-open doesn't seem to wrap/stream zip format files. I tried to wrap io.BufferedReader(response['Body']) but couldn't succeed. Do you have an example of smart-open dealing with zip format files? That would help me. – N Raghu Jul 15 '21 at 09:19
  • @jarmod You can increase the RAM of a Lambda function only up to 10 GB, which is a workaround; however, it looks like an expensive move for me. – N Raghu Jul 15 '21 at 09:22
  • It's not necessarily much more expensive. With more RAM you get correspondingly more CPU and network i/o, so your process may run much faster and hence you will be billed more per ms, but for a shorter duration in total. Maybe try [aws-lambda-power-tuning](https://github.com/alexcasalboni/aws-lambda-power-tuning) to get the best combination. – jarmod Jul 15 '21 at 13:26

1 Answer


Depending on your exact needs, you can use smart-open to handle the reading of the zip file. If you can fit the CSV data in RAM in your Lambda, it's fairly straightforward to call directly:

import csv
import zipfile
from io import TextIOWrapper, BytesIO

from smart_open import smart_open

def lambda_handler(event, context):
    # Simple test: just calculate the sum of the first column of a CSV file in a zip file
    total_sum, row_count = 0, 0
    # Use smart_open to handle the S3 byte-range requests for us
    with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
        # Wrap that in a zip file handler
        zip = zipfile.ZipFile(f)
        # Open a specific CSV file in the zip file
        zf = zip.open("data_101.csv")
        # Read all of the data into memory, and prepare a text IO wrapper to read it row by row
        text = TextIOWrapper(BytesIO(zf.read()))
        # And finally, use Python's csv library to parse the CSV format
        cr = csv.reader(text)
        # Skip the header row
        next(cr)
        # Just loop through each row and add the first column
        for row in cr:
            total_sum += int(row[0])
            row_count += 1

    # And output the results
    print(f"Sum of {row_count} rows for col 0: {total_sum}")

I tested this with a 1 GB zip file containing hundreds of CSV files. The CSV file I picked was around 12 MB uncompressed, or 100,000 rows, so it fit comfortably into RAM in the Lambda environment, even when limited to 128 MB of RAM.

If your CSV file can't be loaded all at once like this, you'll need to take care to read it in sections, buffering the reads so you don't force smart-open to fetch many tiny chunks while you iterate line by line.
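A rough sketch of that streaming approach might look like the following. It reuses the same placeholder bucket and member names as above, and the 1 MB buffer size is an arbitrary choice; the idea is simply to iterate rows through the zip member's file handle instead of calling zf.read(), so only a buffered window of the CSV is held in memory at a time:

import csv
import io
import zipfile

from smart_open import smart_open

def lambda_handler(event, context):
    total_sum, row_count = 0, 0
    with smart_open("s3://example-bucket/many_csvs.zip", "rb") as f:
        with zipfile.ZipFile(f) as zf:
            # Open the member as a file-like object instead of reading it fully
            with zf.open("data_101.csv") as member:
                # Buffer the decompressed stream so reads happen in larger chunks
                buffered = io.BufferedReader(member, buffer_size=1024 * 1024)
                text = io.TextIOWrapper(buffered, encoding="utf-8", newline="")
                cr = csv.reader(text)
                # Skip the header row
                next(cr)
                for row in cr:
                    total_sum += int(row[0])
                    row_count += 1

    print(f"Sum of {row_count} rows for col 0: {total_sum}")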

Anon Coward