I have a bunch of CSV files compressed into a single zip archive on S3, and I only need to process one of the CSV files inside the zip from an AWS Lambda function.
import boto3
from io import BytesIO
from zipfile import ZipFile

BUCKET = 'my-bucket'
s3_rsc = boto3.resource('s3')

def zip_stream(zip_f='app.zip', bkt=BUCKET, rsc=s3_rsc):
    # Download the whole zip object into memory and open it as a ZipFile
    obj = rsc.Object(bucket_name=bkt, key=zip_f)
    return ZipFile(BytesIO(obj.get()['Body'].read()))

zip_obj = zip_stream()
csv_dat = zip_obj.read('one.csv')
The above snippet works fine with small test zip files; however, it fails with a MemoryError once the zip file size exceeds roughly 0.5 GB.
Error Message
{ "errorMessage": "", "errorType": "MemoryError", "stackTrace": [ " File "/var/task/lambda_function.py", line 12, in handler\n all_files = files_in_zip()\n", " File "/var/task/lambda_function.py", line 36, in files_in_zip\n zippo = zip_stream()\n", " File "/var/task/lambda_function.py", line 32, in zip_stream\n return ZipFile(BytesIO(obj.get()['Body'].read()))\n", " File "/var/runtime/botocore/response.py", line 77, in read\n chunk = self._raw_stream.read(amt)\n", " File "/var/runtime/urllib3/response.py", line 515, in read\n data = self._fp.read() if not fp_closed else b""\n", " File "/var/lang/lib/python3.8/http/client.py", line 468, in read\n s = self._safe_read(self.length)\n", " File "/var/lang/lib/python3.8/http/client.py", line 609, in _safe_read\n data = self.fp.read(amt)\n" ] }
Is there a way to stream or lazily load the zip file to mitigate the memory issue?
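What I have in mind is something along the lines of the sketch below: a seekable, read-only file object that serves ZipFile from S3 ranged GETs on demand, so that only the central directory and the compressed bytes of one.csv are transferred instead of the whole archive. This is a rough, untested sketch of the idea; S3RangedReader is a name I made up, not an existing boto3 or zipfile API, and I haven't verified it end to end.

import io
import boto3
from zipfile import ZipFile

BUCKET = 'my-bucket'

class S3RangedReader(io.RawIOBase):
    # Hypothetical seekable, read-only file object backed by S3 ranged GETs.
    def __init__(self, bucket, key, client=None):
        self._client = client or boto3.client('s3')
        self._bucket = bucket
        self._key = key
        # Object size is needed so ZipFile can seek relative to the end.
        self._size = self._client.head_object(Bucket=bucket, Key=key)['ContentLength']
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def readinto(self, b):
        # Fetch at most len(b) bytes from the current position via a ranged GET.
        if self._pos >= self._size:
            return 0
        end = min(self._pos + len(b), self._size) - 1
        rng = 'bytes={}-{}'.format(self._pos, end)
        data = self._client.get_object(
            Bucket=self._bucket, Key=self._key, Range=rng)['Body'].read()
        b[:len(data)] = data
        self._pos += len(data)
        return len(data)

# Buffered to avoid one GET per tiny header read; ZipFile should then only
# pull the central directory plus the compressed bytes of 'one.csv'.
with ZipFile(io.BufferedReader(S3RangedReader(BUCKET, 'app.zip'))) as zf:
    with zf.open('one.csv') as f:   # streams and decompresses the member lazily
        for line in f:
            pass  # process each CSV row here

Even if something like this works, I'm not sure how many ranged requests ZipFile would end up issuing, which is why I'm asking whether there is an established way to do this.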
Note - I also came across an older post (How can I use boto to stream a file out of Amazon S3 to Rackspace Cloudfiles?) that discusses streaming a plain file from S3, but not a zip archive.