
I'm working on a project that receives emails via AWS SES containing 7-Zip archives, which I want to extract with Lambda and load back into S3. The attachment extraction part is finished, but py7zr will not extract any 7-Zip archive I give it in Lambda. On my local machine I have tested 3-4 different 7-Zip archives and they all extract fine, but on AWS Lambda none of them has extracted successfully so far. Every extraction gives me a CrcError like this:

raise CrcError(crc32, f.crc32, f.filename)
py7zr.exceptions.CrcError: (2247949274, 3091406592, 'a.csv')

It appears to me that py7zr is expecting the decompressed file to be smaller than it actually is. I packaged py7zr with Docker into its own Lambda layer, so I'm wondering whether that could be causing the issue. Is it possible for py7zr to behave differently on Lambda, and if so, what might be the cause? Could Lambda be using something different from my local machine to decompress?
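One way to narrow down whether the layer differs from the local install is to log the same environment details in both places and compare them. The following is a minimal sketch, not a confirmed diagnosis; `environment_report` is a hypothetical helper name, and the default package list is only an assumption (any other packages bundled in the layer can be passed in as well).

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("py7zr",)):
    """Collect interpreter, platform, and package-version details."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "machine": platform.machine(),  # CPU architecture of the runtime
        "system": platform.system(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Run this locally and inside the Lambda handler, then compare the output.
    print(environment_report())
```

If the Python version, architecture, or package versions differ between the two outputs, the layer may have been built for a different target than the Lambda runtime.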

For this example I've put together a small test 7-Zip file that contains two CSVs, a.csv and b.csv; each contains only two rows (a header and one data row). My goal is to download this file from S3 to AWS Lambda, extract it, and re-upload the extracted files to S3.

Here is a snippet from my lambda where the error occurs:

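import py7zr
from io import BytesIO

data = s3.get_object(Bucket='test', Key='test.7z')
contents = data['Body'].read()
try:
    with py7zr.SevenZipFile(BytesIO(contents), mode='r') as z:
        for filename in z.getnames():
            # read() takes a list of names and returns a dict of {name: BytesIO}
            extracted_file = z.read([filename])[filename]
            s3.put_object(Bucket=event['Records'][0]['s3']['bucket']['name'], Key='processed'+filename, Body=extracted_file.getvalue())
            z.reset()  # rewind the archive before the next read() call
except py7zr.exceptions.CrcError as e:
    print(e)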

I have tried pretty much every variation of py7zr's extract, extractall, read, and readall, as well as using refresh after an operation, and none of them solves the CrcError problem. I know the archive is handled correctly up until the extraction because I can list all the files in it. I have also tried downloading the archive to /tmp and passing the file name to py7zr, and that didn't work either.

I have debugged this on my local machine (where all of the py7zr commands work), and there the file's CRC and the expected CRC are the same, so something must be different on Lambda.
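The same comparison can be repeated on Lambda with the standard library alone. Judging by the raise statement quoted above, the first number in the CrcError appears to be the CRC computed from the extracted bytes and the second the value stored in the archive. The sketch below only shows how such a value is computed; `crc32_of` and `crc_matches` are hypothetical helper names.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """Return the unsigned 32-bit CRC of data."""
    return zlib.crc32(data) & 0xFFFFFFFF


def crc_matches(data: bytes, expected_crc: int) -> bool:
    """Compare the CRC of data against an expected value."""
    return crc32_of(data) == expected_crc


if __name__ == "__main__":
    # 3091406592 is the second number from the CrcError above for a.csv.
    with open("a.csv", "rb") as f:
        print(crc_matches(f.read(), 3091406592))
```

Running this against the original a.csv and against the bytes that come out of the archive on Lambda would show which side of the comparison changes between the two environments.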

Has anyone else run into this problem? I've seen other questions telling readers to use py7zr with Lambda, but has anyone actually gotten it to work?

Small update: using try/except I managed to get one file out of the archive, and it appears to be a.csv and b.csv concatenated into a single file. It looks like py7zr might not know where one file ends and the other begins, so all the extracted files are merged into one?

  • 1
    Could it be that py7zr is attempting to create a file somewhere other than `/tmp/`? Perhaps you could try unzipping the files to disk first, rather than extracting the contents to memory (which might cause a temp file to be created). For example: `archive.extractall(path="/tmp")` Then, you could upload the files from disk. You should delete the files in `/tmp/` after uploading, but don't worry about that for now -- it's just a test to see whether it fixes your problem. Or perhaps it's the `BytesIO()` bit that causes it to create a temp file, so you could try downloading the whole zip first. – John Rotenstein Jun 06 '23 at 00:49
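    The test suggested in this comment could be sketched roughly as follows. This is only an illustration of the disk-based approach, not a confirmed fix; the function names, the `processed/` key prefix, and the bucket and key arguments are assumptions, and `s3_client` is expected to be a boto3 S3 client created elsewhere.

```python
import os
import tempfile


def extract_to_dir(archive_path, workdir):
    """Extract a 7z archive from disk into workdir."""
    import py7zr  # imported here so upload_tree can be used without py7zr

    with py7zr.SevenZipFile(archive_path, mode="r") as archive:
        archive.extractall(path=workdir)


def upload_tree(s3_client, workdir, bucket, prefix="processed/"):
    """Upload every file under workdir and return the keys that were written."""
    uploaded = []
    for root, _dirs, files in os.walk(workdir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, workdir).replace(os.sep, "/")
            key = prefix + rel
            s3_client.upload_file(local_path, bucket, key)
            uploaded.append(key)
    return uploaded


def process_archive(s3_client, bucket, key):
    """Download the archive to /tmp, extract it on disk, and upload the results."""
    # The temporary directory is removed on exit, which also cleans up /tmp.
    with tempfile.TemporaryDirectory(dir="/tmp") as workdir:
        archive_path = os.path.join(workdir, "archive.7z")
        s3_client.download_file(bucket, key, archive_path)
        out_dir = os.path.join(workdir, "out")
        os.makedirs(out_dir)
        extract_to_dir(archive_path, out_dir)
        return upload_tree(s3_client, out_dir, bucket)
```

    In the handler this would be called as something like `process_archive(s3, 'test', 'test.7z')`, using the bucket and key from the question.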
