
I'm trying to compress all the files in a directory on an S3 bucket, keeping the same directory structure, and put the resulting zip back on the S3 bucket.

Unpacking a zip file from an S3 bucket back into an S3 bucket is quite easy with BytesIO and zipfile, but I'm not sure how to go the other direction with a directory containing a hundred files.
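
For context, here's roughly what I mean by the unpack direction (the bucket and key names are just placeholders):

import boto3
import io
import zipfile

s3 = boto3.client('s3')

# Pull the zip into memory, then re-upload each member under a new prefix
obj = s3.get_object(Bucket='mybucket', Key='archive.zip')
with zipfile.ZipFile(io.BytesIO(obj['Body'].read())) as zf:
    for name in zf.namelist():
        if name.endswith('/'):  # skip directory entries
            continue
        s3.put_object(Bucket='mybucket', Key=f'unpacked/{name}', Body=zf.read(name))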

I found this post helpful, but it's for Node.js on Lambda: Create a zip file on S3 from files on S3 using Lambda Node


1 Answer

To avoid downloading the individual objects onto disk, you can stream each object under the prefix into an in-memory zip (remember: S3 keys are flat, so a "directory" is really just a key prefix), save the zip locally, upload it to S3, then delete the local copy. Here's the code I would use (and tested successfully in AWS):

import boto3
import io
import os
import zipfile

s3 = boto3.client('s3')

def zip_files(bucket_name, prefix):
    # Create a BytesIO object to hold the zip archive in memory
    zip_buffer = io.BytesIO()

    # Open the archive once and stream every object under the prefix into it.
    # (Re-opening it with mode 'w' inside the loop would clobber earlier entries.)
    with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zip_file:
        # Paginate so prefixes with more than 1000 keys are fully listed
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            for obj in page.get('Contents', []):
                s3_object = s3.get_object(Bucket=bucket_name, Key=obj['Key'])
                # Write under the full key so the directory structure is preserved
                zip_file.writestr(obj['Key'], s3_object['Body'].read())

    # Save the zip file to disk
    local_zip = f'{prefix.rstrip("/")}.zip'
    with open(local_zip, 'wb') as f:
        f.write(zip_buffer.getvalue())

    # Upload the compressed data to the S3 bucket, then delete the local copy
    zip_buffer.seek(0)
    s3.put_object(Bucket=bucket_name, Key=f'{prefix}{local_zip}', Body=zip_buffer)
    os.remove(local_zip)

bucket = 'foobucket'
folders = ['foo/', 'bar/', 'baz/']
for folder in folders:
    zip_files(bucket, folder)

You haven't provided any Python code showing that you're hitting the same memory limit described in the Lambda Node post you linked, so I'm assuming that isn't a huge concern. Either way, the os.remove keeps the local footprint small as the loop moves through each prefix.
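
If memory ever does become an issue, one variation is to build the archive on disk and let upload_file stream it to S3 in chunks, instead of holding the whole buffer in memory. A sketch along those lines (zip_files_to_disk is just an illustrative name):

import os
import zipfile

import boto3

s3 = boto3.client('s3')

def zip_files_to_disk(bucket_name, prefix):
    # Build the archive on disk instead of in memory; upload_file then streams
    # it to S3 in chunks, so peak memory stays low however large the zip gets.
    local_zip = f'{prefix.rstrip("/")}.zip'
    with zipfile.ZipFile(local_zip, 'w', zipfile.ZIP_DEFLATED) as zip_file:
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
            for obj in page.get('Contents', []):
                body = s3.get_object(Bucket=bucket_name, Key=obj['Key'])['Body']
                zip_file.writestr(obj['Key'], body.read())
    s3.upload_file(local_zip, bucket_name, f'{prefix}{local_zip}')
    os.remove(local_zip)

Note that this still reads one object at a time into memory, just never the whole archive at once.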

Also: if you're running this logic within a Lambda function, you'll have to wrap it in the handler format Lambda expects.
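
A minimal entry point might look like the sketch below; the event shape here is just an assumption, and keep in mind the only writable path in Lambda is /tmp, so the local zip would need to be written there:

def lambda_handler(event, context):
    # Hypothetical event shape: {"bucket": "foobucket", "folders": ["foo/", "bar/"]}
    for folder in event['folders']:
        zip_files(event['bucket'], folder)
    return {'statusCode': 200, 'body': f"zipped {len(event['folders'])} prefixes"}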

Obviously, add logging and error handling to suit your needs.
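
As a starting point, something like this sketch using the standard logging module and botocore's ClientError:

import logging

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

for folder in folders:
    try:
        zip_files(bucket, folder)
        logger.info('zipped prefix %s', folder)
    except ClientError:
        # Log the failing prefix and move on to the remaining ones
        logger.exception('failed to zip prefix %s', folder)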

Hope this helps!