
I have an S3 bucket with a bunch of ZIP files. I want to decompress the ZIP files and, for each decompressed item, create a gzipped version ($file.gz) and save it to another S3 bucket. I was thinking of creating a Glue job for it but I don't know where to begin. Any leads?

Eventually, I would like to Terraform my solution, and it should be triggered whenever there are new files in the S3 bucket.

Would a Lambda function or any other service be more suited for this?

  • Does this need to be triggered by anything? Or do you just need to decompress files in `x` bucket, create a gzipped package and move to bucket `y` when you "click on a button"? – Ermiya Eskandary Oct 15 '21 at 14:27
  • well, I would eventually terraform it and it should be triggered whenever there are new files in the S3 bucket @ErmiyaEskandary – x89 Oct 15 '21 at 14:28
  • That massively changes things! :) add it into your question please:) – Ermiya Eskandary Oct 15 '21 at 14:30

1 Answer


From an architectural point of view, it depends on the size of your ZIP files: if the whole process takes less than 15 minutes, you can use a Lambda function.

If it takes longer, you'll hit the current 15-minute Lambda timeout and will need a different solution.

Either way, for your use case of triggering on new files, S3 event notifications will let you invoke a Lambda function whenever objects are created in (or deleted from) the bucket.
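If you do go with Lambda, the bucket and key of the uploaded ZIP come straight from the event payload. Here's a minimal sketch of the handler's entry point, assuming a standard S3 ObjectCreated notification (the handler and helper names are just illustrative):

```python
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    # Each S3 notification can contain one or more records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 notifications are URL-encoded
        key = unquote_plus(record["s3"]["object"]["key"])
        # process_zip(bucket, key)  # e.g. the unzip/gzip steps sketched below
```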

I would recommend segregating the ZIP files into their own bucket; otherwise, the Lambda will be triggered for every upload to the bucket and you'll also be paying for invocations that only check whether the uploaded file is in your specific "folder" (it'll be negligible but still worth pointing out). If the files are segregated, you'll know that any uploaded file is a ZIP file.

Your Lambda can then download the file from S3 using `download_file` (there's an example in the Boto3 documentation), unzip it using `zipfile` & finally gzip-compress each extracted file using `gzip`.
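A rough sketch of how that download/extract/compress step could look, assuming the files comfortably fit in Lambda's `/tmp` storage (the function name and paths are illustrative, not from your question):

```python
import gzip
import os
import shutil
import zipfile

import boto3

s3 = boto3.client("s3")

def unzip_and_gzip(bucket, key):
    """Download a ZIP from S3, extract it into /tmp and gzip each extracted file."""
    zip_path = os.path.join("/tmp", os.path.basename(key))
    s3.download_file(bucket, key, zip_path)

    extract_dir = "/tmp/extracted"
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)

    gz_paths = []
    for name in os.listdir(extract_dir):
        src = os.path.join(extract_dir, name)
        dst = src + ".gz"
        with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        gz_paths.append(dst)
    return gz_paths
```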

You can then upload the output file(s) to the new bucket using `upload_file` (again, there's an example in the Boto3 documentation) & then delete the original file from the original bucket using `delete_object`.
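Continuing the sketch above, the upload and cleanup might look roughly like this (the helper name, argument names and destination key layout are assumptions, so adjust to taste):

```python
def upload_and_cleanup(source_bucket, source_key, dest_bucket, gz_paths):
    """Upload each gzipped file to the destination bucket, then remove the original ZIP."""
    for path in gz_paths:
        s3.upload_file(path, dest_bucket, os.path.basename(path))
    s3.delete_object(Bucket=source_bucket, Key=source_key)
```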

Terraforming the above should also be relatively simple as you'll mostly be using the aws_lambda_function & aws_s3_bucket resources.

Make sure your Lambda has the correct execution role with the appropriate IAM policies to access both S3 buckets & you should be good to go.

  • I think the number of files might be a bit large and hence, I should not use Lambda. Can I use the same features (download_file, zipfile, gzip etc.) within a Glue job? And then run that Glue job using a Lambda function triggered by an S3 trigger? – x89 Oct 17 '21 at 12:29
  • Number of files or file size? Do you have an approx figure? – Ermiya Eskandary Oct 17 '21 at 12:29
  • each zipped folder contains 26 files of 1086 bytes each. At one time, I will only have to unzip one file I think. – x89 Oct 17 '21 at 12:37
  • bytes? you'll be fine with Lambda, I thought you're trying to decompress and compress 5GB ZIPs - don't worry about it, 1000 is also fine in one go (max soft concurrency limit is 1000 per region) but then for the triggered files, you'll be fine; feel free to request a limit increase via AWS for prod – Ermiya Eskandary Oct 17 '21 at 12:39
  • Once downloaded ```s3.download_file('testunzipping','DataPump_10000838.zip','/tmp/DataPump_10000838.zip')```, how can I use zipfile to unzip it? Can I assign this to another variable? I don't see an unzip function in the link you suggested – x89 Oct 17 '21 at 16:50
  • https://stackoverflow.com/a/3451150/4800344 – Ermiya Eskandary Oct 17 '21 at 16:52
  • so I extract them within ```tmp``` and give ```/tmp/DataPump_10000838.zip``` as the path_to_zip_file? – x89 Oct 17 '21 at 16:54
  • Yes correct, the temporary file system (/tmp) is the only directory a Lambda has access to - you have 512 MB of storage available – Ermiya Eskandary Oct 17 '21 at 16:55
  • @x89 done but please note comments are not for extended discussion :) I keep an eye on questions ;) – Ermiya Eskandary Oct 25 '21 at 10:32