Right now I have a cron job that runs once a day. It pipes a curl command into a file, gzips that file, then uploads it to an s3 bucket. I'd like to move this off of my server and into aws tooling. What's the recommended way to do this currently? Make a lambda function and schedule it to run daily?
1 Answer
The most cost-effective option would be the one you describe:

- Create a Lambda function that downloads your content, gzips it, and uploads it to S3. Lambda functions have access to local storage (512 MB in /tmp). Do not forget to delete the file afterwards, because the container may be reused (within your account).
- Schedule a CloudWatch Events rule to trigger the Lambda function at a regular interval.
- Configure the function's permissions to authorise CloudWatch Events to invoke it:
aws lambda add-permission --function-name my-function \
    --action 'lambda:InvokeFunction' --principal events.amazonaws.com \
    --statement-id events-access \
    --source-arn 'arn:aws:events:*:123456789012:rule/*'
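The first step above can be sketched as a small Lambda handler. This is a minimal sketch, not a drop-in implementation: SOURCE_URL, BUCKET, and the object key backup.gz are hypothetical placeholders you would replace with your own values.

```python
import gzip
import os
import urllib.request

# Hypothetical settings -- substitute your own URL and bucket name.
SOURCE_URL = os.environ.get("SOURCE_URL", "https://example.com/data.json")
BUCKET = os.environ.get("BUCKET", "my-backup-bucket")


def gzip_bytes(data: bytes) -> bytes:
    """Compress a byte string with gzip."""
    return gzip.compress(data)


def handler(event, context):
    # boto3 is preinstalled in the Lambda runtime, so import it here.
    import boto3

    # Download the content (the equivalent of the curl step).
    with urllib.request.urlopen(SOURCE_URL) as resp:
        body = resp.read()

    # Write the compressed file to /tmp -- Lambda's scratch space.
    tmp_path = "/tmp/backup.gz"
    with open(tmp_path, "wb") as f:
        f.write(gzip_bytes(body))

    # Upload to S3, then delete the file so a reused container starts clean.
    boto3.client("s3").upload_file(tmp_path, BUCKET, "backup.gz")
    os.remove(tmp_path)
```

The function's execution role also needs s3:PutObject on the target bucket for the upload to succeed.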
[UPDATE]: what if the file to download is 4 GB?
In that case, you have two options: one that takes more work but is more cost-effective, and one that is easier to implement but may cost a bit more.
Option 1: full serverless
You can design your AWS Lambda function to download the 4 GB content as a stream, compress it chunk by chunk, and send it to S3 as a multipart upload in parts of at least 5 MB. I am not a compression expert, but gzip is a streaming format, so it must be possible to find a library handling that for you. The downside is that you need to write specific code; it will not be as easy as combining the AWS CLI and the gzip command-line tool.
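The chunk-by-chunk compression part can be sketched with the standard library alone. This is an illustration under assumptions: gzip_stream is a hypothetical helper name, and the S3 multipart-upload side (create_multipart_upload / upload_part, buffering each part to at least 5 MB) is not shown.

```python
import gzip
import io

CHUNK = 5 * 1024 * 1024  # S3 multipart uploads require parts of at least 5 MB


def gzip_stream(chunks):
    """Compress an iterable of byte chunks incrementally.

    Yields compressed bytes as they become available, so the whole
    4 GB download never has to fit in memory (or in /tmp).
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for chunk in chunks:
            gz.write(chunk)
            # Drain whatever the compressor has flushed so far.
            if buf.tell():
                yield buf.getvalue()
                buf.seek(0)
                buf.truncate()
    # Closing the GzipFile wrote the remaining data and the gzip trailer.
    if buf.tell():
        yield buf.getvalue()
```

You would accumulate the yielded compressed pieces until a part reaches 5 MB, upload it with upload_part, and finish with complete_multipart_upload.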
Option 2: start an EC2 instance for the duration of the job
The scheduled Lambda function can use the EC2 API to start an instance. The job script can be passed to the instance as user data (a script the instance executes at boot time). That script can call TerminateInstances when the job is done, so the instance kills itself and you stop being charged for it.
The downside is that you pay for the time this instance is running (the free tier includes 750 hours/month of t2.micro usage). The upside is that you can use standard command-line tools such as the AWS CLI and gzip, and you have plenty of local storage for your task.
Here is how to start an instance from Python: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Client.start_instances
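Option 2 can be sketched as below. This is a sketch under assumptions: the AMI ID is a placeholder, build_user_data and start_job_instance are hypothetical helper names, and the instance would need an instance profile allowing s3:PutObject and ec2:TerminateInstances (not shown).

```python
def build_user_data(url: str, bucket: str) -> str:
    """Shell script the instance runs at boot: download, gzip, upload,
    then terminate itself so billing stops."""
    return "\n".join([
        "#!/bin/bash",
        f"curl -sSL '{url}' | gzip > /tmp/backup.gz",
        f"aws s3 cp /tmp/backup.gz s3://{bucket}/backup.gz",
        # The instance looks up its own ID via instance metadata,
        # then terminates itself.
        "INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)",
        'aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"',
    ])


def start_job_instance(url: str, bucket: str,
                       ami: str = "ami-0abcdef1234567890"):
    # boto3 is preinstalled in the Lambda runtime.
    import boto3
    ec2 = boto3.client("ec2")
    # run_instances launches a fresh instance with the job script as user data.
    return ec2.run_instances(
        ImageId=ami,  # placeholder AMI ID -- use one from your region
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        UserData=build_user_data(url, bucket),
    )
```

Because the instance is launched fresh each run, run_instances is used here rather than start_instances; the latter restarts an existing stopped instance.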

- What if the webpage is giving me back 4 GB worth of data? – user433342 Apr 25 '19 at 21:42
- That's the limit of Lambda. I will update the answer with options. – Sébastien Stormacq Apr 26 '19 at 08:21
- Yeah, took a while to get the streaming working, but I am streaming the download through gzip into the tmp folder, then I upload from the tmp folder to S3. Thanks. – user433342 Apr 28 '19 at 20:06