
I am interested in doing automated, real-time data processing on AWS using Lambda, and I am not certain how to trigger my Lambda function. My data processing code takes multiple files, performs calculations on each one, and concatenates them into a single data frame. Since the files are uploaded to S3 at roughly the same time and depend on each other, I would like the Lambda to be triggered only once all of the files have been uploaded.

Current Approaches/Attempts:

-I am considering an S3 trigger, but my concern is that a single file upload would trigger the Lambda before the rest of the files arrive and cause an error. An alternative would be adding a wait time inside the function, but I would prefer to avoid that to limit the compute resources used.

-A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.

-An SNS trigger, but I am not certain how the message could be published automatically without knowing when the file uploads are complete.

Any suggestion is appreciated! Thank you!

lqw1001

3 Answers


If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.

The tricky bit is that it will fire your function on every object upload. So you can either identify the "last part", e.g. based on some metadata, or you will need to store and track the state of all uploads, e.g. in DynamoDB, and do the actual processing only when a batch is complete.
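
For the DynamoDB option, a minimal sketch of such a handler in Python might look like this; the table name, key layout, and expected batch size are assumptions you would adapt to your own setup:

```python
import urllib.parse

import boto3

TABLE_NAME = "upload-tracker"        # hypothetical table name
EXPECTED_FILES_PER_BATCH = 3         # assumed, known batch size

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)


def handler(event, context):
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Assume keys look like "batch_id/filename", so the prefix groups a batch.
        batch_id = key.split("/", 1)[0]

        # Record this upload and read back which parts have arrived so far.
        response = table.update_item(
            Key={"batch_id": batch_id},
            UpdateExpression="ADD received_keys :k",
            ExpressionAttributeValues={":k": {key}},
            ReturnValues="ALL_NEW",
        )
        received = response["Attributes"]["received_keys"]

        if len(received) >= EXPECTED_FILES_PER_BATCH:
            process_batch(batch_id, received)   # your concatenation logic goes here


def process_batch(batch_id, keys):
    print(f"All files for {batch_id} arrived: {sorted(keys)}")
```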

Best, Stefan

StefanN

Your file, arriving in parts, might be named like -

filename_part1.ext
filename_part2.ext

If one of your systems is generating those files, have that same system generate a final dummy (blank) file named -

filename.final

Since an S3 event trigger lets you filter on a key suffix, use the .final extension to invoke the Lambda and process the records.
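
If you manage the bucket's notification configuration yourself, one way to set that suffix filter with boto3 could look like the sketch below (the bucket name and function ARN are placeholders; the same filter can also be configured in the S3 console):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name and Lambda ARN -- replace with your own values.
s3.put_bucket_notification_configuration(
    Bucket="my-upload-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-batch",
                "Events": ["s3:ObjectCreated:*"],
                # Only keys ending in ".final" will invoke the function.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".final"}]}
                },
            }
        ]
    },
)
```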

As an alternative approach, if you do not have access to the server putting objects into your S3 bucket, invoke the Lambda on every PUT operation in the bucket and insert an entry in DynamoDB. You need to put a unique entry per file (not per file part) in DynamoDB with -

filename and last_part_received_time

The last_part_received_time keeps getting updated as long as file parts keep arriving.

Now, this table can be looked up by a scheduled (cron) Lambda invocation, which checks whether the time skew (the difference between the system time at Lambda invocation and the DynamoDB entry's last_part_received_time) is large enough to process the records.
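
A rough sketch of that scheduled check, assuming a hypothetical table keyed by filename with last_part_received_time stored as epoch seconds and an arbitrary 60-second skew:

```python
import time

import boto3

TABLE_NAME = "file-upload-tracker"   # hypothetical table name
SKEW_SECONDS = 60                    # assumed quiet period after the last part

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)


def handler(event, context):
    now = int(time.time())
    # A scan is fine for a small tracking table; use a query/GSI at larger scale.
    for item in table.scan()["Items"]:
        if item.get("processed"):
            continue
        if now - int(item["last_part_received_time"]) >= SKEW_SECONDS:
            process_file(item["filename"])          # your concatenation logic
            table.update_item(
                Key={"filename": item["filename"]},
                UpdateExpression="SET processed = :t",
                ExpressionAttributeValues={":t": True},
            )


def process_file(filename):
    print(f"No new parts of {filename} for {SKEW_SECONDS}s; processing")
```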

I would still prefer to go with the first approach, as the second one still leaves room for error.

Dev Utkarsh

Since you want this to be as real-time as possible, perhaps you could just run your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
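
As a rough illustration of that idea, assuming CSV files grouped under a per-batch key prefix and pandas for the data frame work (bucket name and key layout are placeholders):

```python
import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")

BUCKET = "my-upload-bucket"   # hypothetical bucket name


def handler(event, context):
    # Derive the grouping prefix from the key that triggered this invocation,
    # assuming keys look like "group-id/part-001.csv".
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])
    prefix = key.split("/", 1)[0] + "/"

    frames = []
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix).get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        frames.append(pd.read_csv(io.BytesIO(body)))   # per-file calculations go here

    # Overwrite the combined output for this group; it converges once all files exist.
    # Writing under a separate prefix avoids re-triggering the function.
    combined = pd.concat(frames, ignore_index=True)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"output/{prefix.rstrip('/')}.csv",
        Body=combined.to_csv(index=False).encode(),
    )
```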

In terms of the architecture, you could add an SQS queue or two to make this more resilient. An S3 PUT event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts the event in a secondary queue with a delivery delay (sort of like a backoff strategy) or back in the same queue for retries.
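
A minimal sketch of the Lambda side of that, assuming the function is subscribed to the queue and using a hypothetical secondary queue URL; failed events are re-queued with a delivery delay as a simple backoff:

```python
import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical URL of the secondary (retry) queue.
RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/retry-queue"


def handler(event, context):
    # When Lambda is subscribed to SQS, each record body holds the S3 event.
    for record in event["Records"]:
        s3_event = json.loads(record["body"])
        try:
            process(s3_event)   # your processing logic
        except Exception:
            # Push the failed event to the secondary queue with a delivery delay,
            # so it is retried later instead of being lost.
            sqs.send_message(
                QueueUrl=RETRY_QUEUE_URL,
                MessageBody=record["body"],
                DelaySeconds=300,
            )


def process(s3_event):
    ...
```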

Yann Stoneman