8

We have been using Amazon S3 event notifications to trigger Lambda functions when files land on S3. This model worked reasonably well until we noticed that some files were being processed multiple times, generating duplicates in our datastore. It happens for about 0.05% of our files.

I know I can guard against this by performing an upsert, but what concerns us is the cost of running unnecessary Lambda invocations.
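(To illustrate the upsert guard, a minimal sketch assuming a PostgreSQL table with a unique `s3_key` column; the table, column, and environment variable names are made up:)

```python
# Minimal sketch of the upsert guard; the table and column names are
# hypothetical. Keying on the S3 object key makes a duplicate event a no-op.
import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed env var

def store_result(s3_key: str, payload: str) -> None:
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO processed_files (s3_key, payload)
            VALUES (%s, %s)
            ON CONFLICT (s3_key) DO NOTHING
            """,
            (s3_key, payload),
        )
```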

I've searched Google and SO, but only found similar-ish issues. This is not a timeout problem, as the files are processed fully, and our files are rather small: the biggest is less than 400 KB. Nor are we receiving the same event twice, as the events have different request IDs, even though they refer to the same file.

chaos
  • 641
  • 10
  • 21
  • Couple of questions: did you raise this with AWS Support in case an investigation yields anything useful? And are you 100% sure that the objects that caused multiple events were not uploaded multiple times? That would fit the symptom of different request IDs. Interesting 0.05% statistic, thanks for sharing that. The additional cost of any duplicate processing would seem to be quite low in that case so probably worth comparing to the additional cost of the orchestration that you'd have to build *not* using S3 triggers and Lambda to see if it makes sense. – jarmod Jun 26 '19 at 13:52
  • @jarmod We didn't raise it with AWS, as digging indicated that we had the wrong use case / design for this solution. We are sure the files were written only once: we were versioning the files, and our logs indicate each file was created only once. – chaos Jun 26 '19 at 16:20
  • Also, we expect to generate a few billion files over the course of a year, and this Lambda is part of a multi-stage process: the first file is split into 20+ files, each file is processed by a different Lambda, and at the end of the process the results are loaded into a database for near-real-time analytics and reports. The duplication was carrying over through every stage. – chaos Jun 26 '19 at 16:27

4 Answers

14

After spending quite some time digging through the S3, SNS, and Lambda documentation, I found a note specific to S3 notifications that reads:

If your application requires particular semantics (for example, ensuring that no events are missed, or that operations run only once), we recommend that you account for missed and duplicate events when designing your application.

https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html

Effectively, this means that S3 notifications are the wrong solution for our use case. Considering the research time I've invested in this issue, I thought I'd contribute the finding here for anyone else who may have overlooked the page linked above.

chaos
  • 641
  • 10
  • 21
  • 2
    Are you absolutely sure the events are *perfect* duplicates with identical payload? Specifically `responseElements` and `sequencer`? The service does not assure perfect 1:1 but the ratio you report seems higher than I would expect, and I do have environments with zero documented instances of duplicates. Personally, I like to use S3 > SNS > Lambda even when SNS is not strictly needed, because then I can subscribe an SQS queue to the SNS topic and capture events in parallel in that queue, for separate analysis. – Michael - sqlbot Jun 26 '19 at 14:33
  • 1
    `responseElements['x-amz-request-id']` correlates to the `Request ID` column in the [S3 bucket logs](https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html) and for all events corresponding to any one single object key, a lexical comparison of `sequencer` provides the order in which S3 believes those events occurred. – Michael - sqlbot Jun 26 '19 at 14:37
  • The events create a new row in our DB with the exact same values. In our POC it worked flawlessly, and we only noticed this two months into running it. Our logs for those writes indicate that they happened only once, and some of the duplicates have different timestamps in the notifications. We found different SNS request IDs, a minute apart, for the same event and file. I didn't check the S3 logs, but will try now. – chaos Jun 26 '19 at 16:34
  • Can you use Redis to implement a distributed lock for this? Use the S3 object key as the lock and make sure each key is processed only once. – Ashika Umanga Umagiliya Dec 16 '20 at 06:48
1

If the sequencer value is the same for the duplicate events: as a workaround, you can send the notifications to a secondary database, or maintain an index of S3 objects using the event notifications, and then store and compare the sequencer key values to check for duplicates as each event notification is processed. I did additional research on how to compare unique values from the event notification in a Lambda function and found an article [1] that might help achieve this. Also have a look at external articles [2] and [3] for sample code, and be sure to test this in your development environment before implementing it in production.
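As a rough sketch of that dedup check (the DynamoDB table name and key attribute here are my own assumptions, not taken from the referenced articles), a conditional write on the object key plus sequencer rejects events that were already recorded:

```python
# Sketch: record each (object key, sequencer) pair with a conditional
# write; a ConditionalCheckFailedException means this exact event was
# already processed. Table name "s3-event-dedup" and "pk" are assumptions.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("s3-event-dedup")

def handler(event, context):
    for record in event["Records"]:
        obj = record["s3"]["object"]
        dedup_id = f'{obj["key"]}#{obj["sequencer"]}'
        try:
            table.put_item(
                Item={"pk": dedup_id},
                ConditionExpression="attribute_not_exists(pk)",
            )
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # duplicate delivery; skip it
            raise
        process_record(record)  # placeholder for the real business logic

def process_record(record):
    pass  # real processing goes here
```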

References:

[1] https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/

[2] https://cloudonaut.io/your-lambda-function-might-execute-twice-deal-with-it/

[3] https://adrianhesketh.com/2020/11/27/idempotency-and-once-only-processing-in-lambda-part-1

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-ask). – Community Sep 17 '21 at 17:23
  • Thank you Nitu, I'm marking this answer as correct, as the idempotency link was what solved the issue for us ages ago. Also kudos to @Michael - sqlbot, as his mention of `sequencer` all those years ago helped put us in the right direction. – chaos Sep 14 '22 at 17:14
0

If the sequencer key doesn't match between the events, then the export process is uploading the same object multiple times, and each upload triggers an event notification with a different sequencer key. In this case, the events are not considered duplicates, and the Lambda function is invoked each time the object is uploaded. This is expected behavior.

If the sequencer key does match between the events, then the export process uploaded the object once, but Amazon S3 generated duplicate events with the same sequencer key, resulting in multiple Lambda invocations. This is a rare condition that happens due to the retry behavior of the Amazon S3 service, and the workaround is to store and compare the sequencer key values to check for duplicates as each event notification is processed.
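A small sketch of that comparison (per my reading of the S3 event message structure documentation, sequencer values are hex strings that can differ in length, so the shorter one is right-padded with zeros before comparing; the function names are mine):

```python
# Compare two sequencer values for the SAME object key. Sequencers are
# hex strings of possibly different lengths; right-pad with zeros first.
def _pad(seq: str, width: int) -> str:
    return seq.ljust(width, "0")

def is_duplicate(seq_a: str, seq_b: str) -> bool:
    """Same sequencer means S3 re-delivered the same event."""
    width = max(len(seq_a), len(seq_b))
    return _pad(seq_a, width) == _pad(seq_b, width)

def is_newer(seq_a: str, seq_b: str) -> bool:
    """True if seq_a orders after seq_b for the same object key."""
    width = max(len(seq_a), len(seq_b))
    return _pad(seq_a, width) > _pad(seq_b, width)
```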

Rome_Leader
  • 2,518
  • 9
  • 42
  • 73
  • Hi Nitu, thank you for your answer. It is indeed a rare condition, but it occurs once every few tens of thousands of events, which in our case meant multiple times a day, every day. – chaos Sep 14 '22 at 17:07
0

We resolved the issue by limiting the Lambda function's concurrency to 1.
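For reference, one way to apply that limit with boto3 (the function name is a placeholder):

```python
# Sketch: reserve a concurrency of 1 so only one instance of the
# function runs at a time. The function name is a placeholder.
import boto3

boto3.client("lambda").put_function_concurrency(
    FunctionName="my-processing-function",
    ReservedConcurrentExecutions=1,
)
```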

Igor S
  • 1
  • 1
    This answer may solve some issues, but it definitely has a lot of side effects; scalability would be harmed, especially with sync invocations, where the calls would be throttled (https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html#throttling-behavior). – Nivardo Albuquerque Sep 13 '22 at 14:03
  • As Nivardo mentioned, scalability would be a problem. During peak times, the Lambdas were already running at a concurrency level in the thousands; limiting it to 1 would mean a huge build-up of events to be processed. – chaos Sep 14 '22 at 17:01