87

What is the best way to prevent duplicate messages in Amazon SQS? I have an SQS queue of domains waiting to be crawled. Before I add a new domain to the queue, I can check it against the saved data to see if it has been crawled recently, to prevent duplicates.

The problem is with the domains that have not been crawled yet. For example, if there are 1,000 domains in the queue that have not been crawled, any of those links could be added again, and again, and again, which swells my queue to hundreds of thousands of messages that are mostly duplicates.

How do I prevent this? Is there a way to remove all duplicates from a queue? Or is there a way to search a queue for a message before I add it? I feel this is a problem that anyone using SQS must have experienced.

One option I can see is to store some data before the domain is added to the queue. But if I have to store the data twice, that kinda ruins the point of using SQS in the first place.

Marcus Lind
  • 10,374
  • 7
  • 58
  • 112
  • Possible duplicate of [Using many consumers in SQS Queue](http://stackoverflow.com/questions/37472129/using-many-consumers-in-sqs-queue) – Krease Jul 10 '16 at 17:29
  • 7
    AWS now offers [fifo queues](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html), which provide "exactly-once processing but are limited to 300 transactions per second". – bishop Sep 07 '17 at 14:05
  • 1
    @bishop yes, FIFO queues allow this now, but duplicates can only be detected within the deduplication interval. https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-sqs-queues.html#cfn-sqs-queue-contentbaseddeduplication – Keet Sugathadasa Apr 19 '18 at 10:14
  • Does SqsMessageDeletionPolicy.ON_SUCCESS solve the duplicate message issue? If so, this would avoid the overhead of using a persistence layer and checking for already-processed messages. Please share your experience. – Prasanna Balaraman Aug 02 '21 at 15:58

6 Answers

90

As the other answers mentioned, you can't prevent duplicate messages coming through from SQS.

Most of the time your messages will be handed to one of your consumers once, but you will run into duplicates at some stage.

I don't think there is an easy answer to this question, because it entails coming up with a proper architecture that can cope with duplicates, meaning it's idempotent in nature.

If all the workers in your distributed architecture were idempotent, it would be easy, because you wouldn't need to worry about duplicates. But in reality, that sort of environment does not exist; somewhere along the way, something will not be able to handle it.

I am currently working on a project where I'm required to solve this and come up with an approach to handle it. I thought it might benefit others to share my thinking here, and this might be a good place to get some feedback on it.

Fact store

It's a pretty good idea to develop services so that they collect facts which can theoretically be replayed to reproduce the same state in all the affected downstream systems.

For example, let's say you are building a message broker for a stock trading platform. (I have actually worked on a project like this before, it was horrible, but also a good learning experience.)

Now let's say that trades come in, and there are 3 systems interested in them:

  1. An old school mainframe which needs to stay updated
  2. A system that collates all the trades and shares them with partners on an FTP server
  3. The service that records the trade, and reallocates shares to the new owner

It's a bit convoluted, I know, but the idea is that one message (fact) coming in has various distributed downstream effects.

Now let's imagine that we maintain a fact store, a recording of all the trades coming into our broker, and that all 3 downstream service owners call us to tell us that they have lost all of their data from the last 3 days. The FTP download is 3 days behind, the mainframe is 3 days behind, and all the trades are 3 days behind.

Because we have the fact store, we could theoretically replay all these messages from a certain time to a certain time. In our example that would be from 3 days ago until now. And the downstream services could be caught up.

This example might seem a bit over the top, but I'm trying to convey something very particular: the facts are the important things to keep track of, because they are what we are going to use in our architecture to battle duplicates.

How the Fact store helps us with duplicate messages

Provided you implement your fact store on a persistence tier that gives you the CA parts of the CAP theorem, consistency and availability, you can do the following:

As soon as a message is received from a queue, you check in your fact store whether you've already seen this message before, and if you have, whether it's locked at the moment, and in a pending state. In my case, I will be using MongoDB to implement my fact store, as I am very comfortable with it, but various other DB technologies should be able to handle this.

If the fact does not exist yet, it gets inserted into the fact store, with a pending state, and a lock expiration time. This should be done using atomic operations, because you do not want this to happen twice! This is where you ensure your service's idempotence.
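
To make that concrete, here is a minimal sketch of such an atomic claim using the Node.js MongoDB driver. The document layout (the deduplication key as `_id`, plus `state` and `lockExpiresAt` fields) is my own assumption, not a prescribed schema:

// `facts` is a MongoDB collection, e.g. client.db('broker').collection('facts').
// Atomically claim a fact: insert it as PENDING with a lock expiry, or take over
// an existing PENDING fact whose lock has already expired. A duplicate key error
// (code 11000) means another worker currently holds the fact, or it is completed.
async function claimFact(facts, dedupKey, lockSeconds) {
    const now = new Date();
    try {
        await facts.findOneAndUpdate(
            { _id: dedupKey, state: 'PENDING', lockExpiresAt: { $lt: now } },
            { $set: { state: 'PENDING', lockExpiresAt: new Date(now.getTime() + lockSeconds * 1000) } },
            { upsert: true }
        );
        return true; // we own the lock and may process the message
    } catch (err) {
        if (err.code === 11000) return false; // duplicate: another worker owns it
        throw err;
    }
}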

Happy case - happens most of the time

When the Fact store comes back to your service telling it that the fact did not exist and that a lock was created, the service attempts to do its work. Once it's done, it deletes the SQS message and marks the fact as completed.

Duplicate message

So that's what happens when a message comes through and it's not a duplicate. But let's look at what happens when a duplicate message comes in. The service picks it up, and asks the fact store to record it with a lock. The fact store tells it that it already exists, and that it's locked. The service ignores the message and skips over it! Once the message processing is done by the other worker, that worker will delete the message from the queue, and we won't see it again.

Disaster case - happens rarely

So what happens when a service records the fact for the first time in the store, then gets a lock for a certain period, but falls over? Well, SQS will present a message to you again if it was picked up but not deleted within a certain period after it was served from the queue. That's why we code up our fact store such that a service maintains a lock for a limited time: because if it falls over, we want SQS to present the message to the service, or another instance thereof, at a later time, allowing that service to assume that the fact should be incorporated into state (executed) again.
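
Tying the three cases together, a consumer loop might look roughly like this (a sketch only; `claimFact` comes from the snippet above, and `doWork` and the queue URL are hypothetical):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/trades'; // hypothetical

async function poll(facts) {
    const data = await sqs.receiveMessage({ QueueUrl: QUEUE_URL, WaitTimeSeconds: 20 }).promise();
    for (const msg of data.Messages || []) {
        // Happy case: we win the lock, do the work, mark the fact completed, delete the message.
        // Duplicate case: claimFact returns false and we skip; the worker holding the lock deletes it.
        // Disaster case: we crash after claiming; the lock expires and SQS redelivers the message
        // after the visibility timeout, so another worker can claim it again.
        if (await claimFact(facts, msg.MessageId, 60)) {
            await doWork(msg); // your actual processing
            await facts.updateOne({ _id: msg.MessageId }, { $set: { state: 'COMPLETED' } });
            await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle }).promise();
        }
    }
}

Note that keying on `MessageId` only catches SQS-level redeliveries; if your producer can enqueue the same logical fact twice, derive the deduplication key from the message content instead.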

hendrikswan
  • 2,263
  • 1
  • 20
  • 25
  • 4
    Awesome answer! Slight nitpick: I would say that in the **Happy Case** you should mark the fact as completed, _and then_ delete the SQS message. I would then also suggest updating the **Duplicate message** case to delete a message if the fact is already marked as completed (not wait for the original handler to do it). – Matt Klein Feb 01 '18 at 16:27
  • Hendrik, could you shed some more light on the disaster case, especially the lock time (i.e. how long should that lock time be, and how long does SQS wait before re-presenting the message to your service)? – Govind Rai Nov 19 '18 at 15:44
  • 9
    This is no longer valid. FIFO queues with a DeduplicationId now make it possible to collapse duplicate messages within 5-minute intervals. See @ayman eltabakh below. – Sean Jan 05 '20 at 16:54
  • There is still a race condition that needs to be catered for: 2 or more processes have the same message from SQS and all are checking the fact store at the same time, or checking as the others are updating. I guess a unique constraint can be used in an RDBMS to handle this; how is that managed in MongoDB? – Paul Barclay May 18 '21 at 15:11
  • @PaulBarclay In MongoDB, you could create a custom `_id` set to some hash generated from the message ID. This would handle the unique constraint. – vighnesh153 Jan 29 '23 at 13:27
25

Amazon SQS Introduces FIFO Queues with Exactly-Once Processing and Lower Prices for Standard Queues

Using the Amazon SQS Message Deduplication ID

The message deduplication ID is the token used for deduplication of sent messages. If a message with a particular message deduplication ID is sent successfully, any messages sent with the same message deduplication ID are accepted successfully but aren't delivered during the 5-minute deduplication interval.

Amazon SQS Introduces FIFO Queues

Using the Amazon SQS Message Deduplication ID
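
For illustration, sending a message with an explicit deduplication ID from the Node.js SDK looks roughly like this (the queue URL and ID scheme are assumptions; the queue must be a FIFO queue):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

sqs.sendMessage({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/domains.fifo', // hypothetical
    MessageBody: 'example.com',
    MessageGroupId: 'crawler', // required for FIFO queues
    MessageDeduplicationId: 'example.com', // repeats of this ID are dropped for 5 minutes
}, function(err, data) {
    if (err) console.error(err);
});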

ayman eltabbakh
  • 251
  • 3
  • 4
  • 1
    Hello ayman, if I got this right, the deduplicationId only prevents a message from being duplicated. But what if one message gets picked up by two workers in parallel? I guess that is what @markus-lind's question was about. – Max Schindler Apr 15 '20 at 12:44
  • 2
    @MaxSchindler Hello Max, FIFO queues provide additional features that help prevent unintentional duplicates from being sent by message producers or from being received by message consumers. FIFO queues were introduced 2 years after markus-lind's question. – ayman eltabbakh Apr 16 '20 at 16:46
  • 3
    dedup works over a 5-minute interval only. – Brian Fitzgerald May 13 '21 at 13:15
9

According to AWS Docs, Exactly-Once Processing provides a way to avoid duplicate messages.

Unlike standard queues, FIFO queues don't introduce duplicate messages. FIFO queues help you avoid sending duplicates to a queue. If you retry the SendMessage action within the 5-minute deduplication interval, Amazon SQS doesn't introduce any duplicates into the queue.

If your queue is a FIFO queue and has content-based deduplication enabled, this feature can be utilized to avoid duplicate messages during the deduplication interval. For more, read this section and the link below.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-sqs-queues.html#cfn-sqs-queue-contentbaseddeduplication
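
As a sketch, enabling content-based deduplication when creating the queue from the Node.js SDK (the queue name is an assumption) would look like:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

sqs.createQueue({
    QueueName: 'domains.fifo', // FIFO queue names must end in .fifo
    Attributes: {
        FifoQueue: 'true',
        // SQS hashes the message body; identical bodies sent within the
        // 5-minute deduplication interval are treated as duplicates.
        ContentBasedDeduplication: 'true',
    },
}, function(err, data) {
    if (err) console.error(err);
});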

Keet Sugathadasa
  • 11,595
  • 6
  • 65
  • 80
7

There is no API-level way of preventing duplicate messages from being posted to an SQS queue. You would need to handle this at the application level, I am afraid.

You can use a DynamoDB table to store the domain names waiting to be crawled, and only add them to the queue if they are not already in DynamoDB, for example.
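
For illustration, a DynamoDB conditional write makes that check-then-enqueue step atomic; a rough sketch (the table, key, and queue names are my own assumptions):

const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();
const sqs = new AWS.SQS();

async function enqueueIfNew(domain) {
    try {
        // The conditional write fails atomically if the domain was already recorded.
        await dynamo.put({
            TableName: 'CrawledDomains', // hypothetical table
            Item: { domain: domain, addedAt: Date.now() },
            ConditionExpression: 'attribute_not_exists(#d)',
            ExpressionAttributeNames: { '#d': 'domain' },
        }).promise();
        await sqs.sendMessage({
            QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/domains', // hypothetical
            MessageBody: domain,
        }).promise();
    } catch (err) {
        if (err.code !== 'ConditionalCheckFailedException') throw err; // duplicate: skip
    }
}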

Sébastien Stormacq
  • 14,301
  • 5
  • 41
  • 64
  • 9
    But if I do that, why even use SQS at all? Why not just let the application read straight from DynamoDB? Maybe I'm misunderstanding the use of SQS, but if I still need to store all data in a database, I feel like SQS loses its value and point. To me, the reason I want to use SQS is to NOT need to write data to a database. – Marcus Lind Apr 24 '14 at 09:26
  • 6
    This is an architecture decision. SQS (or any queuing system) is great at allowing asynchronous communication between applications and at having multiple consumers consuming messages from multiple producers; an example would be between a web tier and a fleet of batch workers. Databases are not designed for this type of communication and would require additional work. But databases are good at sharing state between independent workers or apps. In your use case, maybe a DB would be enough. – Sébastien Stormacq Apr 24 '14 at 09:32
  • I agree with @MarcusLind. SQS is simply a queue (buffer), which holds data for processing. Handling duplicates at the sending or receiving end would make it much easier in terms of the architecture. – Keet Sugathadasa Apr 19 '18 at 10:17
  • SQS now supports FIFO and exactly-once delivery: https://aws.amazon.com/about-aws/whats-new/2016/11/amazon-sqs-introduces-fifo-queues-with-exactly-once-processing-and-lower-prices-for-standard-queues/ – Sébastien Stormacq May 02 '18 at 17:41
  • DynamoDB is NOT a queue, so please do not use it as such. A queue means adding elements very fast to process them later. If you do that in DynamoDB, you will need to scan the whole table to get the messages, and that is an expensive operation. SQS now supports FIFO queues that can deal with exactly-once delivery and keep order too. – Juan Antonio Gomez Moriano Jan 26 '20 at 06:18
  • Thank you Juan. I was not suggesting using DynamoDB as a queue; in 2014 I proposed using DynamoDB to complement SQS functionality, not to replace it. As you mentioned (and I mentioned too in 2018), AWS SQS has evolved since 2014 and other solutions are available today. – Sébastien Stormacq Jan 27 '20 at 11:46
  • @JuanAntonioGomezMoriano Note that the "exactly once delivery" is a bit misleading, as the de-duplication timespan is only 5 minutes, so if the same message comes through with 6 minutes of separation, the duplicate will indeed be delivered twice. – Brian Webster Apr 13 '22 at 14:33
2

If AWS had not brought in the update to detect duplicate messages, I'd have done something like this (just thinking aloud): push everything to Redis with a TTL of, let's say, 15 minutes (or whatever period is suitable for the use case) as soon as the request reaches your gateway/environment, and make your producer check Redis for any message coming from a user with the same amount (or whatever logic makes it a duplicate per the business logic) before sending/putting the message in the queue. If the message exists, return a response to the client stating a duplicate was found; if not, allow it to be pushed to the queue. Redis needs to be really HA.
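
A rough sketch of that producer-side check with the ioredis client (the key scheme and 15-minute TTL are assumptions from the description above):

const Redis = require('ioredis');
const redis = new Redis(); // assumes a highly available Redis endpoint

async function isDuplicate(dedupKey) {
    // NX: only set if the key does not exist yet; EX 900: expire after 15 minutes.
    const result = await redis.set('dedup:' + dedupKey, '1', 'EX', 900, 'NX');
    return result === null; // null means the key already existed, i.e. a duplicate
}

The producer would call `isDuplicate` before `sendMessage` and reject the request when it returns true.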

Kid101
  • 1,440
  • 11
  • 25
  • 2
    I'm also using this solution, but I'm not happy about having to add a potential point of failure like Redis. You're right, the Redis needs to be HA. – Brian Webster Apr 13 '22 at 14:34
  • Better to consider DynamoDB with time-to-live turned on, and filter out items past their TTL (TTL is not instant and can take hours to react); no need to run overprovisioned Redis instances in such a setup. – Lukas Liesis Jul 27 '22 at 20:17
0

You can use the VisibilityTimeout parameter:

var AWS = require('aws-sdk');
var sqs = new AWS.SQS();

var params = {
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue', // your queue URL
    VisibilityTimeout: 20, // seconds a received message stays hidden from other consumers
    // ... other optional parameters
};

sqs.receiveMessage(params, function(err, data) {});

reference: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html

VisibilityTimeout — (Integer)
The duration (in seconds) that the received messages are hidden from subsequent retrieve requests after being retrieved by a ReceiveMessage request.

Renny Ren
  • 77
  • 1
  • 4
  • 2
    Hi, the visibility timeout just keeps a locked message hidden in the SQS queue until the timeout expires. The same message can still be in the queue. – HernanFila Apr 13 '21 at 22:14
  • 1
    It has no effect in this use case; it just adds some delay/lag, basically. – Lukas Liesis Jul 27 '22 at 20:15