I understand that to achieve vast scalability and reliability, SQS parallelizes resources extensively. It uses redundant servers even for small queues, and messages posted to a queue are stored redundantly as multiple copies. These factors prevent it from offering exactly-once delivery, as RabbitMQ does. I have even seen deleted messages being delivered again.

The implication for developers is that they need to be prepared for duplicate delivery of messages. Amazon claims this is not a problem, but if it is, the developer must use some synchronization construct, such as a database transaction lock or a DynamoDB conditional write. Both of these reduce scalability.
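
The conditional-write workaround can be sketched as follows. This is a minimal, in-memory stand-in for DynamoDB's `PutItem` with `ConditionExpression="attribute_not_exists(message_id)"`; the `ConditionalStore` and `handle` names are hypothetical, for illustration only:

```python
import threading

class ConditionalStore:
    """In-memory stand-in for a DynamoDB conditional write:
    PutItem with ConditionExpression="attribute_not_exists(message_id)".
    The atomic insert-if-absent is what makes duplicate deliveries safe."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def put_if_absent(self, message_id):
        with self._lock:
            if message_id in self._seen:
                return False  # conditional check failed: already processed
            self._seen.add(message_id)
            return True

def handle(store, message_id, body, results):
    """Run the side effect only if this message id has not been claimed."""
    if store.put_if_absent(message_id):
        results.append(body)  # the real work would happen here

results = []
store = ConditionalStore()
handle(store, "msg-1", "charge account", results)
handle(store, "msg-1", "charge account", results)  # duplicate delivery: skipped
```

Against real DynamoDB, `put_if_absent` would be a `put_item` call whose conditional-check failure signals the duplicate; catching that failure replaces the `False` branch here. As noted, the conditional write serializes on the message id, which is exactly the scalability cost described above.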

The question is:

In light of the duplicate delivery problem, how does the message-invisibility-period feature hold up? The message is not guaranteed to be invisible. If the developer has to make their own arrangements for synchronization, what benefit does the invisibility period provide? I have seen messages re-delivered even while they were supposed to be invisible.

Edit

Here I include some references:

  1. What is a good practice to achieve the "Exactly-once delivery" behavior with Amazon SQS?
  2. http://aws.amazon.com/sqs/faqs/#How_many_times_will_I_receive_each_message
  3. http://aws.amazon.com/sqs/faqs/#How_does_Amazon_SQS_allow_multiple_readers_to_access_the_same_message_queue_without_losing_messages_or_processing_them_many_times
  4. http://aws.amazon.com/sqs/faqs/#Can_a_deleted_message_be_received_again
inquisitive
  • I'm curious - I've done extensive work with SQS and never seen these problems. Not sure whether it's luck, or that the applications and enterprise systems I've built with it didn't matter if they picked up the same message. Do you have any references to documentation around this? – Pete - MSFT Sep 02 '13 at 09:03
  • @PeterH., I updated the question with references – inquisitive Sep 02 '13 at 09:10
  • Embarrassing - right there in the FAQ! Thanks. RTFM for me. – Pete - MSFT Sep 02 '13 at 09:15
  • Some further suggested reading on [why duplication cannot be eliminated](http://stackoverflow.com/a/38290017/836214) – Krease Jul 10 '16 at 17:31
  • SQS now has FIFO queues that guarantee exactly-once message delivery. See http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html – Lophat Nov 29 '16 at 22:29
  • @Lophat, FIFO queues are only available in the US West and US East regions – danny Mar 01 '17 at 06:14
  • Hi @inquisitive, do you have a rough estimate of how often message duplication happens? Something like once every thousand messages? The AWS docs say this should be rare. – danny Mar 01 '17 at 06:25
  • @danny, yes, about once every few thousand. To me it looks like a burst thing: once it starts redelivering, it goes on doing so for a few seconds, and I find clusters of such events in my logs. I have used DynamoDB + strong consistency + conditional writes to work around it. – inquisitive Mar 01 '17 at 06:59

1 Answer

Message invisibility solves a different problem to guaranteeing one and only one delivery. Consider a long running operation on an item in the queue. If the processor craps out during the operation, you don't want to delete the message, you want it to reappear and be handled again by a different processor.

So the pattern is...

  1. Write (push) item into queue
  2. View (peek) item in queue
  3. Mark item invisible
  4. Execute process on item
  5. Write results
  6. Delete (pop) item from queue
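
The six steps above can be sketched with a toy in-memory queue. The `VisibilityQueue` class is hypothetical and only models the semantics; a real consumer would use boto3's `receive_message` (which applies the visibility timeout automatically on receive) and `delete_message`:

```python
import time

class VisibilityQueue:
    """Toy model of SQS's visibility timeout: receiving a message hides it
    for `timeout` seconds rather than removing it. Only an explicit delete
    removes it, so a worker that crashes mid-job simply never deletes, and
    the message reappears for another worker to pick up."""

    def __init__(self, timeout=0.05):
        self.timeout = timeout
        self._messages = {}  # id -> (body, time at which it becomes visible)
        self._next_id = 0

    def send(self, body):                 # step 1: write (push) item into queue
        self._messages[self._next_id] = (body, 0.0)
        self._next_id += 1

    def receive(self):                    # steps 2-3: peek + mark invisible
        now = time.monotonic()
        for mid, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                self._messages[mid] = (body, now + self.timeout)
                return mid, body
        return None

    def delete(self, mid):                # step 6: delete (pop) item for good
        self._messages.pop(mid, None)

q = VisibilityQueue(timeout=0.05)
q.send("resize-image-42")
mid, body = q.receive()   # worker A takes the job...
# ...worker A crashes before calling q.delete(mid)
time.sleep(0.06)          # the visibility timeout elapses
mid, body = q.receive()   # worker B gets the same job back
q.delete(mid)             # steps 4-5 done; now it is gone for good
```

Note that the crash-and-reappear path is exactly the duplicate-delivery behaviour the question observes: it is the fail-safe working as designed, not a bug.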

So whether you get duplicate delivery or not, you still need to ensure that you process the item in the queue. If you delete it as you pull it off the queue, and your server then dies, you may lose that message forever. The invisibility mechanism enables aggressive scaling through the use of spot instances, and guarantees (using the above pattern) that you won't lose a message.

But it doesn't guarantee once-and-only-once delivery, and I don't think it's designed for that problem. I also don't think it's an insurmountable problem. In our case (and I can see why I've never noticed the issue before) we're writing results to S3. It's no big deal if it overwrites the same file with the same data. Of course, if it's a debit transaction going to a bank account, you'd probably want some sort of correlation ID... and most systems already have those. So if you see a duplicate correlation value, you throw an exception and move on.

Good question. Highlighted something for me.

Pete - MSFT
  • So I get that message invisibility is made to fail-safe worker crashes, not single delivery. So `message-invisibility` is more like `timeout-and-requeue`, not `prevent-the-other-worker` from picking up the same task, because the other worker might get the duplicate anyway... is my understanding correct? – inquisitive Sep 02 '13 at 10:01
  • Kind of. Definitely `timeout-and-requeue`. But I'd suggest it's also `try-to-prevent-the-other-worker`. You don't want the other worker to grab it if possible, because if it did and you had a farm, they *all* grab the next item. In most cases, it will act like `prevent-the-other-worker`, but that's not guaranteed behaviour. – Pete - MSFT Sep 02 '13 at 10:11