92

I'm using Amazon SQS queues in a very simple way. Usually, messages are written, become visible immediately, and are read. Occasionally, a message is written and remains In-Flight (Not Visible) on the queue for several minutes. I can see it from the console. Receive Message Wait Time is 0, and Default Visibility Timeout is 5 seconds. The message stays that way for several minutes, or until a new message gets written that somehow releases it. A delay of a few seconds is OK, but more than 60 seconds is not.

There are 8 reader threads that are always long polling, so it's not that nothing is trying to read it; they are.

Edit: To be clear, none of the consumer reads are returning any messages at all, and it happens regardless of whether the console is open. In this scenario, only one message is involved, and it is just sitting in the queue, invisible to the consumers.

Has anyone else seen this behavior, and what can I do to improve it?

Here is the SDK for Java I am using:

<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>aws-java-sdk</artifactId>
  <version>1.5.2</version>
</dependency>     

Here is the code that does the reading (startup config: max=10, maxWait=0):

void read(MessageConsumer consumer) {

  List<Message> messages = read(max, maxWait);

  for (Message message : messages) {
    if (tryConsume(consumer, message)) {
      delete(message.getReceiptHandle());
    }
  }
}

private List<Message> read(int max, int maxWait) {

  AmazonSQS sqs = getClient();
  ReceiveMessageRequest rq = new ReceiveMessageRequest(queueUrl);
  rq.setMaxNumberOfMessages(max);
  rq.setWaitTimeSeconds(maxWait);
  List<Message> messages = sqs.receiveMessage(rq).getMessages();

  if (messages.size() > 0) {
    LOG.info("read {} messages from SQS queue",messages.size());
  }

  return messages;
}

The "read .." log line never appears when this is happening, and that is what prompts me to go into the console and check whether the message is there. It is.

Jerico Sandhorn
  • I faced the same issue. See if this helps: http://stackoverflow.com/questions/18264586/sqs-message-always-stays-inflight – vijay Nov 08 '13 at 01:13
  • Can you add more information? For instance, are you using the standard AWS SDK, and in what language? Can you show us the code you are using to deal with the messages? – tster Nov 08 '13 at 22:56
  • @tster - thank you, I have updated the question with more detail – Jerico Sandhorn Nov 09 '13 at 13:06
  • We are facing the same issue now, using AWSSDK for .Net and JustSaying. We're still not sure what the root cause is, but the symptoms are identical. Will update this post once I have more details. – Darius May 19 '15 at 17:57
  • Experiencing a similar issue. When using SQS with Lambda, if the Lambda throws a runtime exception, the message remains in flight for 5 minutes. After that long 5-minute wait, it goes to the dead letter queue. – Judy007 Jul 21 '20 at 21:32
  • @Judy007 still facing this problem, have you found any solutions yet? – Dipanshu Mahla Feb 03 '23 at 11:00

2 Answers

138

It sounds like you are misinterpreting what you are seeing.

Messages "in flight" are not pending delivery, they're messages that have already been delivered but not further acted on by the consumer.

Messages are considered to be in flight if they have been sent to a client but have not yet been deleted or have not yet reached the end of their visibility window.

https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html

When a consumer receives a message, it has to, at some point, either delete the message or send a request to increase the timeout for that message. If it does neither, the message automatically becomes visible again once the timeout expires. The visibility timeout is how long the consumer has before one of these things must be done.
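As a rough illustration of that rule, here is a toy, in-memory model of a single message's visibility state. It is only a sketch and not the AWS SDK: `ToyMessage`, its methods, and the hand-supplied millisecond clock are all invented for this example, and it assumes a changed timeout is counted from the moment of the change.

```java
/** Toy model of SQS visibility semantics for one message (hypothetical, not the AWS SDK). */
class VisibilityTimeoutDemo {

    static final class ToyMessage {
        private final long visibilityTimeoutMs;
        private long invisibleUntilMs = 0; // in flight while now < this value
        private boolean deleted = false;

        ToyMessage(long visibilityTimeoutMs) {
            this.visibilityTimeoutMs = visibilityTimeoutMs;
        }

        /** A receive succeeds only if the message is visible; it then goes in flight. */
        boolean receive(long nowMs) {
            if (deleted || nowMs < invisibleUntilMs) {
                return false;
            }
            invisibleUntilMs = nowMs + visibilityTimeoutMs;
            return true;
        }

        /** Resets the in-flight window, counted from now; ignored if not in flight. */
        void changeVisibility(long nowMs, long timeoutMs) {
            if (isInFlight(nowMs)) {
                invisibleUntilMs = nowMs + timeoutMs;
            }
        }

        /** Deleting removes the message for good, so it is never redelivered. */
        void delete() {
            deleted = true;
        }

        boolean isInFlight(long nowMs) {
            return !deleted && nowMs < invisibleUntilMs;
        }
    }

    public static void main(String[] args) {
        ToyMessage m = new ToyMessage(5_000);      // 5-second visibility timeout
        System.out.println(m.receive(0));          // true: delivered, now in flight
        System.out.println(m.receive(3_000));      // false: still invisible
        System.out.println(m.receive(5_000));      // true: timeout expired, redelivered
        m.delete();
        System.out.println(m.receive(60_000));     // false: deleted messages never return
    }
}
```

The second successful `receive` is the case that matters here: the consumer neither deleted the message nor extended its timeout, so it was handed out again.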

Messages should not be "in flight" without something having already received them -- but that "something" can include the console itself, as you'll note on the pop-up you see when you choose "View/Delete Messages" in the console (unless you already checked the "Don't show this again" checkbox):

Messages displayed in the console will not be available to other applications until the console stops polling for messages.

Messages displayed in the console are "in flight" while the console is observing the queue from the "View/Delete Messages" screen.

The part that does not make obvious sense is messages being in flight "for several minutes" when your default visibility timeout is only 5 seconds and nothing in your code is increasing it. That could, however, be explained almost perfectly by your consumers not properly disposing of the message: it times out and is immediately redelivered, giving the impression that a single delivery of the message is remaining in flight when, in fact, the message briefly transitions back to visible, only to be claimed almost immediately by another consumer and taken back in flight.
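To make that redelivery loop concrete, here is a small simulation in whole seconds. It is a sketch under simplifying assumptions, not a reproduction of your setup: `simulate` and its parameters are hypothetical names, there is a single message, and some consumer is assumed to claim the message in the very second it becomes visible.

```java
import java.util.Arrays;

/** Simulates one message being re-claimed every time its visibility timeout expires. */
class RedeliveryLoopDemo {

    /** Returns {receives, deletes, secondsInFlight} for the simulated run. */
    static int[] simulate(int runSeconds, int visibilityTimeoutSeconds, boolean consumerDeletes) {
        int receives = 0;
        int deletes = 0;
        int secondsInFlight = 0;
        int invisibleUntil = 0;   // message is in flight while now < invisibleUntil
        boolean deleted = false;

        for (int now = 0; now < runSeconds && !deleted; now++) {
            if (now >= invisibleUntil) {
                // Visible again: assume a polling consumer claims it immediately.
                receives++;
                invisibleUntil = now + visibilityTimeoutSeconds;
                if (consumerDeletes) {
                    deletes++;
                    deleted = true;
                    continue;     // message is gone; nothing left in flight
                }
            }
            secondsInFlight++;
        }
        return new int[] {receives, deletes, secondsInFlight};
    }

    public static void main(String[] args) {
        // Consumer never deletes: 3 minutes with a 5-second timeout.
        System.out.println(Arrays.toString(simulate(180, 5, false))); // [36, 0, 180]
        // Consumer deletes on first receipt.
        System.out.println(Arrays.toString(simulate(180, 5, true)));  // [1, 1, 0]
    }
}
```

In the first run the message looks permanently in flight for three minutes, yet it was actually received 36 times and deleted zero times, which is the kind of mismatch between receives and deletes that points at the consumers.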

andyb
Michael - sqlbot
  • sqlbot - Thanks, but I think you misunderstood what is happening. The "When a consumer receives a message" doesn't apply here because none of the consumers read the message in the first place, including the console. It is written to the queue, but no readers read it. See my edit. – Jerico Sandhorn Nov 06 '13 at 12:55
  • Based on your description of the problem, my conclusion was that you were inadvertently causing this with the console, or you have a consumer listening to the queue that you aren't aware of, or your code that interfaces to SQS is actually getting occasional messages and telling your app nothing was received due to a bug, because the behavior you describe should not otherwise happen, given the definition of "in flight." – Michael - sqlbot Nov 06 '13 at 13:06
  • @JericoSandhorn you commented "it sounds like this is something I'll have to live with" but that's not right -- I've never seen this with SQS. I was thinking about ways you could investigate this and I came up with something potentially interesting -- in CloudWatch, select both the graph for "NumberOfMessagesReceived" and "NumberOfMessagesDeleted". You *should* find that one graph perfectly overlays and completely masks the other; if to some extent they don't, it strongly suggests a problem in the library that you are using or in your consumers, which would cause the symptoms you observe. – Michael - sqlbot Nov 09 '13 at 01:40
  • @sqlbot - it's a good idea, but they all eventually get deleted because the app eventually reads and deletes them. It's the long delays that are the issue, rather than the message never being read at all. – Jerico Sandhorn Nov 09 '13 at 13:11
  • Yes, of course... so you are perhaps missing the point of the exercise: A single message can be counted as "deleted" only once, but can be counted as "received" more than once, but that is true only if your consumers aren't doing what you think they're doing. If your counters do not match, the problem is related to something you are doing. – Michael - sqlbot Nov 09 '13 at 13:22
  • I agree. If you look at the code, you will see there is intentional behavior to induce redelivery if the application indicates it cannot consume the message now but thinks it can later. That's part of the testing, and I see that expected drift between the lines in the metrics. Still, I can use your idea to better track down what is happening, and I do tend to agree this smacks of a consumer not on my radar. To this end I'm going to mark this answer as the right one. Thank you for all the help. – Jerico Sandhorn Nov 10 '13 at 13:08
  • So forgive me for being slow, but how exactly does this translate to a solution? I'm really confused about what to do to fix the problem. – Jason Swett Sep 01 '15 at 14:30
  • @JasonSwett the solution here addresses the fact that OP did not fully understand what "in flight" messages were. They are messages you have already received. They *should only be* messages you are currently in the middle of processing. If you are seeing this, you either have more consumers running than you realize, or a bug in your code where you are failing to delete processed messages, or requeue them. Unexpected messages "in flight" essentially means your code is "misplacing" messages it has received, somewhere, somehow, failing to act on them after processing. – Michael - sqlbot Sep 01 '15 at 17:37
  • Thanks, that clarifies it for me. – Jason Swett Sep 02 '15 at 13:33
  • @Michael-sqlbot, the issue I am facing is that even after the message is picked up from the queue and properly consumed (on receiving the message I submit a new step to EMR, which works perfectly fine), the message goes in flight. It goes back and forth between in flight and available. As a result, the message is delivered from the queue again and again. I am not sure whether the message can be acknowledged to SQS from the EMR step. It was not happening earlier; it started a few days ago, and I am not sure which property to alter. – Shubham Pandey Dec 05 '18 at 19:18
  • @ShubhamPandey I am not familiar with EMR so I don't understand how it interacts with SQS. Possibly related: https://stackoverflow.com/q/38099650/1695906 – Michael - sqlbot Dec 05 '18 at 20:32
  • @Michael-sqlbot, don't worry about EMR. Just assume that the SQS listener successfully receives the message, and even after that the message goes back to in-flight mode again. – Shubham Pandey Dec 06 '18 at 04:41
  • @ShubhamPandey I am also facing the same issue. My message has been in in-flight mode for a long time. How can I delete a message that is in in-flight mode? – Suraj Dalvi Oct 18 '19 at 07:20
  • @SurajDalvi, as per my understanding you cannot delete in-flight messages. These are messages that were received by the listener/application and whose processing is in progress. You can reduce the "Default Visibility Timeout" (an SQS configuration setting). By default it's 30 seconds, which means that after 30 seconds the message is taken out of in-flight mode and becomes available on the queue again so that the application can consume and reprocess it. – Shubham Pandey Oct 18 '19 at 11:15
1

This may happen when you send or lock a message and then, within a few seconds, try to fetch a fresh list of messages. Amazon SQS stores the data on multiple servers and in multiple data centers: http://aws.amazon.com/sqs/faqs/#How_reliably_is_my_data_stored_in_Amazon_SQS.

To avoid this, wait longer before reading, so that the queue has more time to return consistent results.

Satish Pandey
  • Satish - thank you, this sounds like a viable reason. But waiting longer doesn't really get rid of the issue; that's what is happening by default now. It sounds like this is something I have to live with when using SQS, as opposed to, say, AMQ, which never has these sorts of delays. – Jerico Sandhorn Nov 06 '13 at 13:00