I'm analyzing a recent load event on my SQS consumer service and am stuck on some SQS CloudWatch metrics that don't make sense to me. Essentially, it looks like the queue was filling up with messages that aren't accounted for in the metrics. Let me start by summarizing the data for a selected 5-minute period:
- ApproximateNumberOfMessagesVisible: 215,686 -> 233,605 (gain of 17,919 for this period)
- ApproximateNumberOfMessagesNotVisible: 2,239 -> 2,129 (loss of 110 for this period)
- NumberOfMessagesSent: 31,441
- NumberOfMessagesDeleted: 24,665
What baffles me is that ApproximateNumberOfMessagesVisible gained 17,919 messages, roughly 2.6 times the number of messages that went unprocessed (NumberOfMessagesSent - NumberOfMessagesDeleted = 6,776).
I've included the number of invisible messages as well (just in case a bunch of invisible messages had suddenly become visible), but that doesn't seem to be the case: that count only dropped by 110.
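To make the discrepancy explicit, here is a rough sketch of the arithmetic I'm doing. It assumes the Sent and Deleted figures are totals for the 5-minute period and that the two Approximate* figures are the values at the start and end of it; the variable names are just mine for illustration.

```python
# Figures copied from the metrics listed above.
visible_start, visible_end = 215_686, 233_605
not_visible_start, not_visible_end = 2_239, 2_129
sent = 31_441      # NumberOfMessagesSent (assumed: total for the period)
deleted = 24_665   # NumberOfMessagesDeleted (assumed: total for the period)

# Observed change in the queue, counting visible and invisible messages.
visible_gain = visible_end - visible_start
not_visible_change = not_visible_end - not_visible_start
backlog_change = visible_gain + not_visible_change

# Change I would expect if every message sent was either deleted or still queued.
expected_change = sent - deleted

# Growth that the sent/deleted counts do not explain.
unexplained = backlog_change - expected_change

print(f"visible gain:       {visible_gain}")
print(f"not-visible change: {not_visible_change}")
print(f"backlog change:     {backlog_change}")
print(f"sent - deleted:     {expected_change}")
print(f"unexplained:        {unexplained}")
```

By this accounting the queue grew by 17,809 messages while sent minus deleted only explains 6,776, leaving about 11,033 messages unaccounted for.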
How could this be possible?