
Please would you suggest how to handle consumer errors in an Azure Service Bus subscription set up to ensure FIFO processing using session IDs? (See https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-sessions#first-in-first-out-fifo-pattern)

As an example imagine a customer management system posting messages that are consumed by an accounting system. The messages all have the session ID as the AccountID owning the entities so that receipt from the bus is in FIFO order in the scope of each AccountID.

Imagine this message scenario:

  • T1 - CreateAccount 1234
  • T2 - AddCustomer 5678 to Account 1234
  • T3 - RaiseInvoice for Customer 5678 on Account 1234

If the consumer of the messages has the session lock on AccountID=1234, takes a PeekLock on the queue at T2 for the AddCustomer message and then suffers a transient failure of the accounting system, they are not able to add Customer 5678. What should the consumer do?

If they dead-letter the AddCustomer message, they can't go on to process the RaiseInvoice message since that will fail as the Customer 5678 doesn't exist in the accounting system.

If they abandon the AddCustomer message, are they then going to spin round a loop of AddCustomer->fail->abandon->AddCustomer until the max delivery count is reached and the message dead-letters?

What should the consumer do here to safely respond to the issue?

See https://stackoverflow.com/a/53449282/491752 for confirmation of how the bus behaves. My question is given knowledge of this problem, what should the consumer do?

Dave Potts

1 Answer


If it's a transient failure then you have two options. One would be to catch the exception yourself and retry the processing. This is what frameworks like Azure Functions, MassTransit, and NServiceBus do: they catch your exception and then call you again with the same message. Very short-lived failure conditions might recover in that time.

The next option is to abandon the message purposely. This puts it back on the queue and it will be redelivered, which increases the delivery count each time. The hope is that the transient failure resolves before the message reaches the max delivery count. If not, it will be dead-lettered, and that's not ideal.

So what you could also do is tear down the whole consumer when a message processing error occurs. This would enable the session to be reallocated to another consumer and the redelivery would go to them; hopefully they would not have the error.

Basically, you need to retry and/or wait in some way until the transient condition passes. You could put exponential backoffs between your retries (the new client libraries should extend your lock automatically here), or add delays before you tear down a consumer.
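
As an illustration of the in-process retry with exponential backoff described above, here is a minimal sketch in plain Python. It does not use the Service Bus SDK; `TransientError`, `process_with_backoff`, and the `handler` callable are hypothetical names introduced for this example, and the delays are assumptions you would tune so that the total wait stays within whatever your message and session lock renewal allows.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def process_with_backoff(handler, message, max_attempts=5, base_delay=0.5,
                         max_delay=30.0, sleep=time.sleep):
    """Call handler(message), retrying transient failures with backoff.

    Returns True on success, or False when every attempt failed and the
    caller should fall back to abandoning the message or tearing down
    the consumer.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return True
        except TransientError:
            if attempt == max_attempts:
                return False
            # Double the delay ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the ceiling so that
            # several consumers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    return False
```

A consumer would call `process_with_backoff(add_customer, message)` while holding the lock, complete the message when it returns True, and only abandon it (or tear itself down) when it returns False, so the delivery count is not burned on failures that clear within a few seconds.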

If, when you say transient error, you mean something that lasts an hour or more, you might need to monitor for errors and pause entire parts of the system (disable all consumers of a queue) until you've restored whatever is broken.
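
One way to picture "monitor for errors and pause" is a small circuit-breaker-style switch that the receive loop consults before asking for more messages. The sketch below is plain Python under stated assumptions: `ConsumerPauseSwitch`, its threshold, and its cool-down are hypothetical, and in a real deployment the pause might instead be an operational action such as stopping the consumer service.

```python
import time


class ConsumerPauseSwitch:
    """Trips after repeated consecutive failures and stays open for a cool-down."""

    def __init__(self, failure_threshold=3, pause_seconds=3600.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.pause_seconds = pause_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._paused_until = None

    def record_success(self):
        # Any success means the downstream system is reachable again.
        self._consecutive_failures = 0

    def record_failure(self):
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # Too many failures in a row: stop receiving for a while.
            self._paused_until = self._clock() + self.pause_seconds

    def should_receive(self):
        if self._paused_until is None:
            return True
        if self._clock() >= self._paused_until:
            # Cool-down over: allow receiving again and start counting afresh.
            self._paused_until = None
            self._consecutive_failures = 0
            return True
        return False
```

The receive loop would check `should_receive()` before accepting a session or pulling the next message, and call `record_failure()` or `record_success()` after each processing attempt. Messages that are never received stay on the queue in order and their delivery count does not increase while the switch is open.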

This failure modeling is the meat of the challenge of building reliable systems. It's also sort of the fun.

CiaranODonnell
  • Thanks @CiaranODonnell. If we wanted to stop consuming all messages for a particular session ID (AccountID in my example above), is there a way to tell the Service Bus to do that? E.g. suppose an account create is blocked in the consuming system for a business reason that takes a few hours to resolve (e.g. credit check), can we halt the messages in the queue/subscription so that they are subsequently picked up, in message sent order, when we restart? – Dave Potts May 26 '22 at 15:35
  • You could reschedule them for delivery to the same topic/queue and acknowledge them. That would leave it to Service Bus to do the delay. – CiaranODonnell May 28 '22 at 20:18