I'm using the RabbitMQ.Client nuget package to publish messages to rabbitmq from a .NET core 3.1 application. We are using the 5.1.0 version of the library.

We want to improve the resiliency of our application, so we are exploring the possibility to define a retry policy to be used when we send messages via the IModel.BasicPublish method. We are going to employ the Polly nuget package to define the retry policy.

The whole point of a retry policy is to retry a failed operation when the failure is deemed to be transient. What I'm trying to understand is how to identify a transient error in this context.

Based on my understanding, all the exceptions thrown by RabbitMQ.Client derive from the RabbitMQClientException custom exception. The point is that the library defines several exception types which derive from RabbitMQClientException; see here for the full list.

I didn't find any specific documentation on that, but from reading the code on GitHub it seems that the only custom exception thrown by the library when a message is published is AlreadyClosedException, which happens when the connection used to publish the message is already closed. I don't think that retrying makes sense in this case: the connection is already closed, so there is no way to overcome the error by simply retrying the operation.

So my question is: which exception types should I handle in the Polly retry policy that I want to use to execute the IModel.BasicPublish call? Put another way, which exception types thrown by IModel.BasicPublish represent transient errors?

Enrico Massone
  • Have you read [this part of the documentation](https://www.rabbitmq.com/dotnet-api-guide.html#recovery)? – Peter Csala Sep 03 '21 at 05:23
  • [In this Microsoft sample application](https://github.com/dotnet-architecture/eShopOnContainers/blob/31ab9b62b9fb02fb1c1eb7cadef285c5e6ca6731/src/BuildingBlocks/EventBus/EventBusRabbitMQ/EventBusRabbitMQ.cs#L69) they handle BrokerUnreachableException and SocketException. – Peter Csala Sep 03 '21 at 05:28
  • @PeterCsala thanks for the comments. I'm aware of the automatic connection recovery feature; unfortunately it seems that it is not enough to guarantee that a message is published. See here: https://www.rabbitmq.com/dotnet-api-guide.html#publishers – Enrico Massone Sep 03 '21 at 11:11
  • @PeterCsala basically, if a message is published while the connection is down, the library does not automatically retry it once the connection is restored by the automatic recovery feature. So it is up to the developer to account for failed message publications and handle them. – Enrico Massone Sep 03 '21 at 11:14
  • This is also pointed out in this section of the docs: https://www.rabbitmq.com/dotnet-api-guide.html#automatic-recovery-limitations – Enrico Massone Sep 03 '21 at 11:15
  • So, this problem has two sides. One is related to the underlying connection and the other one is related to the publish operation. In the latter case the broker can send `basic.nack` back to the producer, and you might need to republish the message on your own. In the former case the connection may or may not be reestablished automatically, but as you said the pending operations might be discarded, which means you need a manual retry. Which one do you want to solve? – Peter Csala Sep 03 '21 at 12:03
  • @PeterCsala my original intent was to solve the former case described in your comment (unstable underlying connection). Based on my understanding, the latter case of your comment (the basic.nack) is not triggered when the client publishes a message to the message broker. According to this documentation https://www.rabbitmq.com/nack.html the basic.nack is sent by the client to notify the message broker that it wants to reject messages delivered to it by the message broker. Am I missing anything? – Enrico Massone Sep 03 '21 at 12:12
  • @PeterCsala after reading this documentation https://www.rabbitmq.com/confirms.html#server-sent-nacks I can confirm that I'm interested in handling only the publish errors related to connection troubles. The basic.nack can be issued by the message broker too, as you pointed out, but it seems to be a corner case. Quote from the docs: "basic.nack will only be delivered if an internal error occurs in the Erlang process responsible for a queue." So, at least for the first iteration, I want to focus on the connection errors only. – Enrico Massone Sep 03 '21 at 12:27
  • In this case I would start with the following exceptions: `BrokerUnreachableException`, `ConnectFailureException` and `OperationInterruptedException`. The other exceptions do not seem to be transient. In other words, re-publishing the same message will not change the outcome (as with `ProtocolVersionMismatchException`). I would also capture all `RabbitMQClientException`s in order to analyse their frequency and distribution. – Peter Csala Sep 03 '21 at 13:23
  • I'm asking myself why the sample code you pointed me to is actually handling the SocketException too. I would have expected the client library to catch any SocketException and set it as the InnerException of a custom exception deriving from RabbitMQClientException. – Enrico Massone Sep 03 '21 at 13:28
  • Well, I have found the [related e-book section](https://docs.microsoft.com/en-us/dotnet/architecture/microservices/multi-container-microservice-net-applications/rabbitmq-event-bus-development-test-environment#implementing-a-simple-publish-method-with-rabbitmq). It says: *The actual code of the Publish method ... is improved by using a Polly retry policy, which retries the task some times in case the RabbitMQ container is not ready. This scenario can occur when docker-compose is starting the containers; for example, the RabbitMQ container might start more slowly than the other containers.* – Peter Csala Sep 03 '21 at 13:51
  • So, it seems that the code handles a race condition rather than an edge case :D I don't know why it handles `SocketException` as well. – Peter Csala Sep 03 '21 at 13:52
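
For reference, the exception list suggested in the comments could be turned into a Polly policy roughly as follows. This is only a sketch under assumptions, not a confirmed answer: `ResilientPublisher`, `CreatePublishRetryPolicy` and `Publish` are hypothetical names, the retry count and back-off are arbitrary, and it assumes automatic connection recovery is enabled so that the connection has a chance to come back between attempts. `SocketException` is included only because the eShopOnContainers sample handles it. Note that `AlreadyClosedException` derives from `OperationInterruptedException`, so this policy would retry it as well.

```csharp
using System;
using System.Net.Sockets;
using System.Text;
using Polly;
using RabbitMQ.Client;
using RabbitMQ.Client.Exceptions;

// Hypothetical helper, not part of RabbitMQ.Client or Polly.
public static class ResilientPublisher
{
    // Builds a retry policy for the exceptions proposed in the comments.
    // Retry count and exponential back-off are assumptions; tune as needed.
    public static Policy CreatePublishRetryPolicy(int retryCount = 3) =>
        Policy
            .Handle<BrokerUnreachableException>()
            .Or<ConnectFailureException>()
            .Or<OperationInterruptedException>() // also covers AlreadyClosedException
            .Or<SocketException>()
            .WaitAndRetry(
                retryCount,
                attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)),
                (exception, delay) => Console.WriteLine(
                    $"Publish failed ({exception.GetType().Name}); retrying in {delay.TotalSeconds}s"));

    public static void Publish(IConnection connection, string exchange, string routingKey, string message)
    {
        byte[] body = Encoding.UTF8.GetBytes(message);

        CreatePublishRetryPolicy().Execute(() =>
        {
            // Open a fresh channel on every attempt: a channel that was
            // closed by a connection failure cannot be reused.
            using (var channel = connection.CreateModel())
            {
                channel.BasicPublish(exchange, routingKey, null, body);
            }
        });
    }
}
```

A policy like this only covers failures that surface as exceptions on the client. A publish that the broker never receives without the client noticing would still need publisher confirms, as discussed in the linked documentation.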
