3

I am utilizing the NuGet package Polly to implement retry logic that traps Failover SQL Exceptions. I have a SQL Server Always On High Availability set up in Azure.

Rather than trap all SQL Exceptions (which would be incorrect), I want to trap specific SQL Exceptions that happen when a failover occurs.

From SSMS, I show dashboard and then I am able to trigger failovers artificially to test my code. Initially, I was letting all exceptions bubble over (so no trapping). Then I would queue a failover and look at my log to see what SQL Exception was raised. Subsequently, I became able to trap all the SQL Exceptions that were happening due to a failover.

My question is, is that a comprehensive list? Do other people who implemented retry logic on a SQL Server Failover trap any other SQL Exceptions?

I’ve tried pretty close to 100 failovers with the logic and nothing bubbled over. That certainly does not mean I have captured the entirety of Failover SQL Exceptions.

Retry logic has Policy.Handle<SqlException>(se => IsFailoverSqlException(se)). Depending on where I am in the code when a failover is queued, I am seeing the three SQL Exceptions that I trap below.

    private static bool IsFailoverSqlException(SqlException se)
    {
        return (
                /*
                A network-related or instance-specific error occurred while establishing a connection to SQL Server.
                The server was not found or was not accessible.
                Verify that the instance name is correct and that SQL Server is configured to allow remote connections.
                */
                (se.Class == 20 && se.State == 0 && se.Number == 53) ||
                /*
                Failed while logging Fatal to MT Database: Unable to access availability database 'MedicusMT' because the database replica is not in the PRIMARY or SECONDARY role.
                Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role.
                Try the operation again later.
                */
                (se.Class == 14 && se.State == 1 && se.Number == 983) ||
                // A transport-level error has occurred when receiving results from the server.  (provider: Session Provider, error: 19 - Physical connection is not usable)
                (se.Class == 20 && se.State == 0 && se.Number == -1)
            );
    }
Peter Csala
  • 17,736
  • 16
  • 35
  • 75
Bert
  • 31
  • 2
  • 1
    Microsoft has done some work on catching 'Transient Errors' here: https://github.com/Azure/elastic-db-tools/blob/master/Src/ElasticScale.Client/ElasticScale.Common/TransientFaultHandling/Implementation/SqlDatabaseTransientErrorDetectionStrategy.cs The function IsTransient() can be used in your retry logic... – HansLindgren Apr 25 '22 at 18:09

2 Answers2

0

The behaviour changed when using NET 4.5 and above. SqlException Number may be a Win32 API number What I found was error number depends on the network library used in the SQL Server connection.

The values you have for a named pipe library are good. When communicating to an Azure library there a transient error to handle. Not to repeat a good analysis on the Azure cases. The TCP network library on Windows host will have a Win32 inner exception 1225. Error Number:1225,State:0,Class:20

What I do not like about the SqlException Net45 change is the SqlException.Number is a mixed domain of values. It can be SQL Server Error or Azure Transient Error or Win32 API from underlying the network library.

If you are apply the policy at the Unit of work level consider retry for Dead lock victum SQL Server Error 1205

If you do live upgrades you may find DBA will kill the PSID Error Number:596,State:1,Class:21

user2205317
  • 109
  • 2
-1

Why don't you simplify the logic and just check for the connection state and if needed create a new connection: conn != null || conn.State != ConnectionState.Open

Akin to this article: https://technet.microsoft.com/en-us/library/cc917713.aspx#ECAA

  • The connection state will not tell you anything about the case 2 with "Unable to access..." - this error will only happen when you do an operation towards a specific database which is no longer primary on the sql instance. – Stephan Møller Jun 07 '19 at 08:10