36

I have a C# application that fetches data from SQL Server hosted in a somewhat flaky environment. There is nothing I can do to address the environmental issues, so I need to handle them as gracefully as possible.

To do so, I want to retry operations that are the result of infrastructure failures such as network glitches, SQL servers going off-line because they're being rebooted, query time-outs etc. At the same time, I don't want to retry queries if they've failed for logical errors. I just want those to bubble the exception up to the client.

My question is this: what is the best way to distinguish between environmental problems (lost connections, time-outs) and other kinds of exceptions (things like logical errors that would have happened even if the environment was stable).

Is there a commonly used pattern in C# for dealing with things like this? For example, is there a property I can check on the SqlConnection object to detect failed connections? If not, what is the best way to approach this problem?

For what it is worth, my code isn't anything special:

using (SqlConnection connection = new SqlConnection(myConnectionString))
using (SqlCommand command = connection.CreateCommand())
{
  command.CommandText = mySelectCommand;
  connection.Open();

  using (SqlDataReader reader = command.ExecuteReader())
  {
    while (reader.Read())
    {
      // Do something with the returned data.
    }
  }
}
Adriano Repetti
  • 65,416
  • 20
  • 137
  • 208
Kramii
  • 8,379
  • 4
  • 32
  • 38
  • 2
    I assume you know about try/catch blocks and checking SQLExceptions? – Jack Marchetti Jun 04 '14 at 15:07
  • 1
    One thing would be to to surround in Try catch for connection establishing issues. You can also check the state of the connection via State property of your SqlConnection Object. Being that your not doing an insert there is not a need for this particular example but if you begin to do inserts and updates I would suggest also implementing SqlTransaction Clas http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqltransaction.aspx so in the event or an error you can use the rollback command to rollback changes that bombed out – Bearcat9425 Jun 04 '14 at 15:10
  • I believe there's only one actual type of SqlException available. You need to use a try-catch statement and catch that type of exception and to distinguish between errors you can check the Number property of the exception. See here: http://stackoverflow.com/questions/6221951/sqlexception-catch-and-handling – HaukurHaf Jun 04 '14 at 15:12
  • 2
    @JackMarchetti: I had anticipated using a try/catch block, but am not sure precisely what exceptions to check for, or if I should be checking something other than the exception details. – Kramii Jun 04 '14 at 16:11

6 Answers6

57

One single SqlException (may) wraps multiple SQL Server errors. You can iterate through them with Errors property. Each error is SqlError:

foreach (SqlError error in exception.Errors)

Each SqlError has a Class property you can use to roughly determine if you can retry or not (and in case you retry if you have to recreate connection too). From MSDN:

  • Class < 10 is for errors in information you passed then (probably) you can't retry if first you don't correct inputs.
  • Class from 11 to 16 are "generated by user" then probably again you can't do anything if user first doesn't correct his inputs. Please note that class 16 includes many temporary errors and class 13 is for deadlocks (thanks to EvZ) so you may exclude these classes if you handle them one by one.
  • Class from 17 to 24 are generic hardware/software errors and you may retry. When Class is 20 or higher you have to recreate connection too. 22 and 23 may be serious hardware/software errors, 24 indicates a media error (something user should be warned but you may retry in case it was just a "temporary" error).

You can find a more detailed description of each class here.

In general if you handle errors with their class you won't need to know exactly each error (using error.Number property or exception.Number which is just a shortcut for first SqlError in that list). This has the drawback that you may retry when it's not useful (or error can't be recovered). I'd suggest a two steps approach:

  • Check for known error codes (list error codes with SELECT * FROM master.sys.messages) to see what you want to handle (knowing how). That view contains messages in all supported languages so you may need to filter them by msglangid column (for example 1033 for English).
  • For everything else rely on error class, retrying when Class is 13 or higher than 16 (and reconnecting if 20 or higher).
  • Errors with severity higher than 21 (22, 23 and 24) are serious errors and little waiting won't fix that problems (database itself may also be damaged).

One word about higher classes. How to handle these errors isn't simple and it depends on many factors (including risk management for your application). As a simple first step I wouldn't retry for 22, 23, and 24 when attempting a write operation: if database, file system or media are seriously damaged then writing new data may deteriorate data integrity even more (SQL Server is extremely careful to do not compromise DB for a query even in critical circumstances). A damaged server, it depends on your DB network architecture, might even be hot-swapped (automatically, after a specified amount of time, or when a specified trigger is fired). Always consult and work close to your DBA.

Strategy for retrying depends on error you're handling: free resources, wait for a pending operation to complete, take an alternative action, etc. In general you should retry only if all errors are "retry-able":

bool rebuildConnection = true; // First try connection must be open

for (int i=0; i < MaximumNumberOfRetries; ++i) {
    try {
        // (Re)Create connection to SQL Server
        if (rebuildConnection) {
            if (connection != null)
                connection.Dispose();

            // Create connection and open it...
        }

        // Perform your task

        // No exceptions, task has been completed
        break;
    }
    catch (SqlException e) {
        if (e.Errors.Cast<SqlError>().All(x => CanRetry(x))) {
            // What to do? Handle that here, also checking Number property.
            // For Class < 20 you may simply Thread.Sleep(DelayOnError);

            rebuildConnection = e.Errors
                .Cast<SqlError>()
                .Any(x => x.Class >= 20);

            continue; 
        }

        throw;
    }
}

Wrap everything in try/finally to properly dispose connection. With this simple-fake-naive CanRetry() function:

private static readonly int[] RetriableClasses = { 13, 16, 17, 18, 19, 20, 21, 22, 24 };

private static bool CanRetry(SqlError error) {
    // Use this switch if you want to handle only well-known errors,
    // remove it if you want to always retry. A "blacklist" approach may
    // also work: return false when you're sure you can't recover from one
    // error and rely on Class for anything else.
    switch (error.Number) {
        // Handle well-known error codes, 
    }

    // Handle unknown errors with severity 21 or less. 22 or more
    // indicates a serious error that need to be manually fixed.
    // 24 indicates media errors. They're serious errors (that should
    // be also notified) but we may retry...
    return RetriableClasses.Contains(error.Class); // LINQ...
}

Some pretty tricky ways to find list of non critical errors here.

Usually I embed all this (boilerplate) code in one method (where I can hide all the dirty things done to create/dispose/recreate connection) with this signature:

public static void Try(
    Func<SqlConnection> connectionFactory,
    Action<SqlCommand> performer);

To be used like this:

Try(
    () => new SqlConnection(connectionString),
    cmd => {
             cmd.CommandText = "SELECT * FROM master.sys.messages";
             using (var reader = cmd.ExecuteReader()) {
                 // Do stuff
         }
    });

Please note that skeleton (retry on error) can be used also when you're not working with SQL Server (actually it can be used for many other operations like I/O and network related stuff so I'd suggest to write a general function and to reuse it extensively).

Adriano Repetti
  • 65,416
  • 20
  • 137
  • 208
6

You can simply SqlConnectionStringBuilder properties to sql connection retry.

var conBuilder = new SqlConnectionStringBuilder(Configuration["Database:Connection"]); conBuilder.ConnectTimeout = 90; conBuilder.ConnectRetryInterval = 15; conBuilder.ConnectRetryCount = 6;

Note:- Required .Net 4.5 or later.

Neeraj Singh
  • 151
  • 2
  • 6
4

I don't know of any standard, but here's a list of Sql-Server exceptions that I've generally regarded as retriable, with a DTC flavouring as well:

catch (SqlException sqlEx)
{
    canRetry = ((sqlEx.Number == 1205) // 1205 = Deadlock
        || (sqlEx.Number == -2) // -2 = TimeOut
        || (sqlEx.Number == 3989) // 3989 = New request is not allowed to start because it should come with valid transaction descriptor
        || (sqlEx.Number == 3965) // 3965 = The PROMOTE TRANSACTION request failed because there is no local transaction active.
        || (sqlEx.Number == 3919) // 3919 Cannot enlist in the transaction because the transaction has already been committed or rolled back
        || (sqlEx.Number == 3903)); // The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION.
}

With regards to retries, suggest that at random-ish delay is added in between retries, to reduce the chances of e.g. the same 2 transactions deadlocking again.

With some of the DTC related errors, removing the connection may be necessary (or at worst, SqlClient.SqlConnection.ClearAllPools()) - otherwise a dud connection gets returned to the pool.

StuartLC
  • 104,537
  • 17
  • 209
  • 285
2

In the spirit of keeping concerns separated, I'm picturing three logical layers in this case...

  1. The application layer, which calls a "flaky dependency handler" layer
  2. The "flaky dependency handler" layer, which calls the data access layer
  3. The data access layer, which knows nothing of flakiness

All of the logic for retrying would be in that handler layer, so as to not pollute the data access layer with logic other than communicating with the database. (As such, your data access code doesn't need to change. And wouldn't need to worry about the "flakiness" if it logically needs to change for new features.)

A pattern for re-trying could be based around catching specific exceptions in a counter loop. (The counter is just to prevent infinitely retrying.) Something like this:

public SomeReturnValue GetSomeData(someIdentifier)
{
    var tries = 0;
    while (tries < someConfiguredMaximum)
    {
        try
        {
            tries++;
            return someDataAccessObject.GetSomeData(someIdentifier);
        }
        catch (SqlException e)
        {
            someLogger.LogError(e);
            // maybe wait for some number of milliseconds?  make the method async if possible
        }
    }
    throw new CustomException("Maximum number of tries has been reached.");
}

This would loop some configured number of times, re-trying until either it works or the maximum has been reached. After that maximum number, a custom exception is thrown for the application to handle. You can further fine-tune the exception handling by inspecting the specific SqlException that's caught. Maybe based on the error message you might want to continue with the loop or throw a CustomException.

You can further refine this logic by catching other exception types, examining those, etc. The main point here is that this responsibility is kept isolated to a specific logical layer in the application, as transparent as possible to the other layers. Ideally, the handler layer and the data access layer implement the same interfaces. That way if you ever move the code to a more stable environment and don't need the handler layer anymore it would be trivial to remove it without needing any changes to the application layer.

David
  • 208,112
  • 36
  • 198
  • 279
  • The separation of concerns is nice, but your answer doesn't explain how to distinguish between different types of issues. – Kramii Jun 04 '14 at 16:07
  • @Kramii: Sure it does, that's all in the exception handling. You can catch specific exceptions, examine the messages on those exceptions, etc. All the information you have about any given issue is going to be in the exception object. If you can't distinguish based on that information, then you can't distinguish. But that information alone should get you pretty close. – David Jun 04 '14 at 16:11
2

I don't know of a real standard. You might try look at the Transient Fault Handling Application Block. It's quite robust, but may be a bit too "enterprisey" for some users. Another approach might be to use an aspect framework to trap errors. Or else good old try/catch will work.

As far as determining what to retry, you'll generally want to look at the exception. SqlException provides quite a bit of information regarding the source of your problem, but it can be painful to parse through it. I put together some code to pick them apart and attempt to determine what is retryable and what isn't. This hasn't been maintained in a while, so you should probably take it as a starting point rather than a finished product. Also, this was aimed at SQL Azure, so it may not fully apply to your situation (eg, the resource throttling is an Azure-specific feature, IIRC).

/// <summary>
/// Helps to extract useful information from SQLExceptions, particularly in SQL Azure
/// </summary>
public class SqlExceptionDetails
{
    public ResourcesThrottled SeriouslyExceededResources { get; private set; }
    public ResourcesThrottled SlightlyExceededResources { get; private set; }
    public OperationsThrottled OperationsThrottled { get; private set; }
    public IList<SqlErrorCode> Errors { get; private set; }
    public string ThrottlingMessage { get; private set; }

    public bool ShouldRetry { get; private set; }
    public bool ShouldRetryImmediately { get; private set; }

    private SqlExceptionDetails()
    {
        this.ShouldRetryImmediately = false;
        this.ShouldRetry = true;
        this.SeriouslyExceededResources = ResourcesThrottled.None;
        this.SlightlyExceededResources = ResourcesThrottled.None;
        this.OperationsThrottled = OperationsThrottled.None;
        Errors = new List<SqlErrorCode>();
    }

    public SqlExceptionDetails(SqlException exception) :this(exception.Errors.Cast<SqlError>())
    {
    }

    public SqlExceptionDetails(IEnumerable<SqlError> errors) : this()
    {
        List<ISqlError> errorWrappers = (from err in errors
                                         select new SqlErrorWrapper(err)).Cast<ISqlError>().ToList();
        this.ParseErrors(errorWrappers);
    }

    public SqlExceptionDetails(IEnumerable<ISqlError> errors) : this()
    {
        ParseErrors(errors);
    }

    private void ParseErrors(IEnumerable<ISqlError> errors)
    {
        foreach (ISqlError error in errors)
        {
            SqlErrorCode code = GetSqlErrorCodeFromInt(error.Number);
            this.Errors.Add(code);

            switch (code)
            {
                case SqlErrorCode.ServerBusy:
                    ParseServerBusyError(error);
                    break;
                case SqlErrorCode.ConnectionFailed:
                    //This is a very non-specific error, can happen for almost any reason
                    //so we can't make any conclusions from it
                    break;
                case SqlErrorCode.DatabaseUnavailable:
                    ShouldRetryImmediately = false;
                    break;
                case SqlErrorCode.EncryptionNotSupported:
                    //this error code is sometimes sent by the client when it shouldn't be
                    //Therefore we need to retry it, even though it seems this problem wouldn't fix itself
                    ShouldRetry = true;
                    ShouldRetryImmediately = true;
                    break;
                case SqlErrorCode.DatabaseWorkerThreadThrottling:
                case SqlErrorCode.ServerWorkerThreadThrottling:
                    ShouldRetry = true;
                    ShouldRetryImmediately = false;
                    break;


                //The following errors are probably not going to resolved in 10 seconds
                //They're mostly related to poor query design, broken DB configuration, or too much data
                case SqlErrorCode.ExceededDatabaseSizeQuota:
                case SqlErrorCode.TransactionRanTooLong:
                case SqlErrorCode.TooManyLocks:
                case SqlErrorCode.ExcessiveTempDBUsage:
                case SqlErrorCode.ExcessiveMemoryUsage:
                case SqlErrorCode.ExcessiveTransactionLogUsage:
                case SqlErrorCode.BlockedByFirewall:
                case SqlErrorCode.TooManyFirewallRules:
                case SqlErrorCode.CannotOpenServer:
                case SqlErrorCode.LoginFailed:
                case SqlErrorCode.FeatureNotSupported:
                case SqlErrorCode.StoredProcedureNotFound:
                case SqlErrorCode.StringOrBinaryDataWouldBeTruncated:
                    this.ShouldRetry = false;
                    break;
            }
        }

        if (this.ShouldRetry && Errors.Count == 1)
        {
            SqlErrorCode code = this.Errors[0];
            if (code == SqlErrorCode.TransientServerError)
            {
                this.ShouldRetryImmediately = true;
            }
        }

        if (IsResourceThrottled(ResourcesThrottled.Quota) ||
            IsResourceThrottled(ResourcesThrottled.Disabled))
        {
            this.ShouldRetry = false;
        }

        if (!this.ShouldRetry)
        {
            this.ShouldRetryImmediately = false;
        }

        SetThrottlingMessage();
    }

    private void SetThrottlingMessage()
    {
        if (OperationsThrottled == Sql.OperationsThrottled.None)
        {
            ThrottlingMessage = "No throttling";
        }
        else
        {
            string opsThrottled = OperationsThrottled.ToString();
            string seriousExceeded = SeriouslyExceededResources.ToString();
            string slightlyExceeded = SlightlyExceededResources.ToString();

            ThrottlingMessage = "SQL Server throttling encountered. Operations throttled: " + opsThrottled
                        + ", Resources Seriously Exceeded: " + seriousExceeded
                        + ", Resources Slightly Exceeded: " + slightlyExceeded;
        }
    }

    private bool IsResourceThrottled(ResourcesThrottled resource)
    {
        return ((this.SeriouslyExceededResources & resource) > 0 ||
                (this.SlightlyExceededResources & resource) > 0);
    }

    private SqlErrorCode GetSqlErrorCodeFromInt(int p)
    {
        switch (p)
        {
            case 40014:
            case 40054:
            case 40133:
            case 40506:
            case 40507:
            case 40508:
            case 40512:
            case 40516:
            case 40520:
            case 40521:
            case 40522:
            case 40523:
            case 40524:
            case 40525:
            case 40526:
            case 40527:
            case 40528:
            case 40606:
            case 40607:
            case 40636:
                return SqlErrorCode.FeatureNotSupported;
        }

        try
        {
            return (SqlErrorCode)p;
        }
        catch
        {
            return SqlErrorCode.Unknown;
        }
    }

    /// <summary>
    /// Parse out the reason code from a ServerBusy error. 
    /// </summary>
    /// <remarks>Basic idea extracted from http://msdn.microsoft.com/en-us/library/gg491230.aspx
    /// </remarks>
    /// <param name="error"></param>
    private void ParseServerBusyError(ISqlError error)
    {
        int idx = error.Message.LastIndexOf("Code:");
        if (idx < 0)
        {
            return;
        }

        string reasonCodeString = error.Message.Substring(idx + "Code:".Length);
        int reasonCode;
        if (!int.TryParse(reasonCodeString, out reasonCode))
        {
            return;
        }

        int opsThrottledInt = (reasonCode & 3);
        this.OperationsThrottled = (OperationsThrottled)(Math.Max((int)OperationsThrottled, opsThrottledInt));


        int slightResourcesMask = reasonCode >> 8;
        int seriousResourcesMask = reasonCode >> 16;
        foreach (ResourcesThrottled resourceType in Enum.GetValues(typeof(ResourcesThrottled)))
        {
            if ((seriousResourcesMask & (int)resourceType) > 0)
            {
                this.SeriouslyExceededResources |= resourceType;
            }
            if ((slightResourcesMask & (int)resourceType) > 0)
            {
                this.SlightlyExceededResources |= resourceType;
            }
        }
    }
}

public interface ISqlError
{
    int Number { get; }
    string Message { get; }
}

public class SqlErrorWrapper : ISqlError
{
    public SqlErrorWrapper(SqlError error)
    {
        this.Number = error.Number;
        this.Message = error.Message;
    }

    public SqlErrorWrapper()
    {
    }

    public int Number { get; set; }
    public string Message { get; set; }
}

/// <summary>
/// Documents some of the ErrorCodes from SQL/SQL Azure. 
/// I have not included all possible errors, only the ones I thought useful for modifying runtime behaviors
/// </summary>
/// <remarks>
/// Comments come from: http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx
/// </remarks>
public enum SqlErrorCode : int
{
    /// <summary>
    /// We don't recognize the error code returned
    /// </summary>
    Unknown = 0,

    /// <summary>
    /// A SQL feature/function used in the query is not supported. You must fix the query before it will work.
    /// This is a rollup of many more-specific SQL errors
    /// </summary>
    FeatureNotSupported = 1,

    /// <summary>
    /// Probable cause is server maintenance/upgrade. Retry connection immediately.
    /// </summary>
    TransientServerError = 40197,

    /// <summary>
    /// The server is throttling one or more resources. Reasons may be available from other properties
    /// </summary>
    ServerBusy = 40501,

    /// <summary>
    /// You have reached the per-database cap on worker threads. Investigate long running transactions and reduce server load. 
    /// http://social.technet.microsoft.com/wiki/contents/articles/1541.windows-azure-sql-database-connection-management.aspx#Throttling_Limits
    /// </summary>
    DatabaseWorkerThreadThrottling = 10928,

    /// <summary>
    /// The per-server worker thread cap has been reached. This may be partially due to load from other databases in a shared hosting environment (eg, SQL Azure).
    /// You may be able to alleviate the problem by reducing long running transactions.
    /// http://social.technet.microsoft.com/wiki/contents/articles/1541.windows-azure-sql-database-connection-management.aspx#Throttling_Limits
    /// </summary>
    ServerWorkerThreadThrottling = 10929,

    ExcessiveMemoryUsage = 40553,

    BlockedByFirewall = 40615,

    /// <summary>
    /// The database has reached the maximum size configured in SQL Azure
    /// </summary>
    ExceededDatabaseSizeQuota = 40544,

    /// <summary>
    /// A transaction ran for too long. This timeout seems to be 24 hours.
    /// </summary>
    /// <remarks>
    /// 24 hour limit taken from http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx
    /// </remarks>
    TransactionRanTooLong = 40549,

    TooManyLocks = 40550,

    ExcessiveTempDBUsage = 40551,

    ExcessiveTransactionLogUsage = 40552,

    DatabaseUnavailable = 40613,

    CannotOpenServer = 40532,

    /// <summary>
    /// SQL Azure databases can have at most 128 firewall rules defined
    /// </summary>
    TooManyFirewallRules = 40611,

    /// <summary>
    /// Theoretically means the DB doesn't support encryption. However, this can be indicated incorrectly due to an error in the client library. 
    /// Therefore, even though this seems like an error that won't fix itself, it's actually a retryable error.
    /// </summary>
    /// <remarks>
    /// http://social.msdn.microsoft.com/Forums/en/ssdsgetstarted/thread/e7cbe094-5b55-4b4a-8975-162d899f1d52
    /// </remarks>
    EncryptionNotSupported = 20,

    /// <summary>
    /// User failed to connect to the database. This is probably not recoverable.
    /// </summary>
    /// <remarks>
    /// Some good info on more-specific debugging: http://blogs.msdn.com/b/sql_protocols/archive/2006/02/21/536201.aspx
    /// </remarks>
    LoginFailed = 18456,

    /// <summary>
    /// Failed to connect to the database. Could be due to configuration issues, network issues, bad login... hard to tell
    /// </summary>
    ConnectionFailed = 4060,

    /// <summary>
    /// Client tried to call a stored procedure that doesn't exist
    /// </summary>
    StoredProcedureNotFound = 2812,

    /// <summary>
    /// The data supplied is too large for the column
    /// </summary>
    StringOrBinaryDataWouldBeTruncated = 8152
}
Brian Reischl
  • 7,216
  • 2
  • 35
  • 46
0

Refer this docs: with all custom implementation handling most occurring issues.

// Define the retry logic parameters
var options = new SqlRetryLogicOption()
{
    // Tries 5 times before throwing an exception
    NumberOfTries = 5,
    // Preferred gap time to delay before retry
    DeltaTime = TimeSpan.FromSeconds(1),
    // Maximum gap time for each delay time before retry
    MaxTimeInterval = TimeSpan.FromSeconds(20),
    // SqlException retriable error numbers
    TransientErrors = new int[] { 4060, 1024, 1025}
};



// Create a custom retry logic provider
SqlRetryLogicBaseProvider provider = CustomRetry.CreateCustomProvider(options);


    // Assumes that connection is a valid SqlConnection object 
// Set the retry logic provider on the connection instance
connection.RetryLogicProvider = provider;
// Establishing the connection will trigger retry if one of the given transient failure occurs.
connection.Open();

MS DOCS.

Sanjeevi Subramani
  • 501
  • 1
  • 5
  • 16