6

I have developed many .NET / SQL Server applications but I'm suffering from SQL query timeouts that I can't get to the bottom of. I have lots of experience in this area of finding the offending queries and re-indexing / re-writing them. My web app is hosted on AWS using RDS for SQL Server and EC2 for the Web App. We have 100-200 unique users per day and the database is around 15GB with a couple of tables > 1GB.

I see exceptions throughout the day with the message:

'Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.'

The queries that suffer from timeouts are as random as the time the timeouts occur. It doesn't seem to coincide with anything obvious (backups run overnight etc).

I have tried taking each query from the C# app and running it directly in SQL (with the same SET options like Arith Abort) and they all run just fine. Some are slower queries by nature but the slowest one runs in about 2 seconds and has ~400k logical reads. However, I also see queries timeout that run in 15ms and have < 10 logical reads.

The most odd thing I've seen is I've taken a query from the web app and coded it up into a console app which has been running for 24 hours, calling the query once per second. It has not had a single exception / timeout even though I've seen the main system have timeouts for the same query during the time it's been running.

I have recently upgraded the RDS server to an M5 Large and all indexes are rebuilt overnight every day. I have run DBCC FREEPROCCACHE at some point to ensure there are no stale query plans causing the problem.

I feel it's parameter sniffing or my last thought is hardware / network glitches but that really clutching at straws!

The stack trace I get looks like it's mid-query and not during the connection phase.

at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)  
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)  
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)  
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()  
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()  
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()  
   at System.Data.SqlClient.TdsParserStateObject.TryReadByteArray(Byte[] buff, Int32 offset, Int32 len, Int32& totalRead)  
   at System.Data.SqlClient.TdsParserStateObject.TryReadString(Int32 length, String& value)  
   at System.Data.SqlClient.TdsParser.TryReadSqlStringValue(SqlBuffer value, Byte type, Int32 length, Encoding encoding, Boolean isPlp, TdsParserStateObject stateObj)  
   at System.Data.SqlClient.TdsParser.TryReadSqlValue(SqlBuffer value, SqlMetaDataPriv md, Int32 length, TdsParserStateObject stateObj, SqlCommandColumnEncryptionSetting columnEncryptionOverride, String columnName)  
   at System.Data.SqlClient.SqlDataReader.TryReadColumnInternal(Int32 i, Boolean readHeaderOnly)  
   at System.Data.SqlClient.SqlDataReader.TryReadColumn(Int32 i, Boolean setTimeout, Boolean allowPartiallyReadColumn)  
   at System.Data.SqlClient.SqlDataReader.GetValueInternal(Int32 i)  
   at System.Data.SqlClient.SqlDataReader.GetValue(Int32 i) 

Any help with some techniques to get to the bottom of this would be much appreciated as it's unsettling and I fear it's suddenly going to get a lot worse.

Thanks

EDIT 1

I have tried to create the same problem locally by running the test app (as above) once every 10ms and running a slow blocking transaction in SSMS at the same time.

Query From App

SELECT TOP 10 *
FROM MyTable
WHERE LastModifiedBy = 'Stu'

Query in SSMS

BEGIN TRAN
UPDATE TOP (10000) MyTable SET LastModifiedBy = 'Me' where LastModifiedBy = 'Me'
WAITFOR DELAY '00:00:35'
COMMIT

When this errors I see what I'd usually expect to see in SQL Profiler where the app query takes exactly 30000ms and I get an exception in the app. However, the useful output from this is the stack trace is different from the one I see in production (above).

at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) 
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) 
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) 
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) 
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData() 
   at System.Data.SqlClient.SqlDataReader.get_MetaData() 
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted) 
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest) 
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry) 
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method) 
   at System.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method) 
   at System.Data.SqlClient.SqlCommand.ExecuteDbDataReader(CommandBehavior behavior) 
   at System.Data.Common.DbCommand.System.Data.IDbCommand.ExecuteReader(CommandBehavior behavior) 

I'm reading this stack trace as the query never started to execute since it's still trying to read meta-data for the query. However, this contrasts with the stack trace from production that (to my eyes) appears to be in the middle of reading data from columns but has a timeout mid execution.

I've also been reading about .NET 4.6.2 which is the version we're using. I'll upgrade everything to 4.7.2 this evening to rule that out. (Connection to remote SQL server breaks when upgrading web server to .net framework 4.6.1)

Stu Price
  • 163
  • 12
  • Perhaps long-term blocking? – Dan Guzman Mar 28 '19 at 11:39
  • 1
    Is it only queries or are there updates happening on the DB? – LoztInSpace Mar 28 '19 at 11:40
  • Just queries, updates to the database happen out of normal working hours. @DanGuzman, can you explain 'long term blocking'? – Stu Price Mar 28 '19 at 12:28
  • @StuPrice, I mean blocking for a duration that exceeds the `CommandTimeout` (default 30 seconds). You could set the [`blocked process threshold` config option](https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/blocked-process-threshold-server-configuration-option) to a value like 25 seconds so that episodes are recorded in the SQL Server error log and perhaps create alerts. – Dan Guzman Mar 28 '19 at 12:47
  • looks odd, hmmmm. So, whats the difference between the Console App and the Web App? – Jodrell Apr 03 '19 at 09:35
  • @DanGuzman I've tried your suggestion but RDS doesn't allow me to run sp_configure. I've tried using EXECUTE AS but it still complained. I have looked into the query plan cache and used Brent Ozar's First Responder kit to help which has highlighted plenty of issues I've fixed but none have fixed the timeout problem. I'm feeling it is either a network glitch mid-query or it's a good old fashioned locking problem but I just can't see it. One final odd thing is I don't see these queries in Profiler anywhere. Usually you see them with almost exactly 30,000 ms in the duration column. – Stu Price Apr 03 '19 at 09:40
  • @StuPrice, are you profiling starting queries as well as completed queries? Can you use Extended Events or Query Store? – Jodrell Apr 03 '19 at 09:43
  • @Jodrell I've tried to make the two apps exactly the same, running the same queries etc. Actually, the version of Dapper is older in the Web App but it doesn't seem to be at fault. I should add that all of the queries as async awaits with .ConfigureAwait(false) so they can resume on any thread. I haven't traced statement / batch starts as I'd get a tsumani of trace that would be very difficult to dig through. – Stu Price Apr 03 '19 at 11:18
  • I'm also tracing all of the problematic Lock errors (Deadlock, Deadlock Chain, Escalation, Timeout, Timeout (timeout > 0)) and none have been reported in the last hour whereas I've had around 10 timeouts in the same period. – Stu Price Apr 03 '19 at 11:23

1 Answers1

0

After a week of stressful investigation it's fixed!! It's been running now for over 2 hours without a single timeout :-)

Turned out to be some kind of bug or mismatch with .NET v4.6.2.

My Configuration was:

  • SQL Server 2017 Web Edition on AWS RDS
  • .NET v4.6.2
  • Dapper v1.50.5

My Changes are:

  • Install .NET 4.7.2 on Web Server
  • Upgrade Web App and all DLL projects in Visual Studio to use .NET 4.7.2 (ensuring the web.config was updated to <httpRuntime targetFramework="4.7.2" />)
  • Upgrade Dapper via Nuget to the latest v.1.60.0 (I don't think Dapper was at fault, I just upgraded it while doing everything else as it's database related)

These questions helped point me in this direction:

THANK YOU INTERNET - HOW ON EARTH DID I CODE BEFORE YOU CAME ALONG

Stu Price
  • 163
  • 12