Context
We are developing our own web application developed in C# and relying on EF to consume a SQL database.
The application is hosted on Azure in 5 differents datacenter (cloud) but also on customers On-Premise servers.
Issue
We're suffering since many months now from sql timeout error causing slowness from customer point of view mainly.
What we know
This kind of errors show up during working time and there is no one error during the week-end: so it seems to be linked to customer usage on cloud?
The IIS application is hosted on a Windows Server.
The SQL server is hosted as a docker container on a Linux Server on the same private network.
We have a Grafana stack to overview our resources consumption mainly on production server (windows & linux).
- On Linux metrics, we can see a low usage of CPU and Memory seems to be fine too: at least, SQL server container is eating all the free memory possible and we already tried to move from 32Gb to 64Gb in our Azure VM sizing without success...
The timeout error is firing on various different kind of EF query: from insert to basic select one of row in a table...
An interesting thing
Few days ago, an On-Premise customer has installed our product on his server (windows). He also has a sql server outside of the VM: this is where our product database is stored. The application was very slow on this server and we discovered some timeout error like the one we have in cloud! It was the first time it happened on an On-Premise server. The thing is that there is no user load on his server since it was a fresh install not already used by end user.
But the more interesting thing is coming: we tried to install a SQL Express Server on the customer window server in a way to store the database better than using his own sql server outside of the VM.
The result is the following: the application is as fast as possible and no timeout anymore!
We're definetly lost
Many months of investigations lead me to post this topic today because I have no more idea:
- We tried to investigate in SQL Server in Query Store to find which queries could cause Timeout during weeks: we identified few of them and improved them but it didn't help.
- We tried to increase the VM sizing without success
- CPU & Memory & Disk usage seem to be fine, even if it's always hard to be sure of that
Ideas
- Is there any miss configuration in our connectionstring that could explain the difference between external SQL server and local SQL Express?
- Could it be network issue?
- Something else?
Stacktrace
Microsoft.EntityFrameworkCore.DbUpdateException: An error occurred while updating the entries. See the inner exception for details. ---> System.Data.SqlClient.SqlException: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception: The wait operation timed out
--- End of inner exception stack trace ---
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)