11

Just wanted to run this by you all to see if there are any bright ideas as I have exhausted all of my ideas after an entire day, night and morning of searching. The issues we’re encountering invariably centre around database connectivity when under concurrent usage (selenium test), e.g. timeouts, dropped/closed connections, database server unreachable.

The issue does seem to be restricted to Azure as we’re yet to encounter the issue locally even when running the same selenium test on the same code pointing to the same database (SQL Azure) so it would point to it being some issue with outbound database connectivity in SQL Azure. So far we’ve tried the following:

  1. Azure transient fault handling – We have retry logic in place for when there is a temporary issue with the SQL Azure service itself.
  2. Change communication protocol – We’ve tried both TCP and Named Pipes and we encounter the same issue with both.
  3. Adjust database connection timeout interval – We’ve tried increasing this to no avail.
  4. Addition of Multiple Active Result Sets – We’ve add this to the connection string to no avail.
  5. Connection state check on every query – When we return the DataContext we check for its connection and reopen where necessary.
  6. Turned off connection pooling - We’ve also attempted this without success.
  7. Changed design pattern – We even went to the lengths of implementing a Unit of Work design pattern, where the database connections were being fired up and disposed of after every unit of work but this caused issues elsewhere with lazy loading, passing objects into methods and it would have been too substantial a rework at this point.

  8. Change role size – The last thing I can think to try is to up the role size in case of any implicit connection restrictions in Windows Azure – that’s currently deploying so there’s still half a chance it might work!

The site infrastructure is as follows:

  • DataContext class (extends DbContext) which is a Code First EF context.
  • BusinessLayer class contains a private, non-static DataContext. DataContext is constructor injected into each Manager/Helper class.
  • BusinessLayerService class contains a public, thread static BusinessLayer instance.
  • MVC site uses BusinessLayerService.Instance to access the manager classes which query and update the DataContext they’ve been passed.

Any help would be greatly appreciated.

UPDATE: We upped the VM size to Medium and all it did was mean that the same issue took longer to occur.

When the issues started occuring, a team member noted the following exception occured:

InvalidOperationException: Execution of the command requires an open and available connection. The connection's current state is broken.

This started occuring whenever the database was being hit (wasn't specific to a certain area of code).

Mathew Thompson
  • 55,877
  • 15
  • 127
  • 148
  • Have you tried with classic ADO.Net approach instead of EF? I was told that folks had more issues with EF than ADO.Net when working with SQL Azure. – Gaurav Mantri Sep 25 '12 at 13:53
  • 1
    @GauravMantri The problem is, this and various other projects we already have out there in the wild all use EF, so switching to ADO.NET would be both a time-consuming and costly process :) – Mathew Thompson Sep 25 '12 at 13:57
  • What instance size were you using? Bear in mind that network bandwidth is also limited by instance size, so a Small instance has only ~100Mb. – Mark Rendle Sep 25 '12 at 14:13
  • @MarkRendle it was Small I think, but we've upped it to Medium, which is deploying now... – Mathew Thompson Sep 25 '12 at 14:20
  • 1
    One thing I would do is to enhance the use of the Transient Fault Handling Framework (TFHF). By default it only consumes/handles database-level faults. But there are other faults you need to account for, such as load balancer errors. For example you could get SocketIO errors, which depending on the error itself could be a form of throttling or simple network error that warrants a retry. So I would say make sure you look at the actual exceptions and determine if they should be added to the TFHF. – Herve Roggero Sep 25 '12 at 16:52
  • @MarkRendle Upping the instance size merely only made it longer for the issue to occur :(. The exception that one of our team members got a screenshot of was an InvalidOperationException: Execution of the command requires an open and available connection. The connection's current state is broken. – Mathew Thompson Oct 02 '12 at 13:21
  • 1
    @mattytommo [This](http://stackoverflow.com/questions/2863224/ef-recovery-from-invalidoperationexception-caused-by-server-being-down) may be of some help – Pittsburgh DBA Oct 02 '12 at 15:22
  • @PittsburghDBA Thanks, but it all works fine outside of Azure, so I suspect it's not just our EF implementation, it could be a combination of both Azure and EF. – Mathew Thompson Oct 04 '12 at 09:06
  • As @herve-roggero mentioned this exception should be handled by retry policy (if exception is transient). – Matej Oct 05 '12 at 00:17

1 Answers1

5

I've encountered this kind of issue before. In my case it was to do with Entity Framework ObjectContexts not being disposed of properly, before eventually too many contexts were spun up and the site toppled over. We identified using the Entity Framework Profiler that there were a LOT of unclosed ObjectContexts when errors were being thrown.

It wasn't feasible for us to move to a Unit of Work design pattern (or similar) as it would have required a rewrite of the business layer, therefore, we resolved it by closing the ObjectContexts manually after every page request. We adopted the approach of disposing the context manually in the End Request event of Global.asax, however, other valid approaches would have been to store the context in the HttpContext or implement an IoC container (such as Castle Windsor) with a "per request" lifestyle.

JonKers
  • 381
  • 1
  • 4