Deadlocks causing 'Server failed to resume the transaction' with NHibernate and distributed transactions

Question

We are having an issue when using NHibernate with distributed transactions.

Consider the following snippet:

//
// There is already an ambient distributed transaction
//
using(var scope = new TransactionScope()) {
    using(var session = _sessionFactory.OpenSession())
    using(session.BeginTransaction()) {
        using(var cmd = new SqlCommand(_simpleUpdateQuery, (SqlConnection)session.Connection)) {
            cmd.ExecuteNonQuery();
        }

        session.Save(new SomeEntity());
        session.Transaction.Commit();
    }
    scope.Complete();
}

Sometimes, when the server is under extreme load, we'll see the following:

1. The query executed with cmd.ExecuteNonQuery is chosen as a deadlock victim (we can see it in SQL Profiler), but no exception is raised.
2. session.Save fails with the error message, "The operation is not valid for the state of the transaction."
3. Every time this code is executed after that, session.BeginTransaction fails. The first few times, the inner exception varies (sometimes it is the deadlock exception that should have been raised in step 1). Eventually it stabilizes to "The server failed to resume the transaction. Desc:3800000177." or "New request is not allowed to start because it should come with valid transaction descriptor."

If left alone, the application will eventually (after seconds or minutes) recover from this condition.

Why is the deadlock exception not being reported in step 1? And if we can't resolve that, then how can we prevent our application from temporarily becoming unusable?

The issue has been reproduced in the following environments:

Windows 7 x64 and Windows Server 2003 x86
SQL Server 2005 and 2008
.NET 4.0 and 3.5
NHibernate 3.2, 3.1 and 2.1.2

I've created a test fixture which will sometimes reproduce the issue for us. It is available here: http://wikiupload.com/EWJIGAECG9SQDMZ

See also http://stackoverflow.com/questions/8582206/can-i-use-nhibernates-adonettransactionfactory-with-distributed-transactions — jon without an h, Dec 20 '11 at 21:30
I just addressed a problem very similar to this. What is the lifestyle of the session? — CrazyDart, Jan 11 '12 at 18:02
The SessionFactory is registered as a singleton and created with a factory method. The container does not provide the ISession; it is provided by SessionFactory.GetCurrentSession(). For this we're using WcfOperationSessionContext stolen from the NH3.0 source. — jon without an h, Jan 12 '12 at 20:59
Hmmm, well that is just a bit different from what we are doing. Might I suggest you wrap that session with a using, and perhaps the transaction also? Maybe those dispose methods are not cleaning up correctly when the transaction isn't fully committed? Because the method has a transaction, NHibernate should use the same transaction, right? So a dispose on the Transaction might not actually dispose. Just a thought. — CrazyDart, Jan 13 '12 at 21:34
Please see my latest edits - we've managed to simplify the problem scenario dramatically. — jon without an h, Jan 16 '12 at 19:50
I am unable to reproduce the trouble with .NET Framework 4.0, contrary to what your question states. It seems the trouble requires .NET Framework 3.5 to occur. (The test case supplied on [NH-3023](https://nhibernate.jira.com/browse/NH-3023) has a bug causing its end to always fail even when removing the "deadlock" part. Once fixed, it no longer fails with Fx4, but only with Fx3.5.) — Frédéric, Jul 02 '17 at 13:33

jon without an h · Accepted Answer · 2012-09-11T19:30:28.597

We've finally narrowed this down to a cause.

When opening a session, if there is an ambient distributed transaction, NHibernate attaches a handler to the Transaction.TransactionCompleted event that closes the session when the distributed transaction completes. This appears to be subject to a race condition wherein the connection may be closed and returned to the pool before the deadlock error propagates across, leaving the connection in an unusable state.

The following code will reproduce the error for us occasionally, even without any load on the server. If there is extreme load on the server, it becomes more consistent.

using(var scope = new TransactionScope()) {
    //
    // Force promotion to distributed transaction
    //
    TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

    var connection = new SqlConnection(_connectionString);
    connection.Open();

    //
    // Close the connection once the distributed transaction is
    // completed.
    //
    Transaction.Current.TransactionCompleted += 
        (sender, e) => connection.Close();

    using(connection.BeginTransaction()) {
        //
        // Deadlocks but sometimes does not raise exception
        //
        ForceDeadlockOnConnection(connection);
    }

    scope.Complete();
}

//
// Subsequent attempts to open a connection with the same
// connection string will fail
//

We have not settled on a solution, but the following things will eliminate the problem (while possibly having other consequences):

1. Turning off connection pooling
2. Using NHibernate's AdoNetTransactionFactory instead of AdoNetWithDistributedTransactionFactory
3. Adding error handling that calls SqlConnection.ClearPool() when the "server failed to resume the transaction" error occurs
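
The first and third items in the list above can be sketched roughly as follows, assuming plain ADO.NET (System.Data.SqlClient). The class PoisonedPoolGuard and its method names are hypothetical, and recognising the condition by message text is an assumption based only on the two messages quoted in the question; this is not the implementation that was actually tried.

```csharp
using System;
using System.Data.SqlClient;

// Hypothetical helper, not part of NHibernate or ADO.NET.
public static class PoisonedPoolGuard
{
    // Fragments of the two error messages quoted in the question.
    // Matching on message text is an assumption made for this sketch.
    static readonly string[] Markers =
    {
        "failed to resume the transaction",
        "valid transaction descriptor"
    };

    // Walks the exception chain, because the question reports the error
    // showing up as an inner exception.
    public static bool LooksPoisoned(Exception ex)
    {
        for (var e = ex; e != null; e = e.InnerException)
            foreach (var marker in Markers)
                if (e.Message.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
                    return true;
        return false;
    }

    // First item: the same connection string with pooling switched off.
    public static string WithoutPooling(string connectionString)
    {
        var builder = new SqlConnectionStringBuilder(connectionString) { Pooling = false };
        return builder.ConnectionString;
    }

    // Third item: run some work and clear the pool if the error appears.
    public static void Run(string connectionString, Action<SqlConnection> work)
    {
        var connection = new SqlConnection(connectionString);
        try
        {
            connection.Open();
            work(connection);
        }
        catch (Exception ex)
        {
            // ClearPool is static and takes the connection whose pool
            // should be emptied.
            if (LooksPoisoned(ex))
                SqlConnection.ClearPool(connection);
            throw;
        }
        finally
        {
            connection.Dispose();
        }
    }
}
```

The exception is rethrown after the pool is cleared, so callers still see the failure; the sketch only aims to stop later callers from being handed the same broken pooled connections.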

According to Microsoft (https://connect.microsoft.com/VisualStudio/feedback/details/722659/), the SqlConnection class is not thread-safe, and that includes closing the connection on a separate thread. Based on this response we have filed a bug report for NHibernate (http://nhibernate.jira.com/browse/NH-3023).

Thanks, never realized I had mispasted that. Surprised to hear from someone after so much time! — jon without an h, Sep 11 '12 at 19:31
Is it possible for this scenario to persist across process restarts? I have run into an issue where the connection pool is running out of available connections and persists even after restarting the process throwing the error. I have to restart the DTC to get it working again... — Ross Jones, Nov 12 '12 at 19:29
Hi @jon - could you please elaborate on your findings regarding the 3 possible solutions and which one you ended up using? We've been fighting this issue and this would be a big help. I'm especially interested in the factory replacement solution and whether it had any bad side effects for you — Jonas Høgh, Mar 13 '13 at 17:58
@Jonas, here are our results. Option 1: we ended up going with this because the service was low volume. Option 2: I tried this and it seemed to work, but I could not get enough info about the potential consequences, so we decided not to put it in production. Option 3: I had a working implementation of this, but I did not like the complexity, so we went with option 1. I don't work at that job anymore so I'm not sure where they are with it now - sorry! — jon without an h, Mar 14 '13 at 19:17
@jon Thanks a lot for the info. I'll try to test option 2 further then. — Jonas Høgh, Mar 14 '13 at 20:52
@JonasH, could you update with your results after testing option 2? This looks quite interesting for us as well. Thanks! — janovesk, May 27 '13 at 08:24
@janovesk I am no longer working on the project in question, but AFAIK, they experienced problems getting option 2 to work. I don't know if an alternative solution was found. — Jonas Høgh, May 28 '13 at 06:02

score 0 · Answer 2 · answered Nov 13 '13 at 14:07

It is an NHibernate issue. NHibernate is not opening and closing the connection on the same thread, which is not supported by ADO.NET. You can work around it by opening and closing the connection yourself. NHibernate will not close the connection unless it has also opened it.

Workaround

var connection = ((SessionFactoryImpl)_sessionFactory).ConnectionProvider.GetConnection();
try
{
    using(var session = _sessionFactory.OpenSession(connection))
    {
        // do database stuff
    }
}
finally
{
    connection.Close(); // always close on the thread that opened the connection
}

score 0 · Answer 3 · answered Dec 20 '11 at 21:32

not a definitive answer, but i suspect you have some problems with session management and that you are using the same session across multiple calls to handlers. i don't think it's actually the connection that is in a bad state, but rather the nhibernate session. this doesn't seem to jibe with you not seeing the problem with connection pooling turned off, so i may be off base, but i still suspect it has to do with reusing sessions.

the first thing i would suggest is that you try to confirm this by logging the hashcode of the session and the hashcode of session.GetSessionImplementation() (my understanding of using the castle nhibernate facility is that you will see the same instance of session, even though it is actually a different session and the session implementation will actually show a difference). see if you are seeing the same hashcodes being used in handling different messages.
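
A rough sketch of that logging, under stated assumptions: IdentityLog is a hypothetical helper, and it uses RuntimeHelpers.GetHashCode (an identity-based hash that ignores any GetHashCode override) rather than calling GetHashCode directly, which is a design choice of this sketch and not something the answer specifies. The thread id is added because the thread that touches the connection turned out to matter elsewhere in this thread.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical helper for comparing session identities across calls.
public static class IdentityLog
{
    // Formats the identity hash of the session object and of its
    // underlying implementation, plus the current managed thread id.
    public static string Describe(object session, object sessionImpl)
    {
        return string.Format(
            "session={0:X8} impl={1:X8} thread={2}",
            RuntimeHelpers.GetHashCode(session),
            RuntimeHelpers.GetHashCode(sessionImpl),
            Thread.CurrentThread.ManagedThreadId);
    }
}

// Possible use inside a handler (the logger is whatever you already use):
//   log.Debug(IdentityLog.Describe(session, session.GetSessionImplementation()));
```

If the same impl value appears while handling different messages, that would point at session reuse; different values each time would suggest the problem lies elsewhere.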

if it is a question of session management, try using an nservicebus module to manage your sessions for your handlers. here is a post from andreas about doing that. i don't think his edit about having a way to do this built in on the trunk was in the 2.5 release, so you probably want to go ahead with this. (i could be wrong about that.)

http://andreasohlund.net/2010/02/03/nhibernate-session-management-in-nservicebus/

Dave Rael


Thanks for the response. This service does not actually handle any messages; it only publishes them. It is called by a number of different NServiceBus services which handle messages. We're using Andreas's message module in those services. A session lifestyle issue was one of the first things we suspected; in order to remove this as a possible issue we simplified the session management so that the session is created in the service method (like above), rather than through a WCF extension point which is what we were doing before. – jon without an h Dec 20 '11 at 21:58
I will add some logging for the session hashcode though as it's still possible we have a session management problem. – jon without an h Dec 20 '11 at 22:01
sorry, i didn't notice that you were creating the session right there in the handler. you are probably right and that this is not your issue (assuming the SessionFactory property (assuming it's a property) is in fact your session factory as opposed to the castle SessionManager, which will give back the same session depending on how its session context is set up). also didn't really notice the wcf stuff there - kinda that tl;dr factor. why do you want to use wcf from the handlers instead of going directly to the database? i can't imagine what that saves you. – Dave Rael Dec 20 '11 at 23:34
This service is an NServiceBus publisher and it is used by a number of different NServiceBus subscribers and also a web app. The service method above would publish a message in addition to saving an entity, but I don't think that part is related; the problem seems to be with DTC, NHibernate and WCF. – jon without an h Dec 21 '11 at 01:08
Also, you're correct that the SessionFactory is just a property of type ISessionFactory. We aren't using the Castle NHibernate facility. – jon without an h Dec 21 '11 at 01:08
sorry, i don't have a good answer for the actual question you asked about the connection pool. i don't think it should be a problem to have your transaction reside inside wcf like that. still, i'd suggest getting out of the business of using wcf at all in this architecture. instead of calling a wcf service, send a message to an nservicebus endpoint that does what the service does and publishes the event. taking wcf out of the equation might lend a little more insight into what is happening and maybe the problem is with using wcf itself. – Dave Rael Dec 21 '11 at 16:12

score 0 · Answer 4 · answered Jan 02 '12 at 05:02

This doesn't exactly solve your problem, but you could make your IPreInsertEventListener just send a NSB message, and then have the receiver of the message invoke the stored procedure. I've done that with problematic pre-and post event listeners while using NHibernate and NSB in the past.

Another thought is to have your pre-event listener create its own connection object wrapped in a using statement, so that it won't touch NHibernate's connection. If it deadlocks, then just throw and make sure you've disposed of any objects in scope.

Kevin Up


Thanks Kevin, in this case the PreInsertEventListener needs to happen synchronously. I'm going to look into creating a separate connection, as we've seen similar behavior recently in a somewhat different scenario, where the common element was that we were "hijacking" NHibernate's connection to do additional work. – jon without an h Jan 10 '12 at 14:12
I was able to reproduce the same issue, even when creating an entirely separate connection for the "out-of-band" SQL query. – jon without an h Jan 16 '12 at 20:01
