Our shop has heavy T-SQL stored procedure logic, supporting a web application connecting through JDBC. SQL Server 2008 R2.
From time to time, one of the SQL developers makes a mistake and writes a stored procedure that leaves a transaction open. If you've ever seen this happen in the wild, it's pretty ugly. The connection gets released back to the pool with its rogue, uncommitted transaction. Subsequent transactions seem to be working just fine, but locks are actually accumulating, and work is actually uncommitted. Some period of time later, the blocking gets severe enough that users or monitoring software notices it. It's easy enough to identify the head of the blocking chain and the open transaction, but which procedure left the transaction open, and what can you do about it? At this point the best response seems to be to kill the spid, clear up the blocking, and suffer the consequences.
We have a number of coding standards and practices to prevent this, but a stray RETURN or poorly nested IF block or TRY/CATCH block can cause havoc, usually in some edge case that wasn't fully covered.
The question, for the tl;dr crowd, is this: What techniques are available to ensure that a procedure that starts a transaction doesn't accidentally leave it open on exit, even if the developer makes a dumb mistake?
I realize this is a big ask, but hear me out. We're not looking to prevent any consequences of a coding mistake. We're just looking to contain the consequences to the specific transaction. In short, we'd like to change the behavior of error 266 ("Transaction count after EXECUTE...") to just rollback all transactions. Or even really to commit all transactions. Our fall into the SQL Try/Catch block, so the procedure can respond to the mismatch itself. (If I were king of the world, that would be my choice-- let me catch the SQL runtime error in SQL, and I'll decide what to do.)
Let's say our problem code is in the proc AddProductToRecentlyViewed. Someone recently modified it to do this:
IF @UserName='Grue' RETURN -100; -- Grue is a jerk.
Grue shows up, looks at some products, doesn't see them in Recently Viewed, and goes to lunch. Slowly, this transaction gets passed to the next web transaction, and everything grinds to a halt. Worse, we have actual purchases that seem to be committed, that are in danger of being rolled back. On the operations side, we pretty quickly know there was an error, but it takes a fair bit of digging, and a fair bit of time, to trace it back to this procedure.
It's one thing to absently break Recently Viewed. It's quite another to bring down the whole application.
We do have a standard try/catch block in SQL, similar to the one recommended here. We have unit tests and code reviews, etc. In fact, the error that got through was caught in code review and testing, and corrected-- but the pre-corrected version got promoted accidentally. My VP says, "We're never going to prevent every single dumb mistake. How can we make the next dumb mistake less painful?"
If we had a way to ensure that the procedure just plain couldn't leave a transaction open, we could suffer the consequences of AddProductToRecentlyViewed not working. But I can't find any configuration value or runtime setting that gives me this effect.
Does anyone know of a way to make this more bullet-proof, besides just not making mistakes in T-SQL code?
Here's a couple of things we're considering:
- Implement a standard error handler in the application, to watch for Error 266 and do something. Presumably we'd issue an explicit ROLLBACK from the application error handler, or call a stored proc that just checks XACT_STATE and rolls back and logs. (This is a little challenging for us due to the fact that we don't have a central error handler for DAO's, so it might take some time to add in. Also, we'd really rather enforce the "no open transactions" rule in the database, rather than depending on all calling apps doing it right.)
- On one of our most frequently called stored procs, add in IF XACT_STATE<>0 ROLLBACK or IF XACT_STATE<>0 COMMIT to the very beginning of the proc. This is pretty kludgy, but it has the effect of putting constant downward pressure on the transaction count. It doesn't help identify the root cause, and we'd still have a little time window between when the tran is left open, and when this frequent proc shoots it down. But it would make the wound fairly self-healing while we responded to the log and monitors.
Are there other settings or configurations that help detect and prevent this error, or other solutions you've used to contain it?