1

I'm using xp_cmdshell on SQL Server 2016 SP1 (soon to upgrade to a newer version) to run a couple of things, mainly a batch file that calls psftp.exe, for sending files over SFTP.

Very occasionally, psftp.exe will be in the process of sending a file and the process will be killed. Alternatively, the process will crash mid-send. Either way, this means the SPID that spawned psftp.exe with xp_cmdshell stays in a KILLED/ROLLBACK state forever, since the Database is waiting for some response from the batch file via the command shell, which never arrives.

You can test this yourself like so:

In a SQL Query window start a notepad.exe process

xp_cmdshell 'notepad.exe'

then KILL the spid you just created and check the output of sp_who2:

54      RUNNABLE                        Administrator   hostname      . master  KILLED/ROLLBACK     0   0   07/24 14:23:02  Microsoft SQL Server Management Studio - Query  54      0  

This will stay like this forever even though no Rollback is happening. You will still have a running notepad.exe process:

C:\Users\Administrator>tasklist | find "notepad"

notepad.exe                   4676 Services                   0      3 788 K

Once you quit / end that notepad.exe process, the SQL Server spid is gone too.

It's possible for this to block another process that's using the same DB as the stuck process, and in turn it's possible for that to block TEMPDB, which is very bad. The only solution I've found is to remote into the server and use Task Manager to kill the stuck process, which unblocks everything.

I have a few questions:

1: Can I prevent this behaviour? Is there some kind of process isolation or task isolation I can do to run the batch file without it being bound to a SPID? as in, could I launch my batch file using some boxed-off process that can't block the DB, and therefore can't get stuck in a rollback state, or if it does, it doesn't block anything else?

2: Can I have a script running to detect this situation and automatically remedy it? e.g. Could I check every 5-10 minutes for a xp_cmdshell process stuck in KILLED\ROLLBACK, figure out the Windows task that's keeping in that state and stop it?

3: Are these issues addressed in a more recent SQL Server edition? Would updating to 2019 solve the issue with minimal effort?

Any advice (other than "Don't use your SQL Server this way, dumbass!") welcomed.

Geoff Griswald
  • 937
  • 12
  • 24
  • Don't use the database this way. Just don't. And the question makes no sense - KILLED means *killed* as in no longer alive - the *database connection* no longer exists. It's not Notepad that got rolled back (there's no such thing), it's the connection that got killed. – Panagiotis Kanavos Jan 21 '20 at 10:02
  • If you want to run an external script or command, use a SQL Server Agent job. You can *easily* write a simple script that exports the data you want to files then uploads them wherever needed. I do that to *pull* data using SFTP. Or you can use SSIS to export, transform and transfer the data. No SFTP task out-of-the-box unfortunatelly – Panagiotis Kanavos Jan 21 '20 at 10:02
  • [This article](https://www.mssqltips.com/sqlservertip/3435/using-sftp-with-sql-server-integration-services/) shows how to use `psftp` through a Process Task in SSIS. I've also used Psh-SSH in a Powershell agent job. Check [Upload File to SFTP Using Powershell](https://stackoverflow.com/questions/38732025/upload-file-to-sftp-using-powershell) – Panagiotis Kanavos Jan 21 '20 at 10:08
  • 1) No. 2) Yes -- `sys.dm_exec_sessions` contains the `host_process_id`. 3) No. – Jeroen Mostert Jan 21 '20 at 10:32
  • Jeroen, I was able to find the PID of the parent process using that table. E.G if I ran my batch file using SSMS, it gives me the PID of my SSMS connection. If I ran it using an agent job, it gives me the PID of the SQL Agent Job container. What I need is the PID of the program. In this case notepad.exe but in real life it would be my PSFTP.exe or the command shell containing the batch file that ran that executable... – Geoff Griswald Jan 21 '20 at 12:39
  • Do you? Because according to your problem description, that program crashed. And `cmd` isn't going to be hanging around either if the child process got killed. It seems more likely in that case that you're diagnosing a hang. It is, of course, possible to explicitly walk the process tree to find the child processes of the PID that the connection's associated with, but at that point it starts falling into the "already more trouble than it's worth" bucket and the "shall we not rather build something more reliable instead" bucket starts looking better and better. – Jeroen Mostert Jan 21 '20 at 13:34
  • This solution saves me ever having to touch SSIS, which makes any amount of trouble worth it IMO. If I did want to walk that process tree and see what the child processes were, how would one start to do that? – Geoff Griswald Jan 21 '20 at 13:38
  • [`pskill`](https://learn.microsoft.com/sysinternals/downloads/pskill) promises to kill "descendants". You can also write custom code that uses the WMI `Win32_Process` class, or go really native with the Toolhelp32 functions, but if you knew how to call those you probably wouldn't be asking the question. :-) – Jeroen Mostert Jan 21 '20 at 13:55
  • FYI, pskill.exe only works on my Windows Server 2012 machine if I set it to "Windows XP" compatibility mode, but it does the job well. Thanks again :) – Geoff Griswald Jan 21 '20 at 14:50
  • It works with no particular flags or options on my Windows 10 machine, but then I'm not using any of the fancy remoting stuff, which might add complications. You may possibly be running into elevation issues that the XP mode "solves" for you the wrong way (it will not auto-elevate, so you need actual administrative permission with an elevated process/shell to kill from the command line). – Jeroen Mostert Jan 21 '20 at 15:15
  • Ah. I see... Yes I'm getting some weird permissions issues doing it that way actually... Just needed to give my Service Account proper admin powers, to be able to run it not in XP Compatiblity mode. – Geoff Griswald Jan 21 '20 at 16:13
  • I found a better way anyway, my system admin doesn't really like pskill. If I use this: "wmic process where (ParentProcessId=2480) get Caption,ProcessId" Where "2480" is the SQLAGENT.EXE process identified by my query on dm_exec_sessions, then it gives me all child processes. I just follow those down until I get the last one in the chain, and that's my notepad.exe, and in real life that will be my psftp.exe! – Geoff Griswald Jan 21 '20 at 16:39

1 Answers1

1

Thanks to some pointers from @Jeroen Mostert I was able to come up with the following solution. It's a stored procedure that I run periodically from an Agent Job, which checks for and fixes sessions stuck in KILLED/ROLLBACK that were caused by killing a session that was in the middle of running an xp_cmdshell task:

CREATE PROCEDURE [dbo].[sp_KillStuckSessions]

AS
BEGIN
SET NOCOUNT ON;

DECLARE @ErrorText varchar(4000)

--Find out if we have any Sessions stuck in a KILLED/ROLLBACK state:
        IF NOT EXISTS ( SELECT a.spid
                          FROM sys.sysprocesses a WITH(NOLOCK)
                    INNER JOIN sys.dm_exec_sessions b WITH(NOLOCK) ON a.spid = b.session_id
                         WHERE a.cmd = 'KILLED/ROLLBACK'
                           AND DATEDIFF(minute,a.last_batch,GETDATE()) > 5)
--If we don't, skip to the end and do nothing.      
      GOTO Endpoint

--Find out what the stuck session actually is
   DECLARE @StuckTaskName varchar(1000)
   DECLARE @NumStuckJobs integer
    SELECT @NumStuckJobs = COUNT(*) FROM (
    SELECT (
                SELECT text FROM sys.dm_exec_sql_text(a.sql_handle)) AS SqlCommand
                  FROM sys.sysprocesses a WITH(NOLOCK)
            INNER JOIN sys.dm_exec_sessions b WITH(NOLOCK) ON a.spid = b.session_id
                 WHERE a.cmd = 'KILLED/ROLLBACK'
                   AND DATEDIFF(minute,a.last_batch,GETDATE()) > 5
           ) a

--If its a single xp_cmdshell process, we want to kill it
        IF @NumStuckJobs <> 1 
           BEGIN
                  SET @ErrorText = 'More than 1 session is stuck in KILLED/ROLLBACK - KillStuckSessions cannot kill more than 1. Please remote into the server and fix.'
                 GOTO ErrorHandling 
             END
      ELSE 
           SET @StuckTaskName = (SELECT (SELECT text FROM sys.dm_exec_sql_text(a.sql_handle)) AS SqlCommand
                                   FROM sys.sysprocesses a WITH(NOLOCK)
                             INNER JOIN sys.dm_exec_sessions b WITH(NOLOCK) ON a.spid = b.session_id
                                  WHERE a.cmd = 'KILLED/ROLLBACK')

--If its anything else, we want to raise an error message with custom error text.
        IF @StuckTaskName <> 'xp_cmdshell' 
           BEGIN
                  SET @ErrorText = 'A stuck session was caused by something other than xp_cmdshell - KillStuckSessions cannot kill this process. Please remote into the server and fix.'
                 GOTO ErrorHandling
             END

--If we get this far, we know that there's one stuck session and it is an xp_cmdshell task.
--Let's kill the task!

--Declare a bunch of variables including a table variable to store the output of our wmic commands
   DECLARE @wmicdata table (ProcessID varchar(1000) NULL)
   DECLARE @SQLServerPID integer
   DECLARE @PIDTestA1 integer
   DECLARE @PIDTestA2 integer
   DECLARE @PIDTestB1 integer
   DECLARE @PIDTestB2 integer
   DECLARE @OrphanedPID integer
   DECLARE @StuckProcess1 integer
   DECLARE @StuckProcess2 integer

--First we run this to get the PID of sqlserver.exe, which is what will have spawned xp_cmdshell
    INSERT @wmicdata
      EXEC xp_cmdshell 'wmic process where (name="sqlservr.exe") get ProcessID'   
       SET @SQLServerPID = (
    SELECT CAST(SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID),PATINDEX('%[^0-9]%', SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID), 8000) + 'X') -1) AS integer)
      FROM @wmicdata
     WHERE STUFF(ProcessID, 1, PATINDEX('%[0-9]%', ProcessID)-1, '') IS NOT NULL
       AND ProcessID IS NOT NULL)

--Next, we run this twice to get the process ID of any stuck child processes of sql server.
   DECLARE @cmdshellInput varchar(1000)
       SET @cmdshellInput = 'wmic process where (ParentProcessId=' + CAST(@SQLServerPID AS varchar(10)) + ') get ProcessID'

--Run 1    
    DELETE @wmicdata
    INSERT @wmicdata
      EXEC xp_cmdshell @cmdshellInput
       SET @PIDTestA1 = (
    SELECT TOP 1 CAST(SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID),PATINDEX('%[^0-9]%', SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID), 8000) + 'X') -1) AS integer)
      FROM @wmicdata
     WHERE STUFF(ProcessID, 1, PATINDEX('%[0-9]%', ProcessID)-1, '') IS NOT NULL
       AND ProcessID IS NOT NULL
  ORDER BY ProcessID ASC)
       SET @PIDTestA2 = (
    SELECT TOP 1 CAST(SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID),PATINDEX('%[^0-9]%', SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID), 8000) + 'X') -1) AS integer)
      FROM @wmicdata
     WHERE STUFF(ProcessID, 1, PATINDEX('%[0-9]%', ProcessID)-1, '') IS NOT NULL
       AND ProcessID IS NOT NULL
  ORDER BY ProcessID DESC)

--Wait a second
   WAITFOR DELAY '00:00:01'

--Run 2
    DELETE @wmicdata
    INSERT @wmicdata
      EXEC xp_cmdshell @cmdshellInput
       SET @PIDTestB1 = (
    SELECT TOP 1 CAST(SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID),PATINDEX('%[^0-9]%', SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID), 8000) + 'X') -1) AS integer)
      FROM @wmicdata
     WHERE STUFF(ProcessID, 1, PATINDEX('%[0-9]%', ProcessID)-1, '') IS NOT NULL
       AND ProcessID IS NOT NULL
  ORDER BY ProcessID ASC)
       SET @PIDTestB2 = (
    SELECT TOP 1 CAST(SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID),PATINDEX('%[^0-9]%', SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID), 8000) + 'X') -1) AS integer)
      FROM @wmicdata
     WHERE STUFF(ProcessID, 1, PATINDEX('%[0-9]%', ProcessID)-1, '') IS NOT NULL
       AND ProcessID IS NOT NULL
  ORDER BY ProcessID DESC)

--The number that doesn't immediately change is our orphaned cmd.exe process. 
--The number that does change is the wmic command that we're running to get the PID!
        IF @PIDTestA1 = @PIDTestB1
           SET @OrphanedPID = @PIDTestA1
        IF @PIDTestA1 = @PIDTestB2
           SET @OrphanedPID = @PIDTestA1
        IF @PIDTestA2 = @PIDTestB1
           SET @OrphanedPID = @PIDTestA2
        IF @PIDTestA2 = @PIDTestB2
           SET @OrphanedPID = @PIDTestA2

--If none of them matched, then we had a problem.
        IF @OrphanedPID IS NULL
           BEGIN 
                  SET @ErrorText = 'A stuck session was caused by something other than xp_cmdshell - KillStuckSessions cannot kill this process. Please remote into the server and fix.'
                 GOTO ErrorHandling
             END

--Next, run it again on the orhpaned cmd.exe process to find out the child processes that we want to kill
--There should be 2 tasks total which we want to kill, a conhost.exe and the actual executable which we invoked from xp_cmdshell (psftp.exe)
       SET @cmdshellInput = 'wmic process where (ParentProcessId=' + CAST(@OrphanedPID AS varchar(10)) + ') get ProcessID'
    DELETE @wmicdata
    INSERT @wmicdata
      EXEC xp_cmdshell @cmdshellInput

--Identify Stuck Process 1    
       SET @StuckProcess1 = (
    SELECT TOP 1 CAST(SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID),PATINDEX('%[^0-9]%', SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID), 8000) + 'X') -1) AS integer)
      FROM @wmicdata
     WHERE STUFF(ProcessID, 1, PATINDEX('%[0-9]%', ProcessID)-1, '') IS NOT NULL
       AND ProcessID IS NOT NULL
  ORDER BY ProcessID ASC)

--Identify Stuck Process 2
       SET @StuckProcess2 = (
    SELECT TOP 1 CAST(SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID),PATINDEX('%[^0-9]%', SUBSTRING(ProcessID, PATINDEX('%[0-9]%', ProcessID), 8000) + 'X') -1) AS integer)
      FROM @wmicdata
     WHERE STUFF(ProcessID, 1, PATINDEX('%[0-9]%', ProcessID)-1, '') IS NOT NULL
       AND ProcessID IS NOT NULL
  ORDER BY ProcessID DESC)

--Get the .exe names of the processes we're about to kill
   DECLARE @exe1 AS varchar(1000)
   DECLARE @exe2 AS varchar(1000)
       SET @cmdshellInput = 'wmic process where (ParentProcessId=' + CAST(@OrphanedPID AS varchar(10)) + ') get Name'
    DELETE @wmicdata
    INSERT @wmicdata
      EXEC xp_cmdshell @cmdshellInput
       SET @exe1 = (SELECT TOP 1 ProcessID FROM @wmicdata WHERE ProcessID LIKE '%.%' ORDER BY ProcessID ASC)
       SET @exe2 = (SELECT TOP 1 ProcessID FROM @wmicdata WHERE ProcessID LIKE '%.%' ORDER BY ProcessID DESC)

--Finally, run these to kill the orphaned child tasks
       SET @cmdshellInput = 'taskkill /f /pid ' + CAST(@StuckProcess1 AS varchar(10))
      EXEC xp_cmdshell @cmdshellInput
       SET @cmdshellInput = 'taskkill /f /pid ' + CAST(@StuckProcess2 AS varchar(10))
      EXEC xp_cmdshell @cmdshellInput

--Email to say a process was killed
   DECLARE @EmailBody varchar(4000)
       SET @EmailBody = 'A task causing an SQL session to be stuck in KILLED/ROLLBACK was automatically killed on the server.<P>The name(s) of the executables that were killed:<P>' + @exe1 + '<P>' + @exe2
      EXEC Automation.dbo.sp_SendEmailExternally --This is another stored proc of mine that sends emails. You could swap this for sp_send_dbmail
           @Recipients = 'your@email.here',
           @Subject = 'Agent Job KillStuckSessions found and killed a stuck session!',  
           @HTMLBody = @EmailBody

--Finished! Now the killed/rollback session will be gone. 
--Skip the ErrorHandling section
      GOTO Endpoint

ErrorHandling:
--Since We're running from an Agent Job, using RAISERROR here will generate an Alert email, and the Agent Job history will contain the custom @ErrorText message.
 RAISERROR (@ErrorText, 16, 1)

EndPoint:

END
GO
Geoff Griswald
  • 937
  • 12
  • 24