How is it possible that kill -9 for a process on Linux has no effect?

Question

I'm writing a plugin to highlight text strings automatically as you visit a web site. It's like the highlight search results but automatic and for many words; it could be used for people with allergies to make words really stand out, for example, when they browse a food site.

But I have a problem. When I try to close an empty, fresh FF window, it somehow blocks the whole process. When I kill the process, all the windows vanish, but the Firefox process stays alive (parent PID is 1, doesn't listen to any signals, has lots of resources open, still eats CPU, but won't budge).

So two questions:

How is it even possible for a process not to listen to kill -9 (neither as user nor as root)?
Is there anything I can do but a reboot?

[EDIT] This is the offending process:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
digulla  16688  4.3  4.2 784476 345464 pts/14  D    Mar28  75:02 /opt/firefox-3.0/firefox-bin

Same with ps -ef | grep firefox

UID        PID  PPID  C STIME TTY          TIME CMD
digulla  16688     1  4 Mar28 pts/14   01:15:02 /opt/firefox-3.0/firefox-bin

It's the only process left. As you can see, it's not a zombie, it's running! It doesn't listen to kill -9, no matter if I kill by PID or name! If I try to connect with strace, then the strace also hangs and can't be killed. There is no output, either. My guess is that FF hangs in some kernel routine but which?

[EDIT2] Based on feedback by sigjuice:

ps axopid,comm,wchan

can show you in which kernel routine a process hangs. In my case, the offending plugin was the Beagle Indexer (openSUSE 11.1). After disabling the plugin, FF was a quick and happy fox again.

I understand that this is not directly programming related but I have a pretty good knowledge of Linux and Unix in general and I'm really wondering how a process can a) eat CPU and b) ignore kill -9? Isn't kill -9 supposed to do its job outside the process? — Aaron Digulla, Mar 29 '09 at 15:22
I think "not programming related" is harsh. If Aaron was modifying the firefox code himself, and asked this exact question about linux, then it would be programming related. Surely OS kernel behaviour has *something* to do with programming? — Steve Jessop, Mar 29 '09 at 16:08
@Aaron The STAT column says "D", which means "Uninterruptible sleep". A process in this state cannot be killed at all. Is your home directory NFS mounted or is Firefox accessing an NFS directory in some other way? — sigjuice, Mar 29 '09 at 16:24
@Aaron "ps axopid,comm,wchan" might show you which kernel routine Firefox is stuck inside. — sigjuice, Mar 29 '09 at 16:26
Wow! There are so many questions that are less programming related than this one! — shodanex, Mar 29 '09 at 16:30
@sigjuice: Thanks, that's probably what I'm looking for. I'll try as soon as I get home. :) — Aaron Digulla, Mar 31 '09 at 13:30
Okay, I rephrased my question so it's more programming related. Let me know if you have some objections. — Aaron Digulla, Mar 31 '09 at 13:41
This *is* programming related as it might be very appropriate for shell scripting for example in a build system or an automated test framework. — Harvey, Mar 31 '09 at 13:51
@sigjuice: Ok, question is open again. Please post your ps command, so I can give you the well deserved +1! — Aaron Digulla, Mar 31 '09 at 14:31
A duplicate of this one on [U&L](http://unix.stackexchange.com/questions/5642/). Could they be merged? — alexei, Nov 20 '12 at 00:22
This question appears to be off-topic because it is about Unix & Linux (unix.stackexchange.com) — slm, Dec 03 '14 at 01:14

score 126 · Accepted Answer · answered Mar 31 '09 at 14:07

As noted in comments to the OP, a process status (STAT) of D indicates that the process is in an "uninterruptible sleep" state. In real-world terms, this generally means that it's waiting on I/O and can't/won't do anything - including dying - until that I/O operation completes.

Processes in a D state will normally only be there for a fraction of a second before the operation completes and they return to R/S. In my experience, if a process gets stuck in D, it's most often trying to communicate with an unreachable NFS or other remote filesystem, trying to access a failing hard drive, or making use of some piece of hardware by way of a flaky device driver. In such cases, the only way to recover and allow the process to die is to either get the fs/drive/hardware back up and running so the I/O can complete or to give up and reboot the system. In the specific case of NFS, the mount may also eventually time out and return from the I/O operation (with a failure code), but this is dependent on the mount options and it's very common for NFS mounts to be set to wait forever.

This is distinct from a zombie process, which will have a status of Z.
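To check the state from a program rather than from ps, here is a minimal C sketch. It assumes Linux with /proc mounted, and the function names (parse_proc_stat_state, read_process_state) are made up for this illustration. It reads the same letter that ps prints in the STAT column: D for uninterruptible sleep, Z for a zombie.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Extract the state letter from one /proc/<pid>/stat line.
 * The command name is wrapped in parentheses and may itself contain
 * spaces or ')', so we search for the LAST ')' in the line.
 * Returns '\0' if the line does not look like a stat line. */
char parse_proc_stat_state(const char *stat_line)
{
    const char *close_paren = strrchr(stat_line, ')');
    if (close_paren == NULL || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '\0';
    return close_paren[2];
}

/* Read the state letter of a process from /proc (Linux only).
 * Returns '\0' if the process does not exist or /proc is unavailable. */
char read_process_state(pid_t pid)
{
    char path[64];
    char line[512];

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return '\0';

    char *ok = fgets(line, sizeof line, fp);
    fclose(fp);
    return ok != NULL ? parse_proc_stat_state(line) : '\0';
}
```

For the process in the question, read_process_state(16688) would have returned 'D' for as long as the process was stuck.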

Dave Sherohman

Yes, the dreaded disk sleep :) +1 , this is a well articulated answer. – Tim Post Apr 01 '09 at 05:26
Huh. I don't suppose there's a way to set a timeout on an NFS/SMB/etc. mount *after* you've gotten yourself into this situation? – SamB Feb 08 '12 at 22:50
You may also see things stuck in D because of a kernel bug. – poolie May 09 '13 at 08:31
A good way to see what's going on is to run `ps -o pid,wchan 1234` (inserting the relevant pid), and that will tell you which *wait channel* it's stuck on in the kernel. You can use that in a bug report, or Google it and it may give you a clue what's going on - whether it's stuck in NFS, some other driver, etc. – poolie May 09 '13 at 08:33
Good answer!!! A zombie process cannot be killed either (indeed, it's a zombie). You have to kill the parent, so the zombie gets a new parent (init), which is always doing a wait(2), and the zombie finally goes away on that wait(2). – Luis Colorado Sep 24 '14 at 20:37

score 8 · Answer 2 · edited Mar 31 '09 at 22:51

Double-check that the parent-id is really 1. If not, and this is firefox, first try sudo killall -9 firefox-bin. After that, try killing the specific process IDs individually with sudo kill -9 [process-id].

How is it even possible for a process not to listen to kill -9 (neither as user nor as root)?

If a process has gone <defunct>, it has become a zombie, and you can't kill it manually; it can only be reaped by its parent (or by init, once init has adopted it). Zombie processes are already dead and gone - they've lost the ability to be killed as they are no longer processes, only a process table entry and its associated exit code, waiting to be collected. You need to kill the parent so that init adopts and reaps the zombie, and you can't kill init for obvious reasons.
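To illustrate the point about zombies, here is a small C sketch. The function name zombie_survives_sigkill is made up for this example, and it assumes SIGCHLD is not being ignored by the calling process. It creates a zombie on purpose, sends it SIGKILL, and shows that the process table entry only disappears once the parent collects it with waitpid().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a zombie child accepted SIGKILL, stayed in the process
 * table anyway, and vanished only after being reaped.
 * Returns 0 if the behaviour differed, -1 on a setup error. */
int zombie_survives_sigkill(void)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0)
        _exit(0);               /* the child dies immediately */

    /* Block until the child has exited, but leave it unreaped
     * (WNOWAIT), so it is guaranteed to be a zombie from here on. */
    siginfo_t info;
    if (waitid(P_PID, (id_t)child, &info, WEXITED | WNOWAIT) == -1) {
        waitpid(child, NULL, 0);
        return -1;
    }

    /* The signal is accepted, but there is nothing left to kill. */
    int kill_accepted = (kill(child, SIGKILL) == 0);
    /* Signal 0 only probes for existence: the entry is still there. */
    int still_listed = (kill(child, 0) == 0);

    /* Reaping is what actually removes the process table entry. */
    if (waitpid(child, NULL, 0) != child)
        return -1;
    int gone = (kill(child, 0) == -1 && errno == ESRCH);

    return kill_accepted && still_listed && gone;
}
```

Note that this is a different situation from the D state in the question: a zombie uses no CPU and holds no open resources.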

But see here for more general information. A reboot will kill everything, naturally.

edited Mar 31 '09 at 22:51

Dave Sherohman

answered Mar 29 '09 at 14:44

John Feminella

Actually zombie is not a process, but a process table entry, which exists for sole purpose of sending SIGCHLD to parent. Init joins automatically, so you can't really have zombie with PPID==1. – vartec Mar 29 '09 at 15:18
And it was my understanding that kill -9 will remove the process from the process table without asking many questions. But it seems that a process can prevent that somehow. How? – Aaron Digulla Mar 29 '09 at 15:30
@vartec: You're right, I didn't really phrase that well. If you'd like to edit it, please go ahead. @Aaron: Processes that are dead don't listen to kill signals (there's nothing to kill, and they still need to be reaped so they can't go away yet). – John Feminella Mar 29 '09 at 15:32
Edited to clarify that a zombie is no longer an actual process. – Dave Sherohman Mar 31 '09 at 22:52

NGI · Answer 3 · 2017-11-14T23:27:42.567

I recently fell into a double-fork pitfall and landed on this page before finally finding my answer. The symptoms are identical even if the problem is not the same:

WYKINWYT: What You Kill Is Not What You Thought

The minimal test code is shown below, based on an example for an SNMP daemon:

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char* argv[])
{
    //We omit the -f option (do not Fork) to reproduce the problem
    char * options[]={"/usr/local/sbin/snmpd",/*"-f",*/"-d","--master=agentx", "-Dagentx","--agentXSocket=tcp:localhost:1706",  "udp:10161", (char*) NULL};

    pid_t pid = fork();
    if ( 0 > pid ) return -1;

    switch(pid)
    {
        case 0: 
        {   //Child launches SNMP daemon
            execv(options[0],options);
            exit(-2); //Only reached if execv() fails
            break;
        }
        default: 
        {
            sleep(10); //Simulate "long" activity

            kill(pid,SIGTERM);//kill what should be child, 
                              //i.e the SNMP daemon I assume
            printf("Signal sent to %d\n",(int)pid);

            sleep(10); //Simulate "long" operation before closing
            waitpid(pid,NULL,0); //Reap the child we forked
            printf("SNMP should be now down\n");

            getchar();//Blocking (for observation only)
            break;
        }
    }
    printf("Bye!\n");
    return 0;
}

During the first phase the main process (7699) launches the SNMP daemon (7700), but we can see that this one is now defunct (a zombie). Besides, we can see another process (7702) with the options we specified:

[nils@localhost ~]$ ps -ef | tail
root       7439      2  0 23:00 ?        00:00:00 [kworker/1:0]
root       7494      2  0 23:03 ?        00:00:00 [kworker/0:1]
root       7544      2  0 23:08 ?        00:00:00 [kworker/0:2]
root       7605      2  0 23:10 ?        00:00:00 [kworker/1:2]
root       7698    729  0 23:11 ?        00:00:00 sleep 60
nils       7699   2832  0 23:11 pts/0    00:00:00 ./main
nils       7700   7699  0 23:11 pts/0    00:00:00 [snmpd] <defunct>
nils       7702      1  0 23:11 ?        00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils       7727   3706  0 23:11 pts/1    00:00:00 ps -ef
nils       7728   3706  0 23:11 pts/1    00:00:00 tail

After the simulated 10 seconds, we try to kill the only process we know (7700), and it is finally reaped by waitpid(). But process 7702 is still here:

[nils@localhost ~]$ ps -ef | tail
root       7431      2  0 23:00 ?        00:00:00 [kworker/u256:1]
root       7439      2  0 23:00 ?        00:00:00 [kworker/1:0]
root       7494      2  0 23:03 ?        00:00:00 [kworker/0:1]
root       7544      2  0 23:08 ?        00:00:00 [kworker/0:2]
root       7605      2  0 23:10 ?        00:00:00 [kworker/1:2]
root       7698    729  0 23:11 ?        00:00:00 sleep 60
nils       7699   2832  0 23:11 pts/0    00:00:00 ./main
nils       7702      1  0 23:11 ?        00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils       7751   3706  0 23:12 pts/1    00:00:00 ps -ef
nils       7752   3706  0 23:12 pts/1    00:00:00 tail

After giving a character to the getchar() function, our main process terminates, but the SNMP daemon with the pid 7702 is still here:

[nils@localhost ~]$ ps -ef | tail
postfix    7399   1511  0 22:58 ?        00:00:00 pickup -l -t unix -u
root       7431      2  0 23:00 ?        00:00:00 [kworker/u256:1]
root       7439      2  0 23:00 ?        00:00:00 [kworker/1:0]
root       7494      2  0 23:03 ?        00:00:00 [kworker/0:1]
root       7544      2  0 23:08 ?        00:00:00 [kworker/0:2]
root       7605      2  0 23:10 ?        00:00:00 [kworker/1:2]
root       7698    729  0 23:11 ?        00:00:00 sleep 60
nils       7702      1  0 23:11 ?        00:00:00 /usr/local/sbin/snmpd -Lo -d --master=agentx -Dagentx --agentXSocket=tcp:localhost:1706 udp:10161
nils       7765   3706  0 23:12 pts/1    00:00:00 ps -ef
nils       7766   3706  0 23:12 pts/1    00:00:00 tail

Conclusion

Because we ignored the double-fork mechanism, we thought that the kill action did not succeed. In fact, we simply killed the wrong process!

With the -f option added (do not (double) fork), everything works as expected.
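The effect can be reproduced without snmpd. The sketch below is an illustration only: spawn_double_forking_child is a made-up name, and a child that calls pause() stands in for the daemon. It forks a child that forks again and exits, which is roughly what a daemonizing program does. The pid that fork() returns to the caller belongs to the short-lived intermediate process, not to the process that keeps running.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that forks a "daemon" and then exits at once.
 * Returns the pid of the intermediate child (what fork() gave us) and
 * stores the pid of the surviving grandchild in *daemon_pid.
 * Returns -1 on error. */
pid_t spawn_double_forking_child(pid_t *daemon_pid)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {
        close(fds[0]);
        pid_t grandchild = fork();
        if (grandchild == 0) {
            close(fds[1]);
            pause();            /* stand-in daemon: sleep until signalled */
            _exit(0);
        }
        /* Report the grandchild's pid (or -1) and exit, orphaning it. */
        ssize_t written = write(fds[1], &grandchild, sizeof grandchild);
        _exit(written == (ssize_t)sizeof grandchild && grandchild != -1 ? 0 : 1);
    }

    close(fds[1]);
    ssize_t n = read(fds[0], daemon_pid, sizeof *daemon_pid);
    close(fds[0]);
    waitpid(child, NULL, 0);    /* reap the intermediate child */

    if (n != (ssize_t)sizeof *daemon_pid || *daemon_pid == -1)
        return -1;
    return child;
}
```

After this call, kill(child, SIGTERM) fails because that process is already gone, while kill(daemon_pid, SIGTERM) is what actually stops the stand-in daemon. A real daemon does not hand its pid back over a pipe like this, which is why options such as -f or a pid file exist.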

score 1 · Answer 4 · answered Mar 29 '09 at 15:12

Is it possible that this process is restarted (for example by init) just at the time you kill it?

You can check this easily. If the PID is the same after kill -9 PID then the process wasn't killed, but if it has changed the process has been restarted.

Georg Schölly

karim79 · Answer 5 · 2009-03-29T15:04:16.877

0

sudo killall -9 firefox

Should work

EDIT: [PID] changed to firefox

edited Mar 29 '09 at 15:04

answered Mar 29 '09 at 14:42

karim79

If the problem is to do with multiple instances, that should shut them all down, whereas kill -9 PID only kills the specified instance. – karim79 Mar 29 '09 at 14:47
Interesting. When I wrote my comments, the answer was wrong. Now the answer is correct, and my comments are obsolete. However, there is no edit history on the answer. – Jörg W Mittag Mar 29 '09 at 14:51
If it makes you happy, Jorg, I will put in an EDIT. Having a bad day, are we? – karim79 Mar 29 '09 at 14:54
This was more an observation about the site behavior. The fact that posts can be edited, is what makes SO work. However, I didn't know that one can edit posts without them being shown as edited, which is kind of strange. – Jörg W Mittag Mar 29 '09 at 14:58
I think the 'edit X mins ago' thing appears after one minute or so, which is why an edited answer isn't initially marked as edit. I have noticed that too. – karim79 Mar 29 '09 at 15:03

score 0 · Answer 6 · answered Mar 29 '09 at 14:44

Run ps -ef | grep firefox; if you see 3 processes, kill them all.

score 0 · Answer 7 · answered Mar 29 '09 at 15:09

You can also do a pstree and kill the parent. This makes sure that you get the entire offending process tree and not just the leaf.

Eric
