I had some problems with a server today, and I have now narrowed them down to this: the server cannot get rid of processes that segfault.
After a process gets a segfault, it just keeps hanging and is never killed.
A test program that should fail with the error Segmentation fault (core dumped):
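#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
    char *buf;
    /* The crash is intentional: 1<<31 overflows a 32-bit int, the
       resulting request is far too large, malloc returns NULL, and
       fgets then writes through address 0. */
    buf = malloc(1<<31);
    fgets(buf, 1024, stdin);
    printf("%s\n", buf);
    return 1;
}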
Compile it and set permissions with gcc segfault.c -o segfault && chmod +x segfault.
Running this (and pressing Enter once) on the problematic server causes it to hang. I also ran it on another server with the same kernel version (and most of the same packages); there it gets the segfault and then quits.
Here are the last few lines of output from strace ./segfault on both servers.
Bad server
"\n", 1024) = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
# It hangs here....
Working server
"\n", 1024) = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
root@server { ~ }# echo $?
139
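The exit status 139 on the working server is the shell's encoding of "killed by signal 11": 128 + SIGSEGV. As a rough sketch of how the same check could be automated, the function below forks a child that faults on address 0 and polls for its death with a timeout, so the hanging case shows up as a distinct return value instead of blocking forever. The name crash_child_status and the -2 "still alive" convention are made up for this illustration.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Fork a child that dereferences a null pointer and wait up to
 * timeout_ms for it to die. Returns the shell-style status
 * (128 + signal number) if the child was killed by a signal, its
 * exit code if it exited normally, -1 on error, or -2 if the child
 * is still alive after the timeout (the hang described here). */
int crash_child_status(int timeout_ms)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask for no core file, then fault on address 0. */
        struct rlimit no_core = { 0, 0 };
        setrlimit(RLIMIT_CORE, &no_core);
        *(volatile char *)0 = 0;
        _exit(0); /* only reached if the fault was not delivered */
    }

    for (int waited = 0; waited <= timeout_ms; waited += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            if (WIFSIGNALED(status))
                return 128 + WTERMSIG(status);
            if (WIFEXITED(status))
                return WEXITSTATUS(status);
            return -1;
        }
        if (r < 0)
            return -1;
        struct timespec ts = { 0, 10 * 1000 * 1000 }; /* sleep 10 ms */
        nanosleep(&ts, NULL);
    }
    return -2; /* child left behind, still not reaped */
}
```

Calling crash_child_status(5000) from a small main and printing the result should give 139 on a healthy Linux machine; on a machine with the problem described here, I would expect -2, with the child left behind in the same state as the hung ./segfault.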
When the process hangs (after it has segfaulted), this is how it looks.
I am not able to ^C it:
root@server { ~ }# ./segfault
^C^C^C
Entry from ps aux
root 22944 0.0 0.0 69700 444 pts/18 S+ 15:39 0:00 ./segfault
cat /proc/22944/stack
[<ffffffff81223ca8>] do_coredump+0x978/0xb10
[<ffffffff810850c7>] get_signal_to_deliver+0x1c7/0x6d0
[<ffffffff81013407>] do_signal+0x57/0x6c0
[<ffffffff81013ad9>] do_notify_resume+0x69/0xb0
[<ffffffff8160bbfc>] retint_signal+0x48/0x8c
[<ffffffffffffffff>] 0xffffffffffffffff
Another odd thing is that I am unable to attach strace to a hanging segfault process. Doing so actually gets it killed.
root@server { ~ }# strace -p 1234
Process 1234 attached
+++ killed by SIGSEGV (core dumped) +++
ulimit -c 0 is set, and ulimit -c, ulimit -H -c, and ulimit -S -c all show the value 0.
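The value that ulimit -c reports is the RLIMIT_CORE resource limit of the shell, which child processes inherit. To double-check what a process actually sees, independent of the shell builtin, something like the following sketch could be used. The function names core_limits and print_core_limits are invented for this example.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Read the soft and hard core-file size limits of the calling
 * process. Returns 0 on success, -1 if getrlimit fails. */
int core_limits(unsigned long long *soft, unsigned long long *hard)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    *soft = (unsigned long long)rl.rlim_cur;
    *hard = (unsigned long long)rl.rlim_max;
    return 0;
}

/* Print both limits in bytes. An unlimited value (RLIM_INFINITY)
 * shows up as a very large number here. */
void print_core_limits(void)
{
    unsigned long long soft, hard;
    if (core_limits(&soft, &hard) != 0) {
        perror("getrlimit");
        return;
    }
    printf("RLIMIT_CORE soft=%llu hard=%llu\n", soft, hard);
}
```

Calling print_core_limits() from main in a shell where ulimit -c 0 is in effect should print 0 for both values. The shell may scale non-zero limits into blocks rather than bytes, but 0 is 0 either way.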
- Kernel version: 3.10.0-229.14.1.el7.x86_64
- Distro version: Red Hat Enterprise Linux Server release 7.1 (Maipo)
- Running in VMware
The server is working as it should on everything else.
Update
Shutting down abrt (systemctl stop abrtd.service) fixed the problem, both for processes that were already hung after a core dump and for new processes that core-dump. Starting abrt again did not bring the problem back.
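That abrt is involved fits the do_coredump frame in the stack trace. This is an assumption on my part, not something verified on this server: abrt normally registers itself as a pipe handler in /proc/sys/kernel/core_pattern (a pattern starting with |), and my understanding is that the kernel still hands the dump to such a handler even when the core limit is 0, which is worth checking against core(5). A small sketch to see whether a pipe handler is configured follows; the function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* A core_pattern beginning with '|' means the kernel pipes the
 * dump to a user-space program instead of writing a file. */
int is_pipe_handler(const char *pattern)
{
    return pattern != NULL && pattern[0] == '|';
}

/* Read the current core_pattern into buf, without the trailing
 * newline. Returns 0 on success, -1 if it cannot be read. */
int read_core_pattern(char *buf, size_t len)
{
    FILE *f = fopen("/proc/sys/kernel/core_pattern", "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the pattern and whether it names a pipe handler. */
void report_core_pattern(void)
{
    char pattern[512];
    if (read_core_pattern(pattern, sizeof pattern) != 0) {
        perror("core_pattern");
        return;
    }
    printf("core_pattern: %s (%s)\n", pattern,
           is_pipe_handler(pattern) ? "piped to a handler" : "plain file");
}
```

Running report_core_pattern() from main before and after stopping abrtd would show whether the handler registration changes, which might help explain why restarting abrt did not bring the hang back.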
Update 2016-01-26
We ran into a problem that looked similar, but not quite the same. The initial code used to test:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
    char *buf;
    buf = malloc(1<<31);
    fgets(buf, 1024, stdin);
    printf("%s\n", buf);
    return 1;
}
was hanging. The output of cat /proc/<pid>/maps was:
00400000-00401000 r-xp 00000000 fd:00 13143328 /root/segfault
00600000-00601000 r--p 00000000 fd:00 13143328 /root/segfault
00601000-00602000 rw-p 00001000 fd:00 13143328 /root/segfault
7f6c08000000-7f6c08021000 rw-p 00000000 00:00 0
7f6c08021000-7f6c0c000000 ---p 00000000 00:00 0
7f6c0fd5b000-7f6c0ff11000 r-xp 00000000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c0ff11000-7f6c10111000 ---p 001b6000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c10111000-7f6c10115000 r--p 001b6000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c10115000-7f6c10117000 rw-p 001ba000 fd:00 14284 /usr/lib64/libc-2.17.so
7f6c10117000-7f6c1011c000 rw-p 00000000 00:00 0
7f6c1011c000-7f6c1013d000 r-xp 00000000 fd:00 14274 /usr/lib64/ld-2.17.so
7f6c10330000-7f6c10333000 rw-p 00000000 00:00 0
7f6c1033b000-7f6c1033d000 rw-p 00000000 00:00 0
7f6c1033d000-7f6c1033e000 r--p 00021000 fd:00 14274 /usr/lib64/ld-2.17.so
7f6c1033e000-7f6c1033f000 rw-p 00022000 fd:00 14274 /usr/lib64/ld-2.17.so
7f6c1033f000-7f6c10340000 rw-p 00000000 00:00 0
7ffc13b5b000-7ffc13b7c000 rw-p 00000000 00:00 0 [stack]
7ffc13bad000-7ffc13baf000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
However, the smaller C code (int main(void){*(volatile char*)0=0;}) written to trigger a segfault did segfault and did not hang...