
I have had some problems with a server today, and I have now boiled it down to this: the server is not able to get rid of processes that segfault.

After a process segfaults, it just keeps hanging instead of being killed.

Here is a test that should cause the error Segmentation fault (core dumped):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    char *buf;

    /* 1<<31 overflows a signed int (undefined behaviour; see the
       comments below). The return value is deliberately not checked:
       the point of this program is to segfault. */
    buf = malloc(1<<31);
    fgets(buf, 1024, stdin);
    printf("%s\n", buf);
    return 1;
}

Compile and set permissions with gcc segfault.c -o segfault && chmod +x segfault.

Running this (and pressing Enter once) on the problematic server causes it to hang. I also ran it on another server with the same kernel version (and mostly the same packages); there it segfaults and then quits as expected.

Here are the last few lines of output from running strace ./segfault on both servers.

Bad server

"\n", 1024)                     = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
# It hangs here....

Working server

"\n", 1024)                     = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
root@server { ~ }# echo $?
139
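
(An exit status of 139 is 128 + 11, i.e. the shell reporting that the process was terminated by signal 11, SIGSEGV.)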

When the process hangs (after it has segfaulted), this is how it looks.

I am not able to ^C it:

root@server { ~ }# ./segfault

^C^C^C

Entry from ps aux

root 22944 0.0 0.0 69700 444 pts/18 S+ 15:39 0:00 ./segfault

cat /proc/22944/stack

[<ffffffff81223ca8>] do_coredump+0x978/0xb10
[<ffffffff810850c7>] get_signal_to_deliver+0x1c7/0x6d0
[<ffffffff81013407>] do_signal+0x57/0x6c0
[<ffffffff81013ad9>] do_notify_resume+0x69/0xb0
[<ffffffff8160bbfc>] retint_signal+0x48/0x8c
[<ffffffffffffffff>] 0xffffffffffffffff

Another funny thing is that I am unable to attach strace to a hanging segfault process; doing so actually gets it killed.

root@server { ~ }# strace -p 1234
Process 1234 attached
+++ killed by SIGSEGV (core dumped) +++

ulimit -c 0 is set, and ulimit -c, ulimit -H -c, and ulimit -S -c all show the value 0.

  • Kernel version: 3.10.0-229.14.1.el7.x86_64
  • Distro-version: Red Hat Enterprise Linux Server release 7.1 (Maipo)
  • Running in VMware

The server is working as it should in every other respect.

Update: Shutting down abrt (systemctl stop abrtd.service) fixed the problem, both for processes already hung after core-dumping and for new core-dumping processes. Starting abrt up again did not bring the problem back.

Update 2016-01-26: We got a problem that looked similar, but not quite the same: the initial segfault.c shown above was hanging again. The output of cat /proc/<pid>/maps was:

00400000-00401000 r-xp 00000000 fd:00 13143328                           /root/segfault
00600000-00601000 r--p 00000000 fd:00 13143328                           /root/segfault
00601000-00602000 rw-p 00001000 fd:00 13143328                           /root/segfault
7f6c08000000-7f6c08021000 rw-p 00000000 00:00 0
7f6c08021000-7f6c0c000000 ---p 00000000 00:00 0
7f6c0fd5b000-7f6c0ff11000 r-xp 00000000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c0ff11000-7f6c10111000 ---p 001b6000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c10111000-7f6c10115000 r--p 001b6000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c10115000-7f6c10117000 rw-p 001ba000 fd:00 14284                      /usr/lib64/libc-2.17.so
7f6c10117000-7f6c1011c000 rw-p 00000000 00:00 0
7f6c1011c000-7f6c1013d000 r-xp 00000000 fd:00 14274                      /usr/lib64/ld-2.17.so
7f6c10330000-7f6c10333000 rw-p 00000000 00:00 0
7f6c1033b000-7f6c1033d000 rw-p 00000000 00:00 0
7f6c1033d000-7f6c1033e000 r--p 00021000 fd:00 14274                      /usr/lib64/ld-2.17.so
7f6c1033e000-7f6c1033f000 rw-p 00022000 fd:00 14274                      /usr/lib64/ld-2.17.so
7f6c1033f000-7f6c10340000 rw-p 00000000 00:00 0
7ffc13b5b000-7ffc13b7c000 rw-p 00000000 00:00 0                          [stack]
7ffc13bad000-7ffc13baf000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

However, the smaller C snippet used to trigger a segfault (int main(void){*(volatile char*)0=0;}) did cause a segfault and did not hang...
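
For reference, here is that snippet as a complete program (a sketch based on the suggestion in the comments; on Linux, a write through a null pointer like this reliably raises SIGSEGV, and the resulting process has a tiny virtual size):

/* tiny-segfault.c - raise SIGSEGV without any oversized allocation */
int main(void)
{
    *(volatile char *)0 = 0;   /* invalid write to address 0 */
    return 0;
}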

xeor
  • Check the return value of `malloc`. 100% sure it's NULL. But `malloc(1<<31);` may actually work if there is >2Gb of free memory available. – Jabberwocky Nov 12 '15 at 14:20
  • @Michael Walz: That's the idea. It's meant to cause a segfault. – Jordan Melo Nov 12 '15 at 14:24
  • Can you give me an example? I don't know much about c/c++ programming. The example program given is supposed to cause a segfault. – xeor Nov 12 '15 at 14:24
  • @JordanMelo, but I think it _may_ work. If you want to be sure to get a segfault, dereferencing a NULL pointer is better. This will segfault all the time. – Jabberwocky Nov 12 '15 at 14:25
  • @MichaelWalz: The question clearly demonstrates strace saying the program segfaulted so I think that's beside the point. – Jordan Melo Nov 12 '15 at 14:31
  • What do you mean by "just hanging"? Does it still run as a normal process? I mean, for example, what does top show about the process? – terence hill Nov 12 '15 at 14:37
  • @terencehill I updated the question with some output of the process that hangs after segfaulting. – xeor Nov 12 '15 at 14:47
  • Sounds like the machine gets stuck (or just takes a very long time) while creating the core dump. If you turn off core dumps (`ulimit -c 0` on bash), does anything change? – Nate Eldredge Nov 12 '15 at 15:02
  • You may have the abrtd daemon running. It collects core dumps and reports them. Stop it and disable it. – meuh Nov 12 '15 at 15:17
  • Note: (1<<31) is approx. 4 gig, or minus 2 gig, not 2 gig – user3629249 Nov 12 '15 at 16:13
  • @NateEldredge thanks for the tip, I will try it out when I get back to work tomorrow. – xeor Nov 12 '15 at 16:15
  • @meuh funny you should mention abrtd. That was the daemon that triggered me to dive into this. I wasn't able to log in as root; the terminal was hanging > it was abrt-status's fault, it was hanging > abrt-status was waiting for a lock to be released > the lock was held by an abrt report > sosreport generation > step 70/74 (or so), `lsusb -t` > `lsusb -t` segfaulted > the segfault itself was hanging... But yeah, I will try to turn it off tomorrow and see what happens – xeor Nov 12 '15 at 16:15
  • On my Ubuntu Linux 14.04, on a 64-bit CPU with 8 GB of RAM, the program works fine – user3629249 Nov 12 '15 at 16:17
  • @user3629249, Michael said in a comment above that it could work. But as long as it segfaults in my test, it does its job. If you can provide a better segfault.c, that would be nice. – xeor Nov 12 '15 at 16:21
  • @xeor: The usual thing is `int main(void) { *(volatile char *)0 = 42; return 0; }` – Nate Eldredge Nov 12 '15 at 16:27
  • @NateEldredge I set and verified that `ulimit -c` is `0`. It did nothing :( – xeor Nov 13 '15 at 08:15
  • @meuh `systemctl stop abrtd.service` made segfaults work again! New ones, and the ones that were already hanging. But abrt was what caused this in the very beginning. So this is probably a bad kernel-level bug in abrt? – xeor Nov 13 '15 at 08:21
  • In case you are unaware - `1 << 31` causes undefined behaviour, meaning that from a C perspective, anything can happen, including bizarre process states being triggered. Although it seems your angle is more that the operating system has an issue if a user process should be able to get into this state at all – M.M Nov 13 '15 at 08:25
  • `malloc(INT_MIN)` would attempt the same thing without causing UB – M.M Nov 13 '15 at 08:26
  • @M.M, thanks for the clarification. I am not able to reproduce this anymore after I stopped `abrt`, and that seems to have fixed it. But I will test with `echo 'int main(void){*(volatile char*)0=0;}' > segfault.c && gcc segfault.c -o segfault && chmod +x segfault && ./segfault` next time if that is better. – xeor Nov 13 '15 at 08:30
  • _"A test that should cause the error Segmentation fault (core dumped)."_ No test "should" ever cause a segfault, whether it has UB or not. Period. – Lightness Races in Orbit Jan 26 '16 at 12:29
  • @LightnessRacesinOrbit how would you be able to test why a segfault is hanging without causing one? Did you even read what the problem is about? – xeor Jan 26 '16 at 22:24

1 Answer


WARNING - this answer contains a number of suppositions based on the incomplete information to hand. Hopefully it is still useful though!

Why does the segfault appear to hang?

As the stack trace shows, the kernel is busy creating a core dump of the crashed process.

But why does this take so long? A likely explanation is that the method you are using to create the segfaults is resulting in the process having a massive virtual address space.

As pointed out in the comments by M.M., the result of the expression 1<<31 is undefined by the C standard, so it is difficult to say what actual value is being passed to malloc, but based on the subsequent behavior I am assuming it is a large number.

Note that for malloc to succeed it is not necessary for you to actually have this much RAM in your system. The kernel will expand the virtual size of your process, but physical RAM is only allocated when your program actually accesses it.
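
To illustrate (a minimal sketch, assuming Linux's default heuristic overcommit; the 4 GiB size and the variable names are mine, not from the original test):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Ask for 4 GiB via size_t arithmetic, with no signed-overflow UB. */
    size_t big = (size_t)4 * 1024 * 1024 * 1024;
    char *buf = malloc(big);

    if (buf == NULL) {
        puts("malloc failed");
        return 1;
    }

    /* The process's virtual size is now huge, but physical pages are
       only allocated as they are touched: */
    memset(buf, 0, 4096);   /* touches only the first page */

    free(buf);
    return 0;
}

While such a process runs, /proc/<pid>/status would show a large VmSize but a small VmRSS until the buffer is actually touched.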

I believe the call to malloc succeeds, or at least returns, because you state that it segfaults after you press enter, so after the call to fgets.

In any case, the segfault is leading the kernel to perform a core dump. If the process has a large virtual size, that could take a long time, especially if the kernel decides to dump all pages, even those that have never been touched by the process. I am not sure if it will do that, but if it did, and if there was not enough RAM in the system, it would have to begin swapping pages in and out of memory in order to write them to the core dump. This would generate a high I/O load, which could make the process appear unresponsive (and would degrade overall system performance).

You may be able to verify some of this by looking in the abrtd dump directory (possibly /var/tmp/abrt, or check /etc/abrt/abrt.conf) where you may find the core dumps (or perhaps partial core dumps) that have been created.

If you are able to reproduce the behavior, then you can check:

  • /proc/[pid]/maps to see the address space map of the process and see if it really is large
  • Use a tool like vmstat to see if the system is swapping, the amount of I/O going on, and how much I/O wait is being experienced (see the example below)
  • If you had sar running then you may be able to see similar information even for the period prior to restarting abrtd.
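
For example (a sketch; the pid 22944 is taken from the ps output above, and the exact fields vary between versions):

cat /proc/22944/maps   # look for very large anonymous mappings
vmstat 1 5             # watch the si/so (swap-in/out) and wa (I/O wait) columns
sar -B                 # historical paging statistics, if sysstat collection is enabled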

Why is a core dump created, even though ulimit -c is 0?

According to this bug report, abrtd will trigger collection of a core dump regardless of ulimit settings.
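
The mechanism behind this is the kernel's core_pattern setting: abrt installs a pipe handler, so the kernel hands the dump data to abrt's helper process rather than writing a core file itself, and the helper can ignore the ulimit. On a RHEL 7 system with abrt active it typically looks something like this (the exact helper arguments may differ between versions):

root@server { ~ }# cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e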

Why did this not start happening again when abrtd was started up once more?

There are a couple of possible explanations for that. For one thing, it would depend on the amount of free RAM in the system. It might be that a single core dump of a large process would not take that long, and not be perceived as hanging, if there is enough free RAM and the system is not pushed to swap.

If in your initial experiments you had several processes in this state, then the symptoms would be far worse than is the case when just getting a single process to misbehave.

Another possibility is that the configuration of abrtd had been altered but the service not yet reloaded, so that when you restarted it, it began using the new configuration, perhaps changing its behavior.

It is also possible that a yum update had updated abrtd, but not restarted it, so that when you restarted it, the new version was running.

harmic
  • Thanks for a good reply. The `lsusb -t` program caused a segfault as well, and it had the same behavior (as did the tiny .c snippet in the comments). It hung for days, until I killed it. There have been no config changes to the abrt files; I verified that now. I will check your suggestions if this happens again. That bug report looks old; I'm pretty sure that's not part of the problem here... – xeor Nov 16 '15 at 11:02