
My application program crashes with EXIT CODE: 9 (SIGKILL)

I never ran any command such as `kill -9 <pid>` or `pkill <process name>` that could kill the running process.

Where should I start debugging in this case?

  1. I tried to dump the stack trace when the program crashes, but I found that SIGKILL cannot be caught for error handling.

  2. The program uses MPI and runs in cluster environments. It dies after about an hour of running.

Are there any COMMON causes that can trigger a SIGKILL?

(It's running on Linux; CentOS 7)

syko
  • Can you post a stack trace? – Adam Hunyadi Nov 30 '16 at 12:17
  • Run `strace yourprogram` from a shell prompt. This will produce a _tremendous_ volume of output; ignore all but the last 50 or so lines. If you have no idea what the output means, post those last 50 lines here, _unedited_. (They won't fit into the comments. Use the "edit" link under the tags to edit the text of your question.) – zwol Nov 30 '16 at 12:18
  • Run the program in a debugger to catch the signal. – Some programmer dude Nov 30 '16 at 12:18
  • @AdamHunyadi It's impossible to catch SIGKILL exception to dump the stack trace... – syko Nov 30 '16 at 12:19
  • Go check with valgrind; I think you may have some surprises. – Mathieu Van Nevel Nov 30 '16 at 12:19
  • By the way, are you on a Linux system? Is the system running low on free memory? – Some programmer dude Nov 30 '16 at 12:19
  • As Mathieu Van Nevel mentions, try `valgrind 'yourprogram'` if it's available. Are you in control of the network you're running in, or could there be an automatic cleanup for hanging runs (i.e. someone kills you)? – kabanus Nov 30 '16 at 12:20
  • Well, then I'd first consider printing out checks to find where the crash occurs; then, if I can't find what is going on, I'd use valgrind or some other memory checker to find out. – Adam Hunyadi Nov 30 '16 at 12:20
  • So, you guys are suspecting some kind of memory problem here? – syko Nov 30 '16 at 12:21
  • @syko: it is a possibility, as the OOM killer sends a SIGKILL; see [this answer](http://stackoverflow.com/a/7181152/104774). [This answer](http://stackoverflow.com/a/624868/104774) suggests that you can check logs to see if this is the case. – stefaanv Nov 30 '16 at 12:24
  • @stefaanv thanks, I will take a look at that problem, OOM. – syko Nov 30 '16 at 12:27
  • Some iterations of Unix will also generate SIGKILL if the process blocks or ignores synchronous fatal signals (SIGSEGV, SIGILL, etc) and then does something that would cause one of them to be generated. `strace` output would tell us which one it is. (The Linux box I'm typing this on does not do this, but CentOS 7 may be different.) – zwol Nov 30 '16 at 12:36
  • @stefaanv thanks, the problem was OOM. – syko Dec 01 '16 at 01:33

1 Answer

I'm answering my own question so that it can help someone later.

The crash was caused by an out-of-memory (OOM) condition.

The process allocated too much memory, putting pressure on the OS. The OS has a hit man, the oom-killer, that kills such processes for the sake of system stability. The oom-killer uses bullets called SIGKILL.
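One way to confirm that memory is the culprit is to log the process's resident set size while it runs and watch it grow. Here is a minimal sketch in Python, assuming a Linux system where `/proc/self/status` has a `VmRSS` line and `ru_maxrss` is reported in kilobytes. The function names are made up for illustration; a C or C++ MPI program could read the same file.

```python
import resource


def peak_rss_kib():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def current_rss_kib(status_path="/proc/self/status"):
    """Current resident set size from the VmRSS line, or None if unavailable."""
    try:
        with open(status_path) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    # The line looks like "VmRSS:     1234 kB".
                    return int(line.split()[1])
    except OSError:
        return None
    return None


if __name__ == "__main__":
    print("current RSS (KiB):", current_rss_kib())
    print("peak RSS (KiB):", peak_rss_kib())
```

Calling something like this every few minutes from each rank and writing the numbers to a log should show whether usage climbs steadily before the process disappears.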

However, since SIGKILL is invisible to the application (it cannot be caught or handled), it is not always easy for newbies like me to figure out the true reason for the crash.
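Although the process itself cannot react to SIGKILL, whatever launched it can still see how it died. The sketch below, in Python, assumes the program is started through a small wrapper; `describe_exit` and `run_and_report` are hypothetical helper names. `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # A negative value means the child was terminated by that signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_report(argv):
    """Run a command, wait for it, and describe how it ended."""
    return describe_exit(subprocess.run(argv).returncode)


if __name__ == "__main__":
    # Stand-in for the real program: a child that sends SIGKILL to itself.
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_report(child))
```

A shell reports the same event as exit status 137 (128 + 9), so seeing 137 or signal 9 without having sent it yourself is a hint to look for an external killer such as the oom-killer.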

The good news is that when the hit man kills your process, it normally logs its action in /var/log/messages.

Depending on your OS configuration, however, the oom-killer's messages might not end up in that file at all. In that case you can configure logging yourself; search for rsyslog configuration.
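If the log file is large, a short script can pull out the relevant lines. This is only a sketch in Python: the regular expression assumes the kernel writes lines containing `Kill process <pid> (<name>)` or `Killed process <pid> (<name>)`, which is what I have seen but which may differ between kernel versions, and `find_oom_kills` and `scan_log` are names I made up. Reading /var/log/messages usually requires root.

```python
import re
import sys

# Assumed message shape; adjust if your kernel words it differently.
OOM_PATTERN = re.compile(
    r"(?:Out of memory: )?Kill(?:ed)? process (\d+) \(([^)]+)\)"
)


def find_oom_kills(lines):
    """Return (pid, process_name, line) for every line that looks like an OOM kill."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2), line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/messages"):
    """Scan a syslog-style file for OOM-killer entries."""
    with open(path, errors="replace") as f:
        return find_oom_kills(f)


if __name__ == "__main__":
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    for pid, name, line in scan_log(log_path):
        print(pid, name, line, sep="\t")
```

The same pattern can be applied to the output of `dmesg` if the messages never reach the log file.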

Finding which process was killed by Linux OOM killer

syko
  • I got this error on Jenkins, and it turned out the build was indeed killed due to OOM. Thanks! – Mikhail May 31 '21 at 16:22