
I have a python2.7 daemon process using a module from http://www.jejik.com/files/examples/daemon.py

The process is a heavy one with about 40 GB of RAM usage and 9 child threads. The server runs RHEL 6.3 with 192 GB of RAM and plenty of CPU power.

After starting, the process runs for around 3-7 hours, but then it gets killed by something, possibly the kernel. However, I could not find any hints in dmesg or in the kernel log (which I had manually enabled); there is nothing there. When not started as a daemon, I just get the message "Killed" in the terminal.

The following precautions have been taken:

  • resetting the OOM score in /proc/<pid>/oom_score_adj so that the OOM killer does not pick the process when the system runs short of resources
  • increasing all rlimits (that can be increased) to maximum
  • setting the process nice/priority higher (prio -15); a sketch of these adjustments follows the list
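
For reference, a rough sketch of how these adjustments can be made from within a Python process (the OOM score value, the chosen rlimits and the priority are only illustrative, not necessarily the exact ones used):

    import os
    import resource

    # Make the OOM killer less likely to pick this process (needs privileges).
    with open('/proc/%d/oom_score_adj' % os.getpid(), 'w') as f:
        f.write('-1000')

    # Raise the soft limit of each adjustable rlimit up to its hard limit.
    for lim in (resource.RLIMIT_AS, resource.RLIMIT_DATA, resource.RLIMIT_NOFILE):
        soft, hard = resource.getrlimit(lim)
        resource.setrlimit(lim, (hard, hard))

    # Lower the nice value by 15, i.e. raise the priority (needs root).
    os.nice(-15)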

This problem already existed before applying these precautions, so they are not responsible for the killing.

I also have a mechanism to catch all exceptions, STDERR and STDOUT, and to log everything into a rotated log file. But there was nothing interesting just before the process died.

Modules used within the process include, among others: oracle_cx, ibm_db, suds, wsgi_utils. But all of them always write logs when errors occur.

Does anyone know how to trace back the killing? Who did it, and why?

Thank you in advance

  • Have you enabled coredump to see if you get a dump on kill? – Mansour May 06 '13 at 08:59
  • No I have not. The problem is, we use an SSD with about 128 GB, and there is not enough space left for a core dump when the process uses 40 GB of RAM. – Kurnia Hendrawan May 06 '13 at 09:08
  • Given your big RAM and assuming your process doesn't momentarily fill the whole thing up (which causes the kernel to kill it), you could store the core dump in RAM (via a tmpfs). – Mansour May 06 '13 at 09:16
  • possible duplicate of [Who "Killed" my process and why?](http://stackoverflow.com/questions/726690/who-killed-my-process-and-why) – Aaron Digulla May 06 '13 at 15:48

2 Answers


To see who was logged in at the time the process was killed, use the command `last`.

If no one was logged in at the time, the process was killed by some signal.

Since this is Python, the easiest way to find out what killed the process is to write a signal handler for all signals and log them. See here for how to write a signal handler, and see this question for how to catch them all.
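
A minimal sketch of such a catch-all handler in Python 2.7 (the log destination is just a placeholder; install it in the main thread before any worker threads start):

    import logging
    import signal

    logging.basicConfig(filename='/tmp/signals.log', level=logging.INFO)

    def log_signal(signum, frame):
        # Record which signal arrived; `frame` shows where the process was.
        logging.info("received signal %d", signum)

    # Install the handler for every signal that can be caught.
    for name in dir(signal):
        if name.startswith('SIG') and not name.startswith('SIG_'):
            try:
                signal.signal(getattr(signal, name), log_signal)
            except (ValueError, RuntimeError):
                # SIGKILL and SIGSTOP cannot be caught; skip them.
                pass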

If you get a core dump, attach an external hard disk with enough space. Or limit the size of the core to 1 GB using `ulimit`; that might be enough to see where it crashed.
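
Since the process is started through a daemon module rather than from an interactive shell, the limit can also be set from inside Python before daemonizing (a sketch; it assumes the hard limit permits a 1 GB core):

    import resource

    # Allow core dumps, but cap them at roughly 1 GB.
    ONE_GB = 1024 * 1024 * 1024
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (ONE_GB, hard))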

Alternatively, start the process under a debugger like gdb; it will make sure you get a prompt when one of the "core dump" signals is sent to the process.

Aaron Digulla
  • No human killed the process, I'm pretty sure about that. One can't modify the SIGKILL handler. – Kurnia Hendrawan May 06 '13 at 09:21
  • If you see "killed" in the terminal, this doesn't mean it was killed by `SIGKILL`. The message is printed for any signal that isn't handled. And "pretty sure" doesn't cut it :-) – Aaron Digulla May 06 '13 at 09:34
  • The daemon already handles SIGTERM, SIGABRT, SIGINT and SIGSTOP. Other signals would yield a different message than 'killed'. I'm now trying to get a core dump using a ramdisk, as suggested by Mansour. – Kurnia Hendrawan May 06 '13 at 09:40
  • I have seen "Killed" for a process that didn't handle SIGCHLD. Also look for overzealous monitor processes that try to protect the system. – Aaron Digulla May 06 '13 at 09:55
  • Could you name some such monitor processes? I did think of that, but I don't know any. Btw, I also use monit to restart the process if it dies, but it stopped monitoring when the process died; that is another problem, though. – Kurnia Hendrawan May 06 '13 at 09:59
  • Typo above: SIGALRM instead of SIGSTOP – Kurnia Hendrawan May 06 '13 at 10:05
  • sysadmins sometimes install tools; I don't have a name handy. Who monitors the machine? They should know :-) Also: Have you looked into `/var/log/kern.log`? – Aaron Digulla May 06 '13 at 15:47
  • And check `/etc/security/limits.conf` as well – Aaron Digulla May 06 '13 at 15:50

I think I found the root cause, and it might be a bug in Python 2.7. After catching all catchable signals and ignoring them, I could track some more error messages and got a hint of a socket.error. The thing is, such an error first triggers SIGTERM (thus trying to kill the process) and only then writes to STDERR. My mechanism to catch all STDOUT and STDERR could not log the messages because the main process had already been killed. This is a problem in the daemon anyway. These were the last lines in the log just before the process terminated:

2013-05-07 11:05:53,194 - STDERR - ERROR - Traceback (most recent call last):
2013-05-07 11:05:53,304 - STDERR - ERROR -   File "/var/lib/netcam_epd/lib/python2.7/SocketServer.py", line 582, in process_request_thread
2013-05-07 11:05:53,415 - STDERR - ERROR -     self.finish_request(request, client_address)
2013-05-07 11:05:53,489 - STDERR - ERROR -   File "/var/lib/netcam_epd/lib/python2.7/SocketServer.py", line 323, in finish_request
2013-05-07 11:05:53,587 - STDERR - ERROR -     self.RequestHandlerClass(request, client_address, self)
2013-05-07 11:05:53,684 - STDERR - ERROR -   File "/var/lib/netcam_epd/lib/python2.7/SocketServer.py", line 640, in __init__
2013-05-07 11:05:53,835 - STDERR - ERROR -     self.finish()
2013-05-07 11:05:53,887 - STDERR - ERROR -   File "/var/lib/netcam_epd/lib/python2.7/SocketServer.py", line 693, in finish
2013-05-07 11:05:54,084 - STDERR - ERROR -     self.wfile.flush()
2013-05-07 11:05:54,182 - STDERR - ERROR -   File "/var/lib/netcam_epd/lib/python2.7/socket.py", line 303, in flush
2013-05-07 11:05:54,326 - STDERR - ERROR -     self._sock.sendall(view[write_offset:write_offset+buffer_size])
2013-05-07 11:05:54,387 - STDERR - ERROR - error: [Errno 32] Broken pipe

Apparently, this was caused by trying to write to an unwritable socket. I think the library should handle this better, with an appropriate return value, instead of just throwing an error / exception, because a socket can be closed at any time in a normal run.
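
A rough workaround sketch (the handler class name is hypothetical) would be to swallow the broken pipe in the handler's finish() instead of letting it escape:

    import errno
    import socket
    import SocketServer  # the Python 2.7 module from the traceback above


    class QuietHandler(SocketServer.StreamRequestHandler):
        def finish(self):
            try:
                # The default finish() flushes self.wfile, which is where
                # the "Broken pipe" in the traceback above is raised.
                SocketServer.StreamRequestHandler.finish(self)
            except socket.error as e:
                if e.errno != errno.EPIPE:
                    raise  # only ignore a peer that closed the connection early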

I'll verify whether this is actually the root cause.