1

I have a custom logging process that is reading from STDIN and sending the data out via TCP to a scribed logging server. STDIN is in my case an access log that is attached to Apache httpd 2.2 like this in httpd.conf: CustomLog "|/usr/local/bin/serelog" default

My serelog process sometimes goes into uninterruptable sleep under FreeBSD 9.0 and does not return from it. It works reliably under other operating systems though, including FreeBSD 8, Linux 2.6 and Linux 3.1.

How can I find out what could be the reason for the uninterruptable sleep?

The overall structure is like this: httpd --[PIPE]--> serelog --[TCP-CONNECTION]--> scribed

Until now I did the following analysis:

  • Using ps: stat is "D" and wchan is "-". So there is apparently no syscall, which doesn't make too much sense to me, as the process in uninterruptable sleep and should be in kernel land.
  • As the process is in state "D", the process does not react to kill -9 as expected.
  • Attaching truss to serelog externally from a shell: As long as truss is attached, serelog runs smoothly. Shortly (Seconds) after detaching truss from serelog, serelog goes into "D" state.
  • When attaching truss to serelog AFTER it has entered "D" state, truss prints nothing
  • In "D" state, lsof shows that the incoming PIPE is full. This is exected, as in "D" state the process "sleeps" and cannot read any longer. The outgoing TCP-CONNECTION is empty.
  • If I kill the "surrounding" Apache httpd server, the serelog process eventually terminates after (e.g.) 40 minutes.
  • Checking what others report in forums about the uninterruptable problem was not successful: In my setup there is no NFS. And as it is a server there is also no user interaction with CD drives or pluggable hardware.

So I am now stuck with a process that is uninterruptable, is apparently not in a syscall, and works reliably when traced. The only good thing is that I am able to reproduce the behavior in a few seconds or minutes when I send a lot of HTTP requests via JMeter loadtest (5 threads in JMeter).

Any tips on debugging, kernel parameter tuning are appreciated.

Greetings

Kiran
  • 20,167
  • 11
  • 67
  • 99
Christian Esken
  • 472
  • 1
  • 4
  • 11
  • What is the syscall before it hangs? – janm Mar 21 '12 at 13:57
  • I do not know the syscall. That exactly is the point. truss prints nothing when attaching AFTER the problem occurs. When attaching truss BEFORE the problem occurs, the problem will simply not occur. And ps shows "-" as wchan. – Christian Esken Mar 21 '12 at 14:19
  • Could be a kernel bug or regression. Have you filed a PR? – Roland Smith Mar 26 '12 at 22:06
  • @Roland Smith: Yes, in the meantime I have filed a PR on the FreeBSD bugs list: http://www.freebsd.org/cgi/query-pr.cgi?pr=166340 . I will update my question once there is a solution or substantial information. – Christian Esken Mar 29 '12 at 11:10

1 Answers1

1

The issue has proven to be an actual FreeBSD Kernel bug, and is now fixed in the Kernel.

Link to the PR: http://www.freebsd.org/cgi/query-pr.cgi?pr=166340

Proposed Patch: http://lists.freebsd.org/pipermail/freebsd-bugs/2012-May/048610.html

Christian Esken
  • 472
  • 1
  • 4
  • 11