I am using Python 2.7's SocketServer with ForkingMixIn, and it generally works well.

However, under heavy usage (many clients rapidly connecting and disconnecting) the server sometimes gets stuck, consuming all available CPU (top shows 100%). Running strace on the process shows an endless sequence of waitpid() syscalls. According to ps, however, there are no child processes at this point.

After this, my server becomes unusable and only a restart helps :( Clients can still connect but get no answer; I guess the connections just sit in the OS-side backlog queue and the Python code never accepts them.

It can be easily reproduced, e.g. with a primitive HTTP implementation and a browser (I used Chrome) with CTRL-R (reload) held down for about 10 seconds. Of course the problem also occurs in normal usage without this "brutal" test, just more rarely, and it was quite hard to even figure out what the problem might be. I wrote my own implementation of something like SocketServer using os.fork() and the socket functions, and it does not have this problem, but I would be happier with a ready-made, standard solution.

This is a serious problem, because my server script can be DoS'ed very easily this way.

What I have noticed: I installed a signal handler for SIGCHLD. If I remove it, I can't reproduce the problem, but then I see zombie processes (I guess because they are never wait()'ed for). Even if I set the handler to signal.SIG_IGN, I still experience this problem.

Can anybody tell me what the problem might be and how I can solve it? I'd like to use a signal handler anyway, since leaving many zombie processes around is not nice either, especially after a long run.

Thanks for any ideas.

LGB

1 Answer

Maybe related: What is the cost of many TIME_WAIT on the server side?

It is possible that all of your available connections are in the TIME_WAIT state.

  • check sysctl net.core.somaxconn for the maximum listen backlog (pending connections waiting to be accepted).
  • check sysctl net.ipv4 for other configuration details (e.g. tw
  • check ulimit -n for max open file descriptors (sockets included)
  • you can try: sysctl net.ipv4.tcp_tw_reuse=1 to quickly reuse those sockets (don't keep it enabled unless you know what you're doing.)
  • check for file handle leaks.
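The checks in the list above can also be done from inside the server process. The following is only a rough sketch, assuming Linux (sysctl values are read from /proc/sys); the helper names fd_limits and read_sysctl are made up for this example and are not part of any library.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors,
    i.e. what `ulimit -n` reports for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def read_sysctl(name):
    """Read a sysctl value such as 'net.core.somaxconn' from /proc/sys.

    Hypothetical helper: assumes Linux. Returns None if the key cannot
    be read (non-Linux system, unknown key, missing permission).
    """
    path = "/proc/sys/" + name.replace(".", "/")
    try:
        with open(path) as handle:
            return handle.read().strip()
    except (IOError, OSError):
        return None


if __name__ == "__main__":
    print("open file limit (soft, hard): %r" % (fd_limits(),))
    for key in ("net.core.somaxconn", "net.ipv4.tcp_tw_reuse"):
        print("%s = %s" % (key, read_sysctl(key)))
```

Logging these values once at startup makes it easier to rule out an OS-level limit later.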

[not-so] stupid question: how is your SocketServer implementation different from the standard one + ForkingMixIn?

However, it is really easy to abuse a ForkingMixIn (fork bomb), so you might want to use green threads instead, e.g. the eventlet library ( http://eventlet.net/doc/index.html )

this might be your problem.

dnozay
  • According to netstat, I had only 39 TIME_WAIT connections in my last test that made the server get stuck (within a few seconds there are zero of them, but the server is still unusable). Also, I know that forking can be expensive, but when the server is stuck there is not even a single child process running, only the server itself in an endless loop. What I see with strace (this repeats without end): waitpid(0, 0xbfa441fc, 0) = -1 ECHILD (No child processes). Green threads may be great, but I can't see how forking causes the problem here, as there is not even a single child and the server is already unusable. – LGB Oct 11 '12 at 08:10
  • My "SocketServer" implementation just accepts a connection on the socket and does an os.fork(), not much more. I can't tell how it differs from SocketServer since I don't know exactly how that is implemented :) – LGB Oct 11 '12 at 08:12
  • I also strace'd the server during the test: at the beginning I see waitpid(5587, 0xbff4072c, WNOHANG) and similar calls; once the problem occurs, however, the "pid" parameter of waitpid is always zero, which was never the case before the problem. I guess it's a SocketServer implementation bug? With no child processes, not many TIME_WAIT connections, etc., I can't see an OS-level limit or problem that would cause this ... – LGB Oct 11 '12 at 08:17
  • please try to attach a gdb session: http://wiki.python.org/moin/DebuggingWithGdb. – dnozay Oct 11 '12 at 09:15
  • if pid=0 for the waitpid call, it will wait for any children in the same process group ( http://linux.die.net/man/2/waitpid ). – dnozay Oct 11 '12 at 09:22
  • I know what it means, I just don't know where (and why) it is used in Python's SocketServer implementation, and why it causes an endless loop. When I wrote a TCP server daemon in C, I also used this syscall, but if it returns with errno ECHILD, the code should not loop endlessly (but treat it as the "no more child processes to wait for" event and, for example, exit the signal handler), so I guess it's a SocketServer implementation bug. – LGB Oct 11 '12 at 10:00
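To make the behaviour discussed in these comments concrete, here is a minimal sketch of a fork-per-connection step plus a reaping loop that treats ECHILD as "no more children to wait for" and stops instead of retrying. It is not SocketServer's actual code: the names fork_per_connection, reap_children and active_children are hypothetical, and it uses waitpid(-1, WNOHANG) (any child, non-blocking) rather than the waitpid(0, ...) call seen in the strace output.

```python
import errno
import os
import socket


def reap_children(active_children):
    """Collect exited children without blocking; return how many were reaped."""
    reaped = 0
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except OSError as exc:
            if exc.errno == errno.ECHILD:
                # No children left at all (e.g. something else already
                # reaped them): drop stale bookkeeping and stop looping.
                active_children.clear()
                break
            if exc.errno == errno.EINTR:
                continue  # interrupted by a signal, retry
            raise
        if pid == 0:
            break  # children exist, but none has exited yet
        active_children.discard(pid)
        reaped += 1
    return reaped


def fork_per_connection(listener, handler, active_children):
    """Accept one connection and serve it in a forked child process."""
    conn, _addr = listener.accept()
    pid = os.fork()
    if pid == 0:
        # Child: serve the client, then exit without returning to the caller.
        try:
            listener.close()
            handler(conn)
        finally:
            conn.close()
            os._exit(0)
    # Parent: the child owns the connection now.
    conn.close()
    active_children.add(pid)
    return pid


if __name__ == "__main__":
    children = set()
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 8000))  # port chosen arbitrarily for the sketch
    server.listen(5)
    while True:
        fork_per_connection(server, lambda c: c.sendall(b"hello\n"), children)
        reap_children(children)
```

Because the loop exits on ECHILD, it cannot spin even when the children were already collected elsewhere, for example by a SIGCHLD handler or because SIGCHLD is set to SIG_IGN.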