I have a cluster with 4 nodes and a master server. The master dispatches jobs that take anywhere from 30 seconds to 15 minutes to complete.
The nodes listen with a SocketServer.TCPServer; from the master I open a connection to each node and wait for its job to finish.
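For context, each node handler is roughly shaped like this (JobHandler, do_the_work and the port number are placeholders, not my actual code):

    import SocketServer

    class JobHandler(SocketServer.BaseRequestHandler):
        def handle(self):
            # read the job payload sent by the master
            data = self.request.recv(4096)
            # do_the_work is a placeholder; the real job takes 30 s to 15 min
            result = do_the_work(data)
            # reply to the master once the job is done
            self.request.sendall(result)

    server = SocketServer.TCPServer(("0.0.0.0", 9000), JobHandler)
    server.serve_forever()

On the master side, I dispatch the jobs through a pool: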
    import multiprocessing

    def run(nodes, args):
        # one pool worker per node; each worker blocks on its node's socket
        pool = multiprocessing.Pool(len(nodes))
        return pool.map(load_job, zip(nodes, args))
The load_job function sends the data with socket.sendall and, right after that, blocks on socket.recv waiting for the result (which can take a long time to arrive).
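Simplified, load_job does roughly this (the port number and serialize() stand in for what my real code does):

    import socket

    def load_job(node_and_args):
        node, job_args = node_and_args
        sock = socket.create_connection((node, 9000))
        try:
            # push the job to the node
            sock.sendall(serialize(job_args))
            # block until the node answers (possibly 15 minutes later);
            # this is the recv() that starts returning '' when things break
            return sock.recv(1024 * 1024)
        finally:
            sock.close()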
The program runs fine until about 200 or 300 of these jobs have run. When it breaks, socket.recv starts returning an empty string, and no more jobs can run until I kill the node processes and start them again.
How should I wait for the data to arrive? Also, error handling in the pool is very poor: it captures the exception in the worker process and re-raises it in the master without the original traceback, and this error is not easy to reproduce...
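The only workaround I can think of is wrapping the worker so it ships the formatted traceback back itself, something like this (load_job_safe is just a made-up name):

    import traceback

    def load_job_safe(node_and_args):
        # multiprocessing.Pool only re-raises the exception in the master,
        # without the worker's stack, so format the traceback here instead
        try:
            return ("ok", load_job(node_and_args))
        except Exception:
            return ("error", traceback.format_exc())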
EDIT: Now I think this problem has nothing to do with sockets:
After some research, it looks like my nodes are opening way too many processes (because they also run their jobs in a multiprocessing.Pool) and somehow those processes are never cleaned up!
I found these SO questions (here and here) about zombie processes when using multiprocessing in a daemonized process (exactly my case!).
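If the leaked worker processes really are the cause, I suppose the fix on the node side is just to make sure every pool gets closed and joined (run_one_job and jobs are placeholders for the node's actual work):

    import multiprocessing

    pool = multiprocessing.Pool()
    try:
        results = pool.map(run_one_job, jobs)
    finally:
        pool.close()   # stop accepting new work
        pool.join()    # reap the worker processes so they don't linger as zombies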
I'll need to dig into the problem further, but for now I'm killing the nodes and restarting them after some time.