I have a cluster with 4 nodes and a master server. The master dispatches jobs that take anywhere from 30 seconds to 15 minutes to complete.
The nodes listen with a SocketServer.TCPServer; from the master I open a connection to each node and wait for its job to finish.
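For context, each node handler is roughly shaped like this (JobHandler, do_the_work and the port number are placeholders, not my actual code):

    import SocketServer

    class JobHandler(SocketServer.BaseRequestHandler):
        def handle(self):
            # read the job payload sent by the master
            data = self.request.recv(4096)
            # do_the_work is a placeholder; the real job takes 30 s to 15 min
            result = do_the_work(data)
            # reply to the master once the job is done
            self.request.sendall(result)

    server = SocketServer.TCPServer(("0.0.0.0", 9000), JobHandler)
    server.serve_forever()

On the master side, I dispatch the jobs through a pool: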
    import multiprocessing

    def run(nodes, args):
        # one pool worker per node; each worker blocks on its node's socket
        pool = multiprocessing.Pool(len(nodes))
        return pool.map(load_job, zip(nodes, args))
The load_job function sends the data with socket.sendall and, right after that, blocks on socket.recv waiting for the result (which can take a long time to arrive).
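Simplified, load_job does roughly this (the port number and serialize() stand in for what my real code does):

    import socket

    def load_job(node_and_args):
        node, job_args = node_and_args
        sock = socket.create_connection((node, 9000))
        try:
            # push the job to the node
            sock.sendall(serialize(job_args))
            # block until the node answers (possibly 15 minutes later);
            # this is the recv() that starts returning '' when things break
            return sock.recv(1024 * 1024)
        finally:
            sock.close()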
The program runs fine until about 200 or 300 of these jobs have run. When it breaks, socket.recv starts returning an empty string, and no more jobs can run until I kill the node processes and start them again.
How should I wait for the data to arrive? Also, error handling in the pool is very poor: it captures the exception in the worker process and re-raises it in the master without the original traceback, and this error is not easy to reproduce...
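The only workaround I can think of is wrapping the worker so it ships the formatted traceback back itself, something like this (load_job_safe is just a made-up name):

    import traceback

    def load_job_safe(node_and_args):
        # multiprocessing.Pool only re-raises the exception in the master,
        # without the worker's stack, so format the traceback here instead
        try:
            return ("ok", load_job(node_and_args))
        except Exception:
            return ("error", traceback.format_exc())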
EDIT: Now I think this problem has nothing to do with sockets:
After some research, it looks like my nodes are opening way too many processes (because they also run their jobs in a multiprocessing.Pool) and somehow those processes are never cleaned up!
I found these SO questions (here and here) about zombie processes when using multiprocessing in a daemonized process (exactly my case!).
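If the leaked worker processes really are the cause, I suppose the fix on the node side is just to make sure every pool gets closed and joined (run_one_job and jobs are placeholders for the node's actual work):

    import multiprocessing

    pool = multiprocessing.Pool()
    try:
        results = pool.map(run_one_job, jobs)
    finally:
        pool.close()   # stop accepting new work
        pool.join()    # reap the worker processes so they don't linger as zombies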
I'll need to dig into the problem further, but for now I'm killing the nodes and restarting them after some time.