
I have an application that starts subprocesses (currently 64) and does some work. Each process finishes after about 45 minutes, but the parent process never exits: it hangs in the join loop.

I start the processes like this:

import sys
import multiprocessing
from multiprocessing import Queue

def worker(out_q):
    # do something that takes a lot of time
    print('done working')
    sys.exit(0)

def main():
    procs = []
    out_q = Queue()

    for i in range(opt.num_threads):
        sys.stdout.write("\r\tStarting Worker Process: %d" % (i+1))
        sys.stdout.flush()
        p = multiprocessing.Process(target=worker, args=(out_q,))
        procs.append(p)
        p.start()

    # then I wait for all processes to finish:

    try:
        for i, p in enumerate(procs):
            print("waiting for process %d" % i)
            p.join()
            print("process %d joined" % i)
    except KeyboardInterrupt:
        sys.exit(0)

if __name__ == "__main__":
    main()

The only output I see is "waiting for process 0". Even after all workers have printed "done working", all 64 processes are still in the process list and the parent process does not finish. It seems the parent process is hung, because it can't even be killed from the task manager.

How can I debug this, or do I have to kill the process? And why doesn't a child process get removed from the process list after calling sys.exit(0)?
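For debugging, multiprocessing has a built-in logger that shows what the queue feeder threads and child shutdown are doing; a minimal sketch of enabling it (standard-library calls only):

import logging
import multiprocessing

# route multiprocessing's internal log messages (queue feeder threads,
# process start/stop) to stderr, to see where a join blocks
logger = multiprocessing.log_to_stderr()
logger.setLevel(logging.DEBUG)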

reox
  • Do the workers put something in `out_q`? Do you consume the queue? – Janne Karila Feb 19 '14 at 14:57
  • Yes, they do. I saw this one: http://bugs.python.org/issue8237. I'm now checking how full the queue gets and will try to implement queue flushing... – reox Feb 19 '14 at 15:07
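For reference, the pattern that issue points to (and the multiprocessing docs describe under "Joining processes that use queues") is to drain the queue before joining; a minimal sketch, assuming each worker puts exactly one result on out_q:

# a child that has put items on a multiprocessing.Queue will not
# terminate until its feeder thread has flushed them into the pipe,
# so the parent must get() everything before it join()s
results = [out_q.get() for _ in procs]   # one result per worker (assumption)

for p in procs:
    p.join()    # safe now: the queue buffers are empty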

2 Answers


You shouldn't call sys.exit(0) in the worker function. It kills your subprocess before it has a chance to report back to the parent: the child dies before the queue's feeder thread has flushed its data, so p.join() waits forever.
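A minimal sketch of the worker without the explicit exit (do_work is a hypothetical stand-in for the long-running part):

def worker(out_q):
    result = do_work()     # hypothetical placeholder for the 45-minute job
    out_q.put(result)
    print('done working')
    # no sys.exit(0): returning lets multiprocessing flush the queue's
    # feeder thread and shut the child down cleanly on its own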

vartec
  • But the worker is not killed; I can see the process image in Process Explorer. Even if I don't do the sys.exit(0), the parent process hangs in that loop... I'll try again without sys.exit(0); that could take a while – reox Feb 19 '14 at 14:37
  • @reox: why don't you test your multiprocessing code with the time-consuming part commented out? – vartec Feb 19 '14 at 14:40
  • Because then everything works fine ;) Just as I said: the problem only occurs if I let it run for a while. I thought about a problem in the time-consuming part, but since every process reaches the end and reports back that it's done, I don't think the problem is there – reox Feb 19 '14 at 14:45
  • So I tested it again: every process prints out that it's done, but the loop still waits for the first process to join – reox Feb 19 '14 at 14:56

This may not be the direct cause of the hang.

You need to pass args as a tuple, (out_q,), or as a list, [out_q].

p = multiprocessing.Process(target=worker, args=(out_q,))
#                                                ^^^^^^^
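The trailing comma matters because Process unpacks args as the positional argument list; without it, (out_q) is just out_q:

p = multiprocessing.Process(target=worker, args=(out_q,))  # calls worker(out_q)
# args=(out_q) is the same as args=out_q; the child then tries to unpack
# the Queue object itself as an argument list and fails with a TypeError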
falsetru
  • Ah no, that is just a copy-and-paste mistake... the worker has more arguments and I stripped them down. The whole thing works if the runtime is shorter, so this problem only occurs on long runtimes > 10 min – reox Feb 19 '14 at 14:31
  • @reox, did you guard the multiprocessing-related code with `if __name__ == '__main__': ..`? – falsetru Feb 19 '14 at 14:32
  • Yes, I'll append that to the code I posted. – reox Feb 19 '14 at 14:35