
I have a worker process that goes like this:

import logging
from multiprocessing import Process

class worker(Process):
    def __init__(self, foo, bar):
        super().__init__()  # required so Process is set up before use
        # init stuff
    def run(self):
        # do stuff
        logging.info("done") # to confirm that the process is done running

And I start 3 of each worker class like this:

processes = 3
aproc = [None] * processes
bproc = [None] * processes

for i in range(processes):
    aproc[i] = worker(foo, bar)
    bproc[i] = worker2(foo, bar) # different worker class
    
    aproc[i].start()
    bproc[i].start()

However, at the end of my code I .join() each of the processes, but the joins just hang and the script never ends.

for i in range(processes):
    aproc[i].join()
    bproc[i].join()

Hitting CTRL+C gives me this traceback:

Traceback (most recent call last):
  File "[REDACTED]", line 571, in <module>
    sproc[0].join()
  File "/usr/lib/python3.9/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)

I've heard of the typical deadlock scenarios, but that shouldn't be the case here, since every process prints the logging statement confirming it is done running. Why is .join() still waiting on them? Any ideas? Thank you!

Edit: Unfortunately I can't get a minimal example working to share. Also, the processes do communicate with each other through multiprocessing.Queue()s, if that is relevant.
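I couldn't reduce my real code to a failing example, but the queue wiring looks roughly like this (a simplified sketch only; the inbox/outbox names, payload sizes, and loop counts are placeholders, not my actual code):

import logging
from multiprocessing import Process, Queue

class worker(Process):
    def __init__(self, inbox, outbox):
        super().__init__()
        self.inbox = inbox    # queue this worker reads from (placeholder)
        self.outbox = outbox  # queue this worker writes to (placeholder)

    def run(self):
        # Each put() hands the item to the queue's background feeder thread,
        # which flushes it to the underlying pipe.
        for _ in range(100):
            self.outbox.put("x" * 100_000)
        logging.info("done")  # this line is reached even if join() later hangs

if __name__ == "__main__":
    q = Queue()
    procs = [worker(Queue(), q) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # hangs if nothing ever drains q

If I understand the docs correctly, with payloads of this size the pipe buffer fills up, and a process that has buffered items on a queue won't exit until they are consumed, so "done" gets logged yet join() never returns.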

Edit 2: Traceback of another test:

Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.9/multiprocessing/queues.py", line 201, in _finalize_join
    thread.join()
  File "/usr/lib/python3.9/threading.py", line 1033, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
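If I'm reading this second traceback right, the shutdown finalizer is stuck joining a Queue's feeder thread, which matches the documented caveat that a process which has put items on a queue will not terminate until all buffered items have been flushed to the pipe. A sketch of the drain-before-join pattern I'm now trying (q stands in for my shared queue; placeholder code, not my real script):

from queue import Empty  # multiprocessing.Queue.get raises queue.Empty on timeout

results = []
# Keep consuming while any worker is still alive (or data remains) so that
# every queue's feeder thread can flush and the child processes can exit.
# q.empty() is only a best-effort check, but it's enough for a sketch.
while any(p.is_alive() for p in aproc + bproc) or not q.empty():
    try:
        results.append(q.get(timeout=0.1))
    except Empty:
        pass

for p in aproc + bproc:
    p.join()  # should return promptly once the queues are drained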
  • Traceback for the child process? Also, using spawn rather than fork eliminates deadlocks with `logging` threads... – Aaron Jan 06 '22 at 01:19
  • How may I get the traceback for the child process? I will test out using spawn instead, thanks! – Frankfurters Jan 06 '22 at 04:01
  • If your OS forwards signals to the entire process tree (as many do), the child will automatically print its traceback to stderr when it receives the interrupt signal. The problem is that many IDEs override the standard streams of the main process, so the child ends up talking to a stream nothing is listening on. Run your code from a terminal instead, and you should get the traceback. – Aaron Jan 06 '22 at 04:29
  • Hey, thanks for the reply. I am running the script through an Ubuntu 21.04 terminal, but no additional traceback was printed. Also, after leaving it running and going AFK, it eventually joined but I didn't catch how long it took. Running it again now and logging when it's done. – Frankfurters Jan 06 '22 at 04:41
  • AFAIK the child "should" get the same stderr as main then... I am aware that `mp.Pool` and `concurrent.futures` both try to catch and handle exceptions in children rather than just letting them exit, but based on your example, plain `mp.Process` shouldn't have any sort of error handling by default. What I know less about is whether child processes get all the same signals the parent gets (I'm not a regular Ubuntu user). Perhaps try getting the child pid and running `kill -SIGINT childpid` while main is trying to join? I would think that should print a traceback of where the child currently is. – Aaron Jan 06 '22 at 06:15
  • I've added an edit which shows a traceback from a different test, but I believe it's the same issue. – Frankfurters Jan 06 '22 at 07:05
  • Please add more details about the process functions. The foo and bar arguments don't match the `__init__` signature originally shown, which took no parameters besides self, yet you pass two. – Reza Akraminejad Jan 06 '22 at 07:20
  • This is not valid python: `processes = 3; aproc = [None for _ in processes]`. I guess you mean `aproc = [None] * processes`, or use `range(...)`. Without a MWE it is difficult to help. – deponovo Jan 06 '22 at 09:41
  • Based on the traceback of the child you posted, this sounds a lot like a queue deadlock. Can you post a simplified example of how you're using the queues? Those can absolutely cause deadlock... In most cases, the problem is `queue.put` with a full queue that isn't being read by anything. On the other hand, this could still relate to logging, which can sometimes use a `QueueHandler` to pass messages for multiprocessing – Aaron Jan 06 '22 at 15:47
  • Before you dive down that path, however, have you confirmed it's not [this](https://stackoverflow.com/a/29277211/3220135), as I mentioned earlier (about mixing `logging` with "fork")? Basically "fork" is fast, but very unsafe for threads (see the spawn sketch after this comment thread). – Aaron Jan 06 '22 at 15:53
  • @Aaron, I'm not sure the `queue.put` is the problem, because I can observe that all processes write to the queue as expected. However, the first `join()` does not complete. Interestingly, I have code with a similar structure that works without hanging, and the only difference is the function given to the processes. The code that works does 'a bit more', by virtue of more lines of code, than the code that fails. – qboomerang Jul 04 '22 at 14:03
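
For reference, the "spawn" suggestion from the comments amounts to something like this (a sketch only; worker, foo, and bar as in the question):

import multiprocessing as mp

if __name__ == "__main__":
    # "spawn" starts each child from a fresh interpreter instead of fork()ing
    # the parent, so a child cannot inherit a logging handler lock that some
    # other thread happened to be holding at the moment of the fork.
    mp.set_start_method("spawn")

    procs = [worker(foo, bar) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()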

0 Answers