37

I am reading various tutorials on the multiprocessing module in Python, and am having trouble understanding why/when to call process.join(). For instance, I stumbled across this example:

import math
import multiprocessing
from multiprocessing import Queue

# factorize_naive(n) is assumed to be defined elsewhere in the tutorial
nums = range(100000)
nprocs = 4

def worker(nums, out_q):
    """ The worker function, invoked in a process. 'nums' is a
        list of numbers to factor. The results are placed in
        a dictionary that's pushed to a queue.
    """
    outdict = {}
    for n in nums:
        outdict[n] = factorize_naive(n)
    out_q.put(outdict)

# Each process will get 'chunksize' nums and a queue to put its output
# dict into
out_q = Queue()
chunksize = int(math.ceil(len(nums) / float(nprocs)))
procs = []

for i in range(nprocs):
    p = multiprocessing.Process(
            target=worker,
            args=(nums[chunksize * i:chunksize * (i + 1)],
                  out_q))
    procs.append(p)
    p.start()

# Collect all results into a single result dict. We know how many dicts
# with results to expect.
resultdict = {}
for i in range(nprocs):
    resultdict.update(out_q.get())

# Wait for all worker processes to finish
for p in procs:
    p.join()

print resultdict

From what I understand, process.join() will block the calling process until the process whose join method was called has completed execution. I also believe that the child processes which have been started in the above code example complete execution upon completing the target function, that is, after they have pushed their results to the out_q. Lastly, I believe that out_q.get() blocks the calling process until there are results to be pulled. Thus, if you consider the code:

resultdict = {}
for i in range(nprocs):
    resultdict.update(out_q.get())

# Wait for all worker processes to finish
for p in procs:
    p.join()

the main process is blocked by the out_q.get() calls until every single worker process has finished pushing its results to the queue. Thus, by the time the main process exits the for loop, each child process should have completed execution, correct?

If that is the case, is there any reason for calling the p.join() methods at this point? Haven't all worker processes already finished, so how does that cause the main process to "wait for all worker processes to finish?" I ask mainly because I have seen this in multiple different examples, and I am curious if I have failed to understand something.

David Cain
Justin
  • I believe not calling join **could** leave subprocesses in a zombie state on some systems, so calling `p.join()` is a "best practice" to be sure that the subprocesses are finalized correctly. In a simple case like this you may not see the difference, but in more complex situations you could run into problems if the number of zombies keeps growing. – Bakuriu Jan 20 '13 at 21:55
  • What in practice would happen if you joined a zombified process? Would the calling process be blocked indefinitely, and would that signal that you have a problem? Or would it start some clean-up, i.e. force the zombie process to be purged from the process table? – Justin Jan 20 '13 at 21:58
  • It would force the OS to remove the zombie process from the process table. See: http://en.wikipedia.org/wiki/Zombie_process – Bakuriu Jan 20 '13 at 22:05

3 Answers

23

At the point just before you call join, all workers have put their results into the queue, but they have not necessarily returned from the worker function, and their processes may not yet have terminated. Whether they have done so by then depends on timing.

Calling join makes sure that all processes are given the time to properly terminate.
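
A minimal sketch of that timing (hypothetical single-worker code, not from the answer): out_q.get() can return while the worker that produced the item is still running, because put() happens before the worker function returns and before the process shuts down, so is_alive() may still report True until join() is called.

import multiprocessing

def worker(out_q):
    out_q.put('result')
    # the process is still running here: it must return from worker()
    # and run its shutdown code before it actually terminates

if __name__ == '__main__':
    out_q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(out_q,))
    p.start()
    print(out_q.get())    # unblocks as soon as the item is available
    print(p.is_alive())   # often still True at this point
    p.join()              # now the worker has definitely terminated
    print(p.is_alive())   # False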

oefe
  • I'm willing to accept this as an answer, although I'm still curious: what are the consequences of not joining? I mean, in this case, each worker has finished execution, all results have been pulled from the queue, so resultdict is going to have the solution. If the main process then happens to terminate before the worker processes do, shouldn't the worker processes still complete their termination? Will not joining cause the worker processes to be zombified? – Justin Jan 20 '13 at 22:12
  • @Justin if you cannot ensure that the worker processes have terminated, you cannot say that resultdict has the complete solution in it. When spawning processes, you have no control over how they are scheduled. – Seçkin Savaşçı Jan 20 '13 at 23:04
  • Do you say that in direct reference to the example code, or do you mean that in general? I agree that in general, that is true. However, my main concern was the above example, in which it seems quite clear to me that the main thread blocks until every child process has finished execution and has pushed its result to the result queue. After the main thread empties the result queue, each child process should unblock and be allowed to terminate, and as such, it was confusing to me as to why joins were made at that moment. I agree that this would be useful in different situations. – Justin Jan 21 '13 at 07:00
  • The workers don't terminate with the out_q.put call. They still have to return from the worker function, and probably several further functions on the call stack, and to execute process shutdown code; this takes some time, even if it's short. – oefe Jan 22 '13 at 22:23
  • Why not call `terminate()` instead of `join()` if putting to the queue is the last thing each process does? – young_souvlaki Jul 14 '23 at 16:22
22

Try to run this:

import math
import time
from multiprocessing import Queue
import multiprocessing

def factorize_naive(n):
    factors = []
    for div in range(2, int(n**.5)+1):
        while not n % div:
            factors.append(div)
            n //= div
    if n != 1:
        factors.append(n)
    return factors

nums = range(100000)
nprocs = 4

def worker(nums, out_q):
    """ The worker function, invoked in a process. 'nums' is a
        list of numbers to factor. The results are placed in
        a dictionary that's pushed to a queue.
    """
    outdict = {}
    for n in nums:
        outdict[n] = factorize_naive(n)
    out_q.put(outdict)

# Each process will get 'chunksize' nums and a queue to put its output
# dict into
out_q = Queue()
chunksize = int(math.ceil(len(nums) / float(nprocs)))
procs = []

for i in range(nprocs):
    p = multiprocessing.Process(
            target=worker,
            args=(nums[chunksize * i:chunksize * (i + 1)],
                  out_q))
    procs.append(p)
    p.start()

# Collect all results into a single result dict. We know how many dicts
# with results to expect.
resultdict = {}
for i in range(nprocs):
    resultdict.update(out_q.get())

time.sleep(5)

# Wait for all worker processes to finish
for p in procs:
    p.join()

print resultdict

time.sleep(15)

And open the task manager. You should see the 4 subprocesses sit in a zombie state for a few seconds before they disappear (once the join calls reap them):

[screenshot: task manager showing the four worker subprocesses in a zombie state]

In more complex situations the child processes could stay in a zombie state forever (like the situation you were asking about in another question), and if you create enough child processes you could fill the process table, causing trouble for the OS (which may kill your main process to avoid failures).
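
A minimal sketch of a related approach (with a hypothetical quick_task worker): multiprocessing.active_children() is documented to join, as a side effect, any children that have already finished, so calling it periodically also reaps zombies without blocking on join():

import time
import multiprocessing

def quick_task():
    pass  # returns almost immediately

if __name__ == '__main__':
    for _ in range(4):
        multiprocessing.Process(target=quick_task).start()

    time.sleep(1)                      # the children have exited but are zombies now
    multiprocessing.active_children()  # joins (reaps) the finished children
    time.sleep(15)                     # inspect the process table here: no zombies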

Bakuriu
  • Interesting, okay. I definitely get that in other cases it could be a problem, but it seems that in this case it might be a bit superfluous because the parent is blocked until the children finish executing anyway. Thanks for the clarification! – Justin Jan 20 '13 at 22:34
1

I am not exactly sure of the implementation details, but join also seems to be necessary to reflect that a process has indeed terminated (after calling terminate on it for example). In the example here, if you don't call join after terminating a process, process.is_alive() returns True, even though the process was terminated with a process.terminate() call.
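
A minimal sketch of that behaviour (hypothetical code, not the linked example): immediately after terminate() the child may not have exited yet, so is_alive() can still report True; join() waits for the child and reaps it.

import time
import multiprocessing

def sleeper():
    time.sleep(60)

if __name__ == '__main__':
    p = multiprocessing.Process(target=sleeper)
    p.start()
    p.terminate()
    print(p.is_alive())   # often still True: the signal was sent, but the
                          # child has not been waited on (reaped) yet
    p.join()              # waits for the child to exit and reaps it
    print(p.is_alive())   # False
    print(p.exitcode)     # negative on Unix, e.g. -15 for SIGTERM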

firedrillsergeant