Before anyone marks this as a duplicate: I have been looking at StackOverflow posts for days and haven't found a good or satisfying answer.
I have a program that at some point takes individual strings (along with many other arguments and objects), does some complicated processing on them, and spits one or more strings back. Because each string is processed separately, multiprocessing seems natural here, especially since I work on machines with over 100 cores.
The following is a minimal example. It works with up to 12 to 15 cores, but if I give it more cores than that, it hangs at p.join(). I know it is hanging at the join because I added debug prints before and after the join, and execution stops somewhere between the two print statements.
Minimal example:
import os, random, sys, time, string
import multiprocessing as mp

letters = string.ascii_uppercase
align_len = 1300

def return_string(queue):
    n_strings = [1, 2, 3, 4]
    alignments = []
    # generating 1 to 4 sequences randomly, each sequence of length 1300
    # the original code might even produce more than 4, but 1 to 4 is an average case
    # instead of the random string there will be some complicated function called
    # in the original code
    for i in range(random.choice(n_strings)):
        alignment = ""
        for i in range(align_len):
            alignment += random.choice(letters)
        alignments.append(alignment)
    for a in alignments:
        queue.put(a)

def run_string_gen(cores):
    processes = []
    queue = mp.Queue()
    # running the target function 1000 times
    for i in range(1000):
        # print(i)
        process = mp.Process(target=return_string, args=(queue,))
        processes.append(process)
        if len(processes) == cores:
            counter = len(processes)
            for p in processes:
                p.start()
            for p in processes:
                p.join()  # <-- this is where it hangs with more than ~15 cores
            while queue.qsize() != 0:
                a = queue.get()
                # the original idea is that instead of print
                # I will be writing to a file that is already open
                print(a)
            processes = []
            queue = mp.Queue()
    # any leftover processes
    if processes:
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        while queue.qsize() != 0:
            a = queue.get()
            print(a)

if __name__ == "__main__":
    cores = int(sys.argv[1])
    if cores > os.cpu_count():
        cores = os.cpu_count()
    start = time.perf_counter()
    run_string_gen(cores)
    print(f"it took {time.perf_counter() - start}")
My suspicion is that the queue is getting full, but it's not that many strings: when I give it 20 cores it hangs, yet that is only about 20*4 = 80 strings (if the choice is always 4). Then again, each string is 1300 characters, so that is roughly 80 × 1300 ≈ 104 KB of data. Is that enough for the queue to get full?
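If the limit really is the underlying pipe buffer (I have read it is around 64 KiB on Linux, but that is an assumption on my part), then the hang should reproduce with a single worker that puts more data than that. Something like this sketch, which is hypothetical and not part of my real code:

import multiprocessing as mp

def worker(q):
    q.put("x" * 200_000)  # assumption: more than the pipe buffer can hold

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    p.join()  # would hang: the child cannot exit until its queued data is
              # flushed, and nothing is flushed until the parent calls q.get()
    print(len(q.get()))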
Assuming the queue is getting full, I am not sure at which point I should check and empty it. Doing it inside return_string seems like a bad idea, since the other processes share the queue and might be emptying or filling it at the same time. Do I use lock.acquire() and lock.release() then?
These strings will be written to a file in the end, so I could avoid the queue altogether and have each worker write its output directly. However, because starting a process means copying (pickling) its arguments, I cannot pass an _io.TextIOWrapper object (an already-open file to append to); instead I would have to open and close the file inside return_string, synchronizing with lock.acquire() and lock.release(). Repeatedly opening and closing the output file just to write to it seems wasteful, though.
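To be concrete, the file-plus-lock version I am imagining looks roughly like this (the out_path and lock arguments are my own hypothetical names, just to illustrate the idea):

import multiprocessing as mp
import random, string

letters = string.ascii_uppercase
align_len = 1300

def return_string(out_path, lock):
    alignments = []
    for _ in range(random.choice([1, 2, 3, 4])):
        alignments.append("".join(random.choice(letters) for _ in range(align_len)))
    # serialize the writers: open, append, close under the lock every single time
    with lock:
        with open(out_path, "a") as f:
            for a in alignments:
                f.write(a + "\n")

if __name__ == "__main__":
    lock = mp.Lock()
    processes = [mp.Process(target=return_string, args=("out.txt", lock))
                 for _ in range(20)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()  # no queue involved, so joining is safe

This works, but every call to return_string pays for an open() and a close(), which is exactly the waste I would like to avoid.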
Some of the suggested solutions out there:
1- De-queuing the queue before joining is one of the answers I found. However, I cannot anticipate how long each process will take, and adding a sleep after the p.start() loop and before the p.join() loop is bad (at least for my code): if the workers finish fast, I just end up waiting, and the whole point of this is speed. (My guess at a sleep-free version is the first sketch after this list.)
2- Adding some kind of sentinel value, e.g. None, to know that a worker has finished. I did not fully get this part: if I run the target function 10 times for 10 cores, I will have 10 sentinels, but the problem is that it is hanging and I cannot reach the code that empties the queue and checks for the sentinels. (My reading of this pattern is the second sketch below.)
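Regarding suggestion 1, my best guess at a sleep-free version is to drain in the parent while the workers are still alive, so the buffer behind the queue never fills (a sketch based on my minimal example; the 0.1-second timeout is an arbitrary choice of mine):

from queue import Empty  # mp.Queue.get raises the plain queue.Empty

def drain_then_join(processes, queue):
    for p in processes:
        p.start()
    results = []
    # keep pulling while any worker is alive, so no worker ever blocks
    # on exit waiting for its queued data to be flushed
    while any(p.is_alive() for p in processes) or not queue.empty():
        try:
            results.append(queue.get(timeout=0.1))
        except Empty:
            pass
    for p in processes:
        p.join()  # everything is flushed, so this should return promptly
    return results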
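And regarding suggestion 2, this is what I think the sentinel pattern is supposed to look like (again a hypothetical sketch, not my real code): each worker puts None last, and the parent drains the queue before joining, counting sentinels to know when all workers are done:

import multiprocessing as mp
import random, string

def return_string(queue):
    for _ in range(random.choice([1, 2, 3, 4])):
        queue.put("".join(random.choice(string.ascii_uppercase) for _ in range(1300)))
    queue.put(None)  # sentinel: this worker has produced everything it will

if __name__ == "__main__":
    queue = mp.Queue()
    processes = [mp.Process(target=return_string, args=(queue,)) for _ in range(20)]
    for p in processes:
        p.start()
    finished = 0
    # drain *before* joining; count sentinels instead of guessing when to stop
    while finished < len(processes):
        item = queue.get()
        if item is None:
            finished += 1
        else:
            print(item)  # or write to the already-open output file
    for p in processes:
        p.join()  # the queue is already drained, so this returns immediately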
Any suggestions or ideas on what to do here?