
I am not sure if what I am trying to do is valid practice, but here it goes: I need my program to be highly parallelized, so I thought I could make 2-3 processes, and each process could have 2-3 threads.

1) Is this possible? 2) Is there any point to it? 3) This is my code, but it hangs when I try to join the processes.

PQ = multiprocessing.Queue()

[...]

    def node(self, files, PQ):
        l1, l2 = self.splitList(files)
        p1 = multiprocessing.Process(target=self.filePro, args=(l1, PQ))
        p2 = multiprocessing.Process(target=self.filePro, args=(l2, PQ))
        p1.daemon = True
        p2.daemon = True
        p1.start()
        p2.start()

        p1.join()  # HANGS HERE
        p2.join()
        while 1:
            if PQ.empty():
                break
            else:
                print(PQ.get())
        PQ.join()

    def filePro(self, lst, PQ):
        TQ = queue.Queue()
        l1, l2 = self.splitList(lst)
        t1 = threading.Thread(target=self.fileThr, args=('a', l1, TQ))
        t2 = threading.Thread(target=self.fileThr, args=('b', l2, TQ))
        t1.daemon = True
        t2.daemon = True
        t1.start()
        t2.start()

        t1.join()
        t2.join()
        while 1:
            if TQ.empty():
                break
            else:
                PQ.put(TQ.get())
                TQ.task_done()
        TQ.join()

    def fileThr(self, id, lst, TQ):
        while lst:
            tmp_path = lst.pop()
            if not tmp_path[1]:
                continue
            for item in tmp_path[1]:
                TQ.put(1)
        TQ.join()
Angelo Uknown
  • I use processes when I need to maximize CPU usage, and I use threads when I have blocking operations like disk access, network, etc. So if I had a script to download many files I'd create a pool of threads and use it. If I had a distributed calculation that pegs the CPU I'd use a pool of processes. – Reut Sharabani Dec 13 '14 at 03:52
  • We need a [minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) if you want us to debug your code. – abarnert Dec 13 '14 at 04:21

3 Answers


1) Is this possible?

Yes.


2) Is there any point to it?

Yes. But generally not the point you're looking for.

First, just about every modern operating system uses a "flat" scheduler; there's no difference between 8 threads scattered across 3 programs and 8 threads across 8 programs.*

* Some programs can get a significant benefit by carefully using intraprocess-only locks or other synchronization primitives in some places where you know you're only sharing with threads from the same program—and, of course, by avoiding shared memory in those places—but you're not going to get that benefit by spreading your jobs across threads and your threads across processes evenly.

Second, even if you were using, say, old SunOS, in the default CPython interpreter, the Global Interpreter Lock (GIL) ensures that only one thread can be running Python code at a time. If you're spending your time running code from a C extension library that explicitly releases the GIL (like some NumPy functions), threads can help, but otherwise, they all just end up serialized anyway.
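You can see this with a toy benchmark (a sketch, not from the question; exact numbers will vary by machine): the same pure-Python loop run in 4 threads is no faster than running it sequentially, while 4 processes actually use 4 cores.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    while n:              # pure-Python CPU work: holds the GIL throughout
        n -= 1

if __name__ == '__main__':
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.time()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(burn, [10**7] * 4))
        print(pool_cls.__name__, round(time.time() - start, 2))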

The main case where threads and processes are useful together is where you have both CPU-bound and I/O-bound work. In that case, usually one is feeding the other. If the I/O feeds the CPU, use a single thread pool in the main process to handle the I/O, then use a pool of worker processes to do the CPU work on the results. If it's the other way around, use a pool of worker processes to do the CPU work, then have each worker process use a thread pool to do the I/O.
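For example, here's a minimal sketch of the first arrangement (a thread pool in the main process feeding a process pool); fetch, crunch, and the example path are placeholders for your real I/O and CPU work:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fetch(path):
    # blocking I/O: threads can overlap many of these waits
    with open(path, 'rb') as f:
        return f.read()

def crunch(data):
    # CPU-bound work: runs in worker processes, so it isn't GIL-serialized
    return sum(data)

def main(paths):
    with ThreadPoolExecutor(max_workers=8) as tpool, \
         ProcessPoolExecutor() as ppool:
        blobs = tpool.map(fetch, paths)           # I/O in threads...
        return list(ppool.map(crunch, blobs))     # ...feeding CPU in processes

if __name__ == '__main__':
    print(main(['/etc/hosts']))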


3) This is my code, but it hangs when I try to join the processes.

It's very hard to debug code when you don't give a minimal, complete, verifiable example.

However, I can see one obvious problem.

You're trying to use TQ as a producer-consumer queue, with t1 and t2 as producers and the filePro parent as the consumer. Your consumer doesn't call TQ.task_done() until after t1.join() and t2.join() return, which doesn't happen until those threads are done. But those producers won't finish because they're waiting for you to call TQ.task_done(). So, you've got a deadlock.
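Here's that cycle stripped down to a few lines (a repro sketch; I've added a join timeout and a daemon flag so the demo terminates instead of hanging):

import queue
import threading

q = queue.Queue()

def producer():
    q.put(1)
    q.join()                     # blocks until someone calls q.task_done()
    print('producer finished')   # never reached

t = threading.Thread(target=producer, daemon=True)
t.start()
t.join(timeout=2)                # your code does this join with no timeout
print('producer still alive?', t.is_alive())   # True: deadlock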

And, because each of your child processes' main thread is deadlocked, the processes never finish, so the p1.join() will block forever.

If you really want the main thread to wait until the other threads are done before doing any work, you don't need the producer-consumer idiom; just let the children do their work and exit without calling TQ.join(), and don't bother with TQ.task_done() in the parent. (Note that you're already doing this correctly with PQ.)

If, on the other hand, you want them to work in parallel, don't try to join the child threads until you've finished your loop.
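Concretely, the first option just means deleting the queue-joining handshake; a sketch of what filePro and fileThr would look like (untested against the rest of your class):

    def filePro(self, lst, PQ):
        TQ = queue.Queue()
        l1, l2 = self.splitList(lst)
        t1 = threading.Thread(target=self.fileThr, args=('a', l1, TQ))
        t2 = threading.Thread(target=self.fileThr, args=('b', l2, TQ))
        t1.start()
        t2.start()
        t1.join()   # safe now: the threads no longer block on TQ.join()
        t2.join()
        while not TQ.empty():    # only this thread touches TQ from here on
            PQ.put(TQ.get())

    def fileThr(self, id, lst, TQ):
        while lst:
            tmp_path = lst.pop()
            if not tmp_path[1]:
                continue
            for item in tmp_path[1]:
                TQ.put(1)
        # no TQ.join() here: just return, and let the parent join this thread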

abarnert
  • Thanks! That was a very complete answer; however, now I have one more question regarding your 2nd answer, about the GIL: does it mean that if I spawn 30 threads it will be the same as spawning 1? Because you said that they end up being serialized anyway... – Angelo Uknown Dec 13 '14 at 11:42
  • @AngeloUknown: No, it's not quite _the same_. You don't get *parallelism*. That is, even if you have 32 cores, using 30 threads won't run any faster than using 1. But you do get *concurrency*. Threads automatically interleave work between your tasks—and they can automatically take care of blocking—e.g., if one thread is waiting on I/O, the system will schedule a different one to run instead of blocking your whole program. Unless you write code that _explicitly_ deadlocks (as in your example), one thread won't block another from progressing. – abarnert Dec 17 '14 at 23:47
  • @AngeloUknown: I couldn't find a good source that discusses the difference in Python-specific terms, but [Parallelism vs. Concurrency](https://www.haskell.org/haskellwiki/Parallelism_vs._Concurrency) on the Haskell wiki is a pretty nice overview if you ignore the Haskell-specific stuff. – abarnert Dec 17 '14 at 23:48
  • You have mentioned that - "the Global Interpreter Lock (GIL) ensures that only one thread can be running Python code at a time". Do you mean one GIL per process? – variable Nov 06 '19 at 06:42
  • Thank you so much for this answer! Great one! – Benny Feb 03 '22 at 19:45

I compared the behaviour of the following 3 approaches on an IO+CPU-expensive and a strictly CPU-expensive blocking task:

  • multiprocessing only
  • multithreading only
  • both combined, using the fast_map function

Results for the IO+CPU-expensive task show a significant speed improvement when the combination of multiprocessing and multithreading is used. "-1" indicates that ProcessPoolExecutor failed due to "too many open files".

[chart: execution times for the IO+CPU-expensive task]

Results for the strictly CPU-expensive task show that multiprocessing by itself is slightly faster.

[chart: execution times for the strictly CPU-expensive task]

The fast_map function spawns a process for each CPU core (times 2) and creates a sufficient number of threads in each process to achieve full concurrency (unless the threads_limit argument is supplied). Source code, testing code and more information are available on the fast_map GitHub page.

If someone wants to play around with it or just use it in practice, it can be installed with:

python3 -m pip install fast_map

And used like:

from fast_map import fast_map
import time

def wait_and_square(x):
    time.sleep(1)   # blocking part: threads overlap these waits
    return x * x    # CPU part: processes run these in parallel

# threads_limit=None lets fast_map create as many threads as it needs
for i in fast_map(wait_and_square, range(8), threads_limit=None):
    print(i)
michalmonday

Better not.

If you combine multiprocessing with multithreading under the fork start method, you need to ensure your parent process is fork-safe. fork() copies only the calling thread, so a lock held by any other thread at fork time stays locked forever in the child, which easily leads to deadlock.
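One way around that, if you can afford the slower worker startup, is the "spawn" start method, so children begin in a fresh interpreter instead of a fork()ed copy of a single thread. A minimal sketch:

import multiprocessing

def worker(q):
    q.put('started in a fresh interpreter')

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')   # avoids fork()+threads hazards
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()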

Maybe you can replace multiprocessing.Queue with ZeroMQ, and run multiple Python interpreters manually.

Is it safe to fork from within a thread?

cadl