Multithreaded Python code doesn't utilize CPU effectively

Question

I am writing a pipeline that slices pictures into 256 * 256 tiles. Each tile is then processed with image operations such as right flipping, left flipping, elastic distortion, gamma correction, etc. The operations are not implemented by me but by NumPy, scikit-image or OpenCV, so the problem cannot be the operations themselves.

My idea is to create a thread pool of 24 threads. Each thread gets an initial batch of images, which it should process independently of the others; after the processing I collect the results and return them. However, my code doesn't seem to utilize the CPU very well.

The implementation of a single thread.

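from threading import Thread

class ImageWorker(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.tasks = []
        self.result = []
        # Iterable of augmentation steps; each step exposes a do() method.
        self.pipeline = get_pipeline()

    def add_task(self, task):
        self.tasks.append(task)

    def run(self):
        # Drain the task list; every pipeline step adds one result per task.
        for _ in range(len(self.tasks)):
            task = self.tasks.pop(0)
            for p in self.pipeline:
                result = p.do(task)
                self.result.append(result)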

The implementation of a thread pool

class ImageWorkerPool:
    def __init__(self, num_threads):
        self.workers = []
        self.work_index = 0
        for _ in range(num_threads):
            self.workers.append(ImageWorker())

    def add_task(self, task):
        self.workers[self.work_index].add_task(task)
        self.work_index += 1
        self.work_index = self.work_index % len(self.workers)
        assert self.work_index < len(self.workers)

    def start(self):
        for worker in self.workers:
            worker.start()

    def complete_and_return_result(self):
        for worker in self.workers:
            worker.join()
        result = []
        for worker in self.workers:
            result.extend(worker.result)
        return result

And this is how I create and populate a thread pool.

    threadpool = ImageWorkerPool(num_threads=24)
    for _ in tqdm(range(len(images)), desc="Augmentation"):
        task = tasks.pop(0)
        threadpool.add_task(task)

    threadpool.start()
    result = threadpool.complete_and_return_result()

I have a very beefy CPU with 24 threads, but they are utilized at 10% at most. What is the problem?

Edit: After changing from multithreading to multiprocessing, the code finished in 20 seconds, compared to 15 minutes with multithreading. Thanks, @AMC and @quamrana

Why are you not using the builtin `ThreadPoolExecutor`: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor — rdas, Mar 10 '20 at 21:40
I am not very familiar with multithreading in python, will there be any performance boost with ThreadPoolExecutor? — curiouscupcake, Mar 10 '20 at 21:42
Have you considered the fact that your app is limited by the I/O and not by the CPU power? — Emmanuel BERNAT, Mar 10 '20 at 21:43
All images are loaded into RAM before being processed and are stored in a list before being distributed to the threads. So I am pretty sure IO is not the bottleneck. — curiouscupcake, Mar 10 '20 at 21:44
Why would you write your own thread pool if there is already one provided? And it's fair to say that ThreadPoolExecutor will be more stable than your threadpool — rdas, Mar 10 '20 at 21:45
_However my code doesn't seem to utilize the CPU power very well._ That's because you're using threading, no? Why not use multiprocessing instead? — AMC, Mar 10 '20 at 21:49
If CPU is the bottleneck, then threads are not the way. Consider multiprocessing. — quamrana, Mar 10 '20 at 21:50
@AMC This might sound like a stupid question but why is multiprocessing more performant than multithreading? — curiouscupcake, Mar 10 '20 at 21:52
@LongNguyen _This might sound like a stupid question but why is multiprocessing more performant than multithreading?_ Don't worry, it's not stupid! I wouldn't say that one has better performance than the other in general, they're not really meant to be used for the same tasks so it's apples to oranges. There are many solid resources on the subject, a popular question here on SO is https://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python. — AMC, Mar 10 '20 at 21:54
Threading suffers from the GIL, whereas multiprocessing uses real processes, which can each run simultaneously on separate CPU cores. — quamrana, Mar 10 '20 at 21:54

score 1 · Accepted Answer · answered Mar 11 '20 at 00:26

This is well explained in many articles. The main culprit is the GIL (global interpreter lock).

In short: even with multiple CPUs and threads, only one thread can execute Python bytecode at a time, because executing bytecode requires holding the GIL (a mutex). Threading in Python only makes sense if you use modules written in C that release the GIL, or if most threads are suspended (waiting for I/O).
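The effect is easy to reproduce. The following is a minimal sketch, not a benchmark: `busy_sum` and `timed_map` are hypothetical names made up for illustration, and the worker count is arbitrary. It pushes the same pure-Python loop through either a thread pool or a process pool from `concurrent.futures` and measures the elapsed time.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def busy_sum(n):
    """Pure-Python CPU-bound loop (hypothetical workload for illustration)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_map(executor_cls, jobs, workers=4):
    """Run busy_sum over jobs with the given executor class.

    Returns (elapsed_seconds, results). Pass ThreadPoolExecutor or
    ProcessPoolExecutor as executor_cls to compare the two.
    """
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(busy_sum, jobs))
    return time.perf_counter() - start, results
```

Called from under an `if __name__ == "__main__":` guard with, say, `jobs = [2_000_000] * 8`, `timed_map(ThreadPoolExecutor, jobs)` will typically take about as long as running the jobs one after another on a standard (GIL-enabled) CPython build, while `timed_map(ProcessPoolExecutor, jobs)` can spread them over several cores. Exact numbers depend on the machine.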

The solution, as others mentioned, is to use the multiprocessing module or to use another language.
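A process-based version of the question's pool could look roughly like this. It is a sketch under assumptions: the two flip functions and `PIPELINE` are hypothetical stand-ins for the real `get_pipeline()` steps, tiles are plain nested lists rather than NumPy arrays, and `augment_all` is an invented name. It uses `concurrent.futures.ProcessPoolExecutor` instead of a hand-written pool.

```python
from concurrent.futures import ProcessPoolExecutor


def flip_left_right(tile):
    """Mirror each row of the tile (hypothetical pipeline step)."""
    return [row[::-1] for row in tile]


def flip_up_down(tile):
    """Reverse the row order of the tile (hypothetical pipeline step)."""
    return tile[::-1]


# Assumed stand-in for the question's get_pipeline(). Module-level functions
# are used because they can be pickled and sent to worker processes.
PIPELINE = (flip_left_right, flip_up_down)


def augment(tile):
    """Apply every pipeline step to one tile; one result per step."""
    return [step(tile) for step in PIPELINE]


def augment_all(tiles, executor_cls=ProcessPoolExecutor, workers=None, chunksize=16):
    """Distribute tiles over a pool and return the flattened results.

    Results keep the input order. chunksize batches tiles per round trip,
    which reduces inter-process overhead for many small tasks.
    """
    with executor_cls(max_workers=workers) as pool:
        per_tile = pool.map(augment, tiles, chunksize=chunksize)
        return [result for results in per_tile for result in results]
```

A call such as `result = augment_all(images, workers=24)` should sit under an `if __name__ == "__main__":` guard, since worker processes may re-import the main module on platforms that start them with spawn. Tiles and results are pickled when they cross process boundaries, so very small per-tile workloads may gain less than large ones.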

I suggest searching SO for the following keywords to get some insight:

python multithreading gil performance
