
I have a string-processing job in Python, and I would like to speed it up with a thread pool. The individual string-processing tasks have no dependencies on each other, and each result is stored in a MongoDB database.

I wrote my code as follows:

import multiprocessing
from multiprocessing.pool import ThreadPool

def _process(s):
    # Do stuff: pure Python string manipulation.
    # Save the output to a database (PyMongo).
    pass

thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)
for single_string in string_list:
    pool.apply_async(_process, [single_string])
pool.close()
pool.join()

I ran the code on a Linux machine with 8 CPU cores, and after the job had been running for a few minutes the maximum CPU usage was only around 130% (as read from top).

Is using a thread pool the correct approach here? Is there a better way to do this?

  • Are you using the built-in `multiprocessing` module or a separate module? – 101 Apr 28 '15 at 04:47
  • The built-in one, for Python 2.7. Thanks. – Ivor Zhou Apr 28 '15 at 04:50
  • I removed the MongoDB tag for two reasons: first, the code shown has nothing to do with it. Second, the question is aimed at Python's multiprocessing capabilities. Please refrain from adding your whole stack to the tags. – Markus W Mahlberg Apr 28 '15 at 07:30

2 Answers


You might try using multiple processes instead of multiple threads. Here is a good comparison of both options. In one of the comments there, it is stated that Python cannot use multiple CPUs while working with multiple threads, due to the Global Interpreter Lock (GIL). So instead of a thread pool you should use a process pool to take full advantage of your machine.
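
A minimal sketch of that swap, in the spirit of the question's code; the `string_list` contents and the worker body here are placeholders, and only the pool construction changes since `multiprocessing.Pool` exposes the same interface as `ThreadPool`:

import multiprocessing

def _process(s):
    # With a process pool, the pure-Python string work runs in
    # separate processes, so it is no longer serialized by the GIL.
    return len(s)  # stand-in for the real string manipulation

if __name__ == '__main__':
    string_list = ['alpha', 'beta', 'gamma']  # placeholder input
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    results = pool.map(_process, string_list)  # args and results must be picklable
    pool.close()
    pool.join()

One caveat: the worker function must be defined at module top level so it can be pickled, and each worker process should open its own MongoDB connection rather than inheriting one across the fork.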

  • I used to think ThreadPool was implemented with a process pool in Python, because it lives in the multiprocessing package. But after I replaced the thread pool with a process pool, the speed increased quite a lot. I will do more research and may update the question if I have new findings. Thank you. – Ivor Zhou Apr 28 '15 at 09:26

Perhaps _process isn't CPU bound; it might be slowed by the file system or the network if you're writing to a database. You could check whether CPU usage rises when you make your process truly CPU bound, for example:

def _process(s):
    for i in xrange(100000000):
        j = i * i
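
Conversely, if the bottleneck is the MongoDB write, timing the two halves of the worker will show it. A rough sketch, assuming PyMongo 3's `insert_one` and a MongoDB server on localhost; the database and collection names here are made up:

import time
from pymongo import MongoClient

collection = MongoClient()['test_db']['results']  # hypothetical names

def _process(s):
    t0 = time.time()
    result = s.upper()  # stand-in for the real string manipulation
    t1 = time.time()
    collection.insert_one({'result': result})
    t2 = time.time()
    print 'string work: %.6f s, db write: %.6f s' % (t1 - t0, t2 - t1)

_process('hello world')  # example call

If the db write dominates, adding CPU cores won't help; batching the inserts or using more concurrent writers would matter more.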
  • I used this function and found that CPU usage now goes up to around 300%, which is much better. But is it possible to increase it further? The theoretical limit for an 8-core CPU is 800%, right? – Ivor Zhou Apr 28 '15 at 04:59