
I have a string-processing job in Python, and I would like to speed it up with a thread pool. The individual string-processing tasks have no dependencies on each other, and each result is stored in a MongoDB database.

I wrote my code as follows:

import multiprocessing
from multiprocessing.pool import ThreadPool

def _process(s):
    # Do stuff: pure Python string manipulation.
    # Save the output to a database (PyMongo).
    pass

thread_pool_size = multiprocessing.cpu_count()
pool = ThreadPool(thread_pool_size)
for single_string in string_list:
    pool.apply_async(_process, [single_string])
pool.close()
pool.join()

I ran the code on a Linux machine with 8 CPU cores, and after the job had been running for a few minutes the maximum CPU usage was only around 130% (as read from top).

Is using a thread pool the correct approach here? Is there a better way to do this?

  • Are you using the built-in `multiprocessing` module or a separate module? – 101 Apr 28 '15 at 04:47
  • The built-in one, for Python 2.7. Thanks. – Ivor Zhou Apr 28 '15 at 04:50
  • I removed the MongoDB tag for two reasons: first, the code shown has nothing to do with it. Second, the question is aimed at Python's multiprocessing capabilities. Please refrain from adding your whole stack to the tags. – Markus W Mahlberg Apr 28 '15 at 07:30

2 Answers


You might try using multiple processes instead of multiple threads. Here is a good comparison of both options. In one of the comments there, it is stated that Python cannot use multiple CPUs while working with multiple threads, due to the Global Interpreter Lock (GIL). So instead of a thread pool you should use a process pool to take full advantage of your machine.
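
A minimal sketch of that swap, in the spirit of the question's code; the `string_list` contents and the worker body here are placeholders, and only the pool construction changes since `multiprocessing.Pool` exposes the same interface as `ThreadPool`:

import multiprocessing

def _process(s):
    # With a process pool, the pure-Python string work runs in
    # separate processes, so it is no longer serialized by the GIL.
    return len(s)  # stand-in for the real string manipulation

if __name__ == '__main__':
    string_list = ['alpha', 'beta', 'gamma']  # placeholder input
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    results = pool.map(_process, string_list)  # args and results must be picklable
    pool.close()
    pool.join()

One caveat: the worker function must be defined at module top level so it can be pickled, and each worker process should open its own MongoDB connection rather than inheriting one across the fork.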

  • I used to think ThreadPool was implemented with a process pool in Python, because it lives in the multiprocessing package. But after I replaced the thread pool with a process pool, the speed increased quite a lot. I will do more research and may update the question if I have new findings. Thank you. – Ivor Zhou Apr 28 '15 at 09:26

Perhaps _process isn't CPU bound; it might be slowed by the file system or the network if you're writing to a database. You could check whether CPU usage rises when you make your process truly CPU bound, for example:

def _process(s):
    for i in xrange(100000000):
        j = i * i
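
Conversely, if the bottleneck is the MongoDB write, timing the two halves of the worker will show it. A rough sketch, assuming PyMongo 3's `insert_one` and a MongoDB server on localhost; the database and collection names here are made up:

import time
from pymongo import MongoClient

collection = MongoClient()['test_db']['results']  # hypothetical names

def _process(s):
    t0 = time.time()
    result = s.upper()  # stand-in for the real string manipulation
    t1 = time.time()
    collection.insert_one({'result': result})
    t2 = time.time()
    print 'string work: %.6f s, db write: %.6f s' % (t1 - t0, t2 - t1)

_process('hello world')  # example call

If the db write dominates, adding CPU cores won't help; batching the inserts or using more concurrent writers would matter more.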
  • I used this function and found that CPU usage now goes up to around 300%, which is much better. But is it possible to increase it further? The theoretical limit for an 8-core CPU is 800%, right? – Ivor Zhou Apr 28 '15 at 04:59