
I'm trying to speed up my code using ThreadPool.

The input is a dictionary with about 80000 entries, each of which is a list of 25 elements. I have to produce an output list for each entry in the dictionary by processing and combining the elements of its list.

All the lists can be analyzed independently, so this setting should be easily parallelizable.

Here's the basic setting I used for pool.map:

from multiprocessing.dummy import Pool as ThreadPool

pool   = ThreadPool(NUM_THREADS)
output = pool.map(thread_compute, iterable_list)
pool.close()
pool.join()
  • Approach 1 (wrong): define the dictionary as global and have each thread get a key of the dictionary as input.

    # in the main
    global input_dict
    iterable_list = input_dict.keys()
    
    # thread_compute function
    def thread_compute(key_i):
        list_to_combine = input_dict[key_i]
        # call a function
        processed_list = do_stuff_function(list_to_combine)
        return processed_list
    

I soon realized approach 1 will not work because the global variable input_dict, even though it's never accessed for write operations (and so should be thread safe), is protected by the GIL (link 1, link 2) - a globally enforced lock when trying to safely access Python objects from within separate threads.

  • Approach 2 (not working): I created a list containing the same elements as input_dict, where each entry is a list of 25 items. This list is not a global variable, and each thread should be able to access an element (a 25-item list) without any overhead due to the GIL.

    # in the main
    items = list(input_dict.items())
    iterable_list = [items[i][1] for i in range(len(items))]
    
    # making sure the content is correct
    assert(len(iterable_list) == len(input_dict))
    for i in xrange(len(input_dict.keys())):
        assert(iterable_list[i] == input_dict[input_dict.keys()[i]])
    
    # thread_compute function
    def thread_compute(list_of_25_i):
        # call a function
        processed_list = do_stuff_function(list_of_25_i)
        return processed_list
    

Here are the execution times for 1, 2, 4 and 16 threads:

   1:  Done, (t=36.6810s).
   2:  Done, (t=46.5550s).
   4:  Done, (t=48.2722s).
   16: Done, (t=48.9660s).

Why does adding threads cause such an increase in time? I am definitely sure that this problem can benefit from multithreading, and I think that the overhead of adding threads cannot be solely responsible for the increase.

Matteo
  • See [here](http://stackoverflow.com/questions/4748787/python-threading-and-gil) as a start, and there are plenty of other things thrown up by "python multithreading and gil". I understand mostly, but my background outside of CS means I wouldn't even attempt to give a technical answer :) I think you want something like multiprocessing instead – roganjosh Sep 27 '16 at 15:52
  • *because of the global variable input_dict, even though is never accessed for write operations [...], is protected by the GIL - a globally enforced lock when trying to safely access Python objects from within threads.* You misunderstand the GIL. The GIL doesn't protect Python objects. It protects the *internal interpreter state* and ensures that the interpreter loop is not pre-empted in the middle of executing a bytecode. – Martijn Pieters Sep 27 '16 at 15:54
  • In other words, all the GIL does is make sure that all thread switches take place in between two byte-code interpretation steps. What those bytecode instructions do to Python objects is well outside the purview of the GIL. Two Python threads can comfortably read from the same global `input_dict` dictionary, there is no lock preventing you from doing so. – Martijn Pieters Sep 27 '16 at 16:01
  • @Matteo: both answers tell you the exact same thing I tell you in the other answer: use multiprocessing. The library lets you share your dictionary among the processes. – Martijn Pieters Sep 27 '16 at 16:10
  • @MartijnPieters - I just read all the answers and comments from this question and the one you marked as duplicate. To make sure I understood correctly, if my `do_stuff_function` was heavily `I/O` or `network` bound then the code would benefit from multithreading, as the CPU could do other stuff while a thread is stuck waiting. Since the `do_stuff_function` is instead CPU bound I do not benefit at all from multi-threading. Now, if I wrote and called a C function that did the same exact computations, would multithreading suddenly work? I believe the answer is yes. Thanks! – Matteo Sep 27 '16 at 16:31
  • @Matteo: exactly; provided your C function is Python aware and [releases the GIL](https://docs.python.org/3/c-api/init.html#releasing-the-gil-from-extension-code) properly. – Martijn Pieters Sep 27 '16 at 16:34
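As a small illustration of the point made in the last two comments (a sketch assuming NumPy is available; NumPy releases the GIL during large array operations such as `np.dot`, so a thread pool can genuinely overlap work that is delegated to its C code, while plain-Python loops stay serialized):

    import numpy as np
    from multiprocessing.dummy import Pool as ThreadPool

    def c_backed_work(seed):
        # np.dot on large float arrays runs in C/BLAS and releases the GIL,
        # so several threads can overlap this part of the work
        rng = np.random.RandomState(seed)
        a = rng.rand(1000, 1000)
        return np.dot(a, a).sum()

    pool = ThreadPool(4)
    results = pool.map(c_backed_work, range(8))
    pool.close()
    pool.join()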

2 Answers


If your do_stuff_function is CPU-bound, then running it in multiple threads will not help, because the GIL only allows one thread to execute Python bytecode at a time.

The way around this in Python is to use multiple processes: just replace

from multiprocessing.dummy import Pool

with

from multiprocessing import Pool
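
For reference, a minimal sketch of what the process-based version might look like (the dummy data, `NUM_PROCESSES` and the body of `do_stuff_function` below are placeholders, not the poster's actual code). With real processes the worker function has to be defined at module level so it can be pickled, and the `if __name__ == '__main__':` guard is needed on platforms that spawn rather than fork:

    from multiprocessing import Pool

    def do_stuff_function(list_of_25):
        # stand-in for the real CPU-bound work on one 25-item list
        return sorted(list_of_25)

    if __name__ == '__main__':
        # dummy input resembling the poster's 80000-entry dictionary
        input_dict = {i: list(range(25)) for i in range(80000)}
        NUM_PROCESSES = 4

        pool = Pool(NUM_PROCESSES)
        # each task receives a single 25-item list, not the whole dictionary
        output = pool.map(do_stuff_function, input_dict.values())
        pool.close()
        pool.join()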
Kien Truong
  • It was my understanding though that doing something like this actually just creates a new `dict` object for each process. In which case, rather than reduce any processing effort, you just have each process working on its own copy? In my adventures with multiprocessing, I used a `manager()` to give each process access to the same object in memory – roganjosh Sep 27 '16 at 15:57
  • Or, that some preprocessing was needed on the `dict` to split it into pieces such that each piece could be passed to a different process to remove duplication of effort – roganjosh Sep 27 '16 at 16:00
  • The GIL locks the interpreter loop and only allows 1 **Python thread** to run. Threads that are executing C code (I/O, computationally expensive external libraries, etc.) are free to unlock the GIL so that another CPU core can continue to execute Python bytecode. So if `do_stuff_function()` could be delegated to a library (written in C, say), then threads can be used just fine. – Martijn Pieters Sep 27 '16 at 16:04
  • @MartijnPieters is my assumption correct though that simply passing the object to a multiprocessing `Pool` would give each process their copy of the whole dictionary in memory, thereby making each process do the same amount of work multiple times rather than serving to divide and conquer the processing? From what I read, that happens when you spawn multiple processes, but I couldn't quite get my head around how a `Pool` could behave differently. – roganjosh Sep 27 '16 at 16:07
  • @roganjosh: yes, a naive sharing of the dict would require it to be shared among the processes in whole. Better use a queue and a central process that collects results. – Martijn Pieters Sep 27 '16 at 16:11
  • @MartijnPieters many thanks for the confirmation, I really struggled to understand the purpose of the `Pool` at all considering that limitation and did indeed go with a `queue`/`manager` for my CPU-bound task, but I assume now it's for things that are primarily non-CPU-bound. – roganjosh Sep 27 '16 at 16:16

Try using Pool (docs) as it uses processes instead of threads.

  • Threads in Python are concurrent as opposed to parallel, which basically means that the interpreter switches very quickly between several threads and, while executing one, locks out all the others, giving only the impression of operations running in parallel.

  • Processes, on the other hand, are much more costly to spawn, since each one is essentially its own instance of the interpreter, and the data it operates on must be serialized and sent from the main process to the worker. There it is deserialized and processed, and the results are serialized and sent back.

However, the time it takes to spawn the processes might have a negative impact on overall performance, considering the moderate processing times you posted.
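
If the per-item work is small, a good part of that serialization cost can be amortized by handing the workers items in batches via `map`'s `chunksize` argument. A minimal sketch, assuming `do_stuff_function` is pure CPU work (the function body and the dummy data below are made up for illustration):

    from multiprocessing import Pool, cpu_count

    def do_stuff_function(list_of_25):
        # stand-in for the real CPU-bound computation
        return sum(x * x for x in list_of_25)

    if __name__ == '__main__':
        # dummy 25-item lists standing in for the dictionary values
        work_items = [list(range(25)) for _ in range(80000)]

        pool = Pool(cpu_count())
        # a larger chunksize ships the items to the workers in batches,
        # amortizing the pickling/IPC overhead over many items
        output = pool.map(do_stuff_function, work_items, chunksize=500)
        pool.close()
        pool.join()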

timmwagener