I'm trying to speed up my code using ThreadPool.
The input is a dictionary with about 80,000 entries, each of which is a list of 25 elements. I have to produce an output list for each entry by processing and combining the elements of its list.
All the lists can be analyzed independently, so this setting should be easily parallelizable.
Here's the basic setup I used for pool.map:
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(NUM_THREADS)
output = pool.map(thread_compute, iterable_list)
pool.close()
pool.join()
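For context, a minimal self-contained version of this pattern (with a placeholder do_stuff_function and toy input, since the real ones aren't shown here) would be:

```python
from multiprocessing.dummy import Pool as ThreadPool

NUM_THREADS = 4

def do_stuff_function(list_of_25):
    # placeholder for the real processing/combining logic
    return sum(list_of_25)

def thread_compute(list_of_25):
    return do_stuff_function(list_of_25)

# toy stand-in for the real 80,000-entry input
iterable_list = [[i] * 25 for i in range(100)]

pool = ThreadPool(NUM_THREADS)
output = pool.map(thread_compute, iterable_list)  # preserves input order
pool.close()
pool.join()
```

pool.map blocks until every task has finished and returns the results in the same order as the input iterable.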
Approach 1 (wrong): define the dictionary as global and have each thread get a key of the dictionary as input.
# in the main
global input_dict
iterable_list = input_dict.keys()

# thread_compute function
def thread_compute(key_i):
    list_to_combine = input_dict[key_i]
    # call a function
    processed_list = do_stuff_function(list_to_combine)
    return processed_list
I soon realized approach 1 would not work: the global variable input_dict, even though it is never written to (and so should be thread-safe to read), is still protected by the GIL (link 1, link 2), a globally enforced lock when trying to safely access Python objects from within separate threads.
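The effect is easy to reproduce with a purely CPU-bound toy function (a stand-in, not the real do_stuff_function): in CPython only one thread executes bytecode at a time, so the threaded version returns the same results but typically takes about as long as the serial one.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def cpu_work(n):
    # pure-Python busy loop: holds the GIL the whole time it runs
    total = 0
    for i in range(n):
        total += i * i
    return total

tasks = [100000] * 8

t0 = time.time()
serial = [cpu_work(n) for n in tasks]
t_serial = time.time() - t0

pool = ThreadPool(4)
t0 = time.time()
threaded = pool.map(cpu_work, tasks)
pool.close()
pool.join()
t_threaded = time.time() - t0

# same results; t_threaded is usually no better than t_serial,
# because the GIL serializes the pure-Python loop
assert serial == threaded
```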
Approach 2 (not working): I created a list containing the same elements as input_dict, where each entry is a 25-item list. This list is not a global variable, so each thread should be able to access its element (a 25-item list) without any overhead due to the GIL.

# in the main
items = list(input_dict.items())
iterable_list = [items[i][1] for i in range(len(items))]

# making sure the content is correct
assert(len(iterable_list) == len(input_dict))
for i in xrange(len(input_dict.keys())):
    assert(iterable_list[i] == input_dict[input_dict.keys()[i]])

# thread_compute function
def thread_compute(list_of_25_i):
    # call a function
    processed_list = do_stuff_function(list_of_25_i)
    return processed_list
Here are the execution times for 1, 2, 4 and 16 threads:
1: Done, (t=36.6810s).
2: Done, (t=46.5550s).
4: Done, (t=48.2722s).
16: Done, (t=48.9660s).
Why does adding threads cause such an increase in time? I am definitely sure that this problem can benefit from parallelization, and I don't think the overhead of creating threads can by itself be responsible for the increase.