1

Let's take a small example Python dictionary, where the values are lists of integers.

example_dict1 = {'key1':[367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
    'key2':[754, 915, 622, 149, 279, 192, 312, 203, 742, 846], 
    'key3':[586, 521, 470, 476, 693, 426, 746, 733, 528, 565]}

Let's say I need to process the values in these lists, which I've implemented in the following function:

def manipulate_values(input_list):
    return_values = []
    for i in input_list:
        new_value = i ** 2 - 13
        return_values.append(new_value)
    return return_values

Now, I can easily process the values of this dictionary as follows:

for key, value in example_dict1.items():
    example_dict1[key] = manipulate_values(value)

resulting in the following:

example_dict1 = {'key1': [134676, 887, 717396, 232311, 786756, 427703, 120396, 254003, 170556, 674028], 
     'key2': [568503, 837212, 386871, 22188, 77828, 36851, 97331, 41196, 550551, 715703], 
     'key3': [343383, 271428, 220887, 226563, 480236, 181463, 556503, 537276, 278771, 319212]}

That works very well for small dictionaries.

My problem is, I have a massive dictionary with millions of keys and long lists. If I were to apply the above approach, the algorithm would be prohibitively slow.

How could I optimize the above?

(1) Multithreading: are there more efficient options for parallelizing this for loop over the dictionary than the traditional threading module?

(2) Would a different data structure be more appropriate?

I'm asking because I'm quite stuck on how best to proceed here. I don't see a better data structure than a dictionary, but the for loops across the dictionary (and then across the value lists) are quite slow. There may be something designed to be faster for this.

EDIT: As you can imagine, this is something of a toy example; the function in question is a bit more complicated than x**2 - 13.

I'm more interested in how best to work with a dictionary that has millions of keys and long lists of values.

EB2127
  • Do you have enough memory to store everything in a numpy array? – marcos Mar 09 '20 at 01:04
  • Threading won't help you, because the Python global interpreter lock (GIL) enforces cooperative multithreading at the Python level: only one thread can run at a time, so there is no parallelism. – tdelaney Mar 09 '20 at 01:17
  • Are the lists all the same size? – AMC Mar 09 '20 at 01:29
  • People have mentioned converting to numpy. It's even more efficient if those lists are created in numpy in the first place. – tdelaney Mar 09 '20 at 01:54
  • If you use numpy, it releases the GIL and you can potentially feed the processing off to a thread pool. It's usually not worth the performance penalty of the pool management. – tdelaney Mar 09 '20 at 01:56
  • @AMC the lists are not the same size, no – EB2127 Mar 09 '20 at 05:40
  • @tdelaney There's no way one could use multithreading here at all? Isn't there some way to "split" the dictionary, do the calculation above in a single thread, then join together---sort of a map/reduce approach? – EB2127 Mar 09 '20 at 05:50
  • As a more general question, isn't it possible to use parallelization here? See for example: https://stackoverflow.com/questions/30060088/python-how-to-parallelize-a-loop-with-dictionary – EB2127 Mar 09 '20 at 05:56
  • Not within a single process. As long as we are doing pure Python, only one thread runs at a time. One Python thread grabs the global interpreter lock (it's needed because variable manipulation is not thread safe), blocking all other Python threads. It's released and reacquired every 5 ms in case other Python threads are waiting to run. If you have 10 threads doing calculations, 9 are hung on the lock at all times. – tdelaney Mar 09 '20 at 07:58
  • The other problem is that variables in Python are relatively heavyweight, both in size and access time (they are wrapped in an object header), compared to vanilla C. That's why C extensions like numpy are so useful. – tdelaney Mar 09 '20 at 08:01
  • Another option is to code the slow stuff in [`cython`](https://cython.readthedocs.io/en/latest/) which integrates easily with python but is faster. – tdelaney Mar 09 '20 at 08:08
  • @EB2127 You could use multiprocessing in Python, but there’s no guarantee it will significantly improve performance. It might be worth comparing a pure NumPy approach, just multiprocessing, and numpy with multiprocessing. – AMC Mar 09 '20 at 12:21
  • Can you share more, or all, of your program? – AMC Mar 09 '20 at 19:18
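
Following up on the comments about the GIL and multiprocessing, here is a minimal sketch of the "split, compute, join" idea using multiprocessing.Pool instead of threading. The helper process_item and the default worker count are illustrative choices, not from the question or comments, and for millions of keys the cost of pickling each list to a worker process may eat much of the gain:

from multiprocessing import Pool

def manipulate_values(input_list):
    # same toy transformation as in the question
    return [i ** 2 - 13 for i in input_list]

def process_item(item):
    # worker: transform the list belonging to one key
    key, values = item
    return key, manipulate_values(values)

if __name__ == '__main__':
    example_dict1 = {'key1': [367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
                     'key2': [754, 915, 622, 149, 279, 192, 312, 203, 742, 846],
                     'key3': [586, 521, 470, 476, 693, 426, 746, 733, 528, 565]}
    with Pool() as pool:
        # each (key, list) pair goes to a worker process; the results are
        # stitched back together into a new dictionary
        result = dict(pool.map(process_item, example_dict1.items()))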

2 Answers

4

If you can store everything inside a numpy array, processing will be faster. To test scalability, I repeated each list 500,000 times (n = 500000 below), and these are my results:

from timeit import timeit
import numpy as np

n = 500000
example_dict1 = {'key1':[367, 30, 847, 482, 887, 654, 347, 504, 413, 821]*n,
    'key2':[754, 915, 622, 149, 279, 192, 312, 203, 742, 846]*n, 
    'key3':[586, 521, 470, 476, 693, 426, 746, 733, 528, 565]*n}

def manipulate_values(input_list):
    return_values = []
    for i in input_list:
        new_value = i ** 2 - 13
        return_values.append(new_value)
    return return_values

With your method:

for_with_dictionary = timeit("""
for key, value in example_dict1.items():
    example_dict1[key] = manipulate_values(value)
""", "from __main__ import example_dict1,manipulate_values ",number=5)

print(for_with_dictionary)

>>> 33.2095841

With numpy:

numpy_broadcasting = timeit("""
array = np.array(list(example_dict1.values()))
array = array ** 2 - 13
""", "from __main__ import example_dict1, np",number=5)
print(numpy_broadcasting)

>>> 5.039885

There is a significant speedup, at least 6x (about 33.2 s down to about 5.0 s).

marcos
  • How much of those 5 seconds is the conversion and how much is the calculation? – Kelly Bundy Mar 09 '20 at 01:45
  • And the calculation can be more space-efficient with in-place operations `array**=2;array-=13` – tdelaney Mar 09 '20 at 01:53
  • @Marcos Thanks for this. I'm curious how this translates for a larger number of dictionary keys---in reality, the dictionary has millions of keys. – EB2127 Mar 09 '20 at 05:45
  • Another question: Is it possible to do multithreading using the NumPy framework? One could possibly try this by having each thread use a unique set of keys, e.g. the answer here perhaps: https://stackoverflow.com/questions/30060088/python-how-to-parallelize-a-loop-with-dictionary/30075659 – EB2127 Mar 09 '20 at 06:35
  • I originally had the dictionary as a pandas dictionary, and I was using `.apply()`. It was too slow. – EB2127 Mar 09 '20 at 11:39
  • @EB2127 There is a big performance gain switching from `pandas` to `numpy`. In the code above the bottleneck is casting the dictionary to a numpy array. – marcos Mar 09 '20 at 13:08
  • @Marcos Thanks, I'll try it out. Also, I realized to myself I don't have a strong idea why `pandas` is so slow compared to numpy.... – EB2127 Mar 09 '20 at 13:52
  • @EB2127 _I originally had the dictionary as a pandas dictionary_ Do you mean a DataFrame? Where does the data come from before that? Can you share more of your program, so that we can properly understand what is going on? _Also, I realized to myself I don't have a strong idea why pandas is so slow compared to numpy...._ They're different libraries, for different purposes. – AMC Mar 09 '20 at 19:15
  • @AMC Yes, pandas DataFrame. – EB2127 Mar 09 '20 at 20:15
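
Picking up tdelaney's comment about in-place operations, a variant of the numpy timing above (reusing the same example_dict1, np, and timeit setup as in this answer) might look like the sketch below; it avoids allocating a second large temporary array, though how much that helps depends on the array size and dtype:

numpy_inplace = timeit("""
array = np.array(list(example_dict1.values()))
array **= 2   # square in place
array -= 13   # subtract in place, no extra temporary array
""", "from __main__ import example_dict1, np", number=5)
print(numpy_inplace)
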
1

If you have enough RAM:

import numpy as np

example_dict2 = dict(zip(example_dict1.keys(), np.array(list(example_dict1.values())) ** 2 - 13))
>>> example_dict2
{'key1': array([134676,    887, 717396, 232311, 786756, 427703, 120396, 254003,
       170556, 674028]), 'key2': array([568503, 837212, 386871,  22188,  77828,  36851,  97331,  41196,
       550551, 715703]), 'key3': array([343383, 271428, 220887, 226563, 480236, 181463, 556503, 537276,
       278771, 319212])}
Ethan
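
Since the comments mention that the real lists are not all the same length, stacking them into a single 2-D array (as in both answers above) may not apply directly. A minimal per-key fallback, assuming only that numpy is available, would be:

import numpy as np

# convert each list separately; this handles lists of different lengths,
# at the cost of building one array per key instead of one big 2-D array
example_dict2 = {key: np.asarray(values) ** 2 - 13
                 for key, values in example_dict1.items()}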