
I am currently working on a modeling environment in Python which uses dicts to share connection properties of connected parts. My current way of doing this takes about 15-20% of my total program run time, which is quite a lot over a few million iterations...

So I find myself looking at how to speed up updating multiple values in dicts and getting multiple values from dicts.
My example dict looks like this (the number of key-value pairs is expected to stay in the current range of 300 to 1000, thus I filled it to this amount):

import numpy as np

val_dict = {'a': 5.0, 'b': 18.8, 'c': -55/2}
for i in range(200):
    val_dict[str(i)] = i
    val_dict[i] = i**2

keys = ('b', 123, '89', 'c')
new_values = np.arange(10, 41, 10)
length = new_values.shape[0]

While keys, the shape of new_values and the number of key-value pairs in val_dict will always be constant, the values of new_values change at each iteration and thus have to be written to the dict at each iteration (and also retrieved at each iteration from within another part of my code).

I timed several methods; for getting multiple values from a dict, using itemgetter from the operator module seems to be the fastest. I can define getter before the iteration starts, because the needed keys are constant:

from operator import itemgetter

getter = itemgetter(*keys)
%timeit getter(val_dict)
The slowest run took 10.45 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 140 ns per loop

I guess this is quite ok, or is there anything faster?

But when assigning these values to a numpy array by masking, it slows down horribly:

result = np.ones(25)
idx = np.array((0, 5, 8, -1))
def getter_fun(result, idx, getter, val_dict):
    result[idx] = getter(val_dict)
%timeit getter_fun(result, idx, getter, val_dict)
The slowest run took 11.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.77 µs per loop

Is there any way I can improve that? I guess the tuple unpacking is the worst part here...
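For reference, another variant I would probably time the same way builds the intermediate array explicitly with np.fromiter instead of letting the masked assignment convert the tuple on the fly (just a sketch using the setup from above; I haven't benchmarked it and don't claim it's faster):

def fromiter_getter_fun(result, idx, getter, val_dict):
    # build the values array explicitly, then assign it to the masked positions
    result[idx] = np.fromiter(getter(val_dict), dtype=result.dtype, count=idx.shape[0])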

For setting multiple values I've timed a few ways to do it: a function which unpacks the values, a function which calls update with the key-value pairs given explicitly, a function using a for loop, a dict comprehension, and a generator expression.

def unpack_putter(val_dict, keys, new_values):
    (val_dict[keys[0]],
     val_dict[keys[1]],
     val_dict[keys[2]],
     val_dict[keys[3]]) = new_values
%timeit unpack_putter(val_dict, keys, new_values)
The slowest run took 8.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.29 µs per loop

def upd_putter(val_dict, keys, new_values):
    val_dict.update({keys[0]: new_values[0],
                     keys[1]: new_values[1],
                     keys[2]: new_values[2],
                     keys[3]: new_values[3]})
%timeit upd_putter(val_dict, keys, new_values)
The slowest run took 15.22 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 963 ns per loop

def for_putter(val_dict, keys, new_values, length):
    for i in range(length):
        val_dict[keys[i]] = new_values[i]
%timeit for_putter(val_dict, keys, new_values, length)
The slowest run took 12.31 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.14 µs per loop

def dictcomp_putter(val_dict, keys, new_values, length):
    val_dict.update({keys[i]: new_values[i] for i in range(length)})
%timeit dictcomp_putter(val_dict, keys, new_values, length)
The slowest run took 7.13 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.69 µs per loop

def gen_putter(val_dict, keys, new_values, length):
    gen = ((keys[i], new_values[i]) for i in range(length))
    val_dict.update(dict(gen))
%timeit gen_putter(val_dict, keys, new_values, length)
The slowest run took 10.03 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.54 µs per loop

The upd_putter would be the fastest, but can I somehow use it with a varying number of keys and new_values? They will still be constant during the iterations, but each part being considered has a different number of keys to update, which is determined by user input. Interestingly, the for loop seems quite OK to me. So I guess I'm doing it wrong and there must be a faster way to do it.
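Just to make explicit what I mean by that: a shape-independent version of upd_putter would have to build the key-value pairs at runtime, roughly like this (untimed sketch, the function name is made up):

def upd_putter_generic(val_dict, keys, new_values):
    # works for any equal number of keys and values, but loses the
    # hardcoded dict literal that makes upd_putter fast
    val_dict.update(zip(keys, new_values))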

One last thing to consider: I'll most probably use Cython soon, so I guess this will make the for loop favorable? Or I could use joblib to parallelize the for loop. I also thought about using numba, but then I'd have to get rid of all dicts...

Hopefully you can help me with this problem.

edit for MSeifert (even though I'm not sure if you meant it like that):

tuplelist = list()
for i in range(200):
    tuplelist.append(i)
    tuplelist.append(str(i))
keys_long = tuple(tuplelist)
new_values_long = np.arange(0,400)

%timeit for_putter(val_dict, keys_long, new_values_long, 400)
10000 loops, best of 3: 73.5 µs per loop    
%timeit dictcomp_putter(val_dict, keys_long, new_values_long, 400)
10000 loops, best of 3: 96.4 µs per loop    
%timeit gen_putter(val_dict, keys_long, new_values_long, 400)
10000 loops, best of 3: 129 µs per loop
  • Could you include timings for non-trivial amounts of keys? Like most of the timings told you it's likely that the results are cached somewhere. Instead of 4 keys it would be good to know how it performs with 1000 or more (I realize it might be a pain with the manual indexing :D ). – MSeifert Aug 25 '17 at 13:20
  • Ok, I added it for the non-manual indexing. Is that what you meant? – JE_Muc Aug 25 '17 at 13:35
  • Do you need dictionaries at all times? If not, perhaps you can operate with two lists, one containing the keys and another the values. You operate on the values (numpy can be exploited) and once you need the dict you zip the keys and values to form a dict. – Ignacio Vergara Kausel Aug 25 '17 at 13:37
  • I thought about that as well. Currently all parts (class instances stored in a parent class by composition) have IDs and all ports also have IDs. But I was not sure if the performance boost would be big enough to rewrite my code (and that would mean rewriting A LOT of code...). – JE_Muc Aug 25 '17 at 13:51

1 Answer


Let's focus on two very important things right now that have nothing to do with performance: maintainability and scalability.

The first two approaches with the manual indexing:

(val_dict[keys[0]],
 val_dict[keys[1]],
 val_dict[keys[2]],
 val_dict[keys[3]]) = new_values

and

val_dict.update({keys[0]: new_values[0],
                 keys[1]: new_values[1],
                 keys[2]: new_values[2],
                 keys[3]: new_values[3]})

hardcode the number of elements you insert (a maintenance nightmare), so these approaches just don't scale very well. Therefore I won't include them in the rest of the answer. I'm not saying that they are bad - they just don't scale, and it's hard to compare the timings of functions that only work for a specific number of entries.

First let me present two more approaches based on zip (use itertools.izip if you're using python-2.x):

def new1(val_dict, keys, new_values, length):
    val_dict.update(zip(keys, new_values))

def new2(val_dict, keys, new_values, length):
    for key, val in zip(keys, new_values):
        val_dict[key] = val

which would be the "most pythonic" approaches to solve this (at least in my opinion).

I also changed new_values to a list, because iterating over a NumPy array element by element is slower than converting the array to a list first and then iterating over the list. In case you're interested in the details, I elaborated on that part in another answer.
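In case you want to verify that claim on your machine, a minimal comparison would be the following (I deliberately don't include numbers here, the exact factor depends on your setup):

import numpy as np

arr = np.arange(1000)
lst = arr.tolist()

# iterating over the array produces a NumPy scalar per element,
# iterating over the list yields plain Python ints
%timeit [x for x in arr]
%timeit [x for x in lst]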

Let's see how the approaches perform:

import numpy as np

def old_for(val_dict, keys, new_values, length):
    for i in range(length):
        val_dict[keys[i]] = new_values[i]

def old_update_comp(val_dict, keys, new_values, length):
    val_dict.update({keys[i]: new_values[i] for i in range(length)})

def old_update_gen(val_dict, keys, new_values, length):
    gen = ((keys[i], new_values[i]) for i in range(length))
    val_dict.update(dict(gen))

def new1(val_dict, keys, new_values, length):
    val_dict.update(zip(keys, new_values))

def new2(val_dict, keys, new_values, length):
    for key, val in zip(keys, new_values):
        val_dict[key] = val

val_dict = {'a': 1, 'b': 2, 'c': 3}
keys = ('b', 123, '89', 'c')
new_values = np.arange(10, 41, 10).tolist()
length = len(new_values)
%timeit old_for(val_dict, keys, new_values, length)
# 4.1 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit old_update_comp(val_dict, keys, new_values, length)
# 9.56 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit old_update_gen(val_dict, keys, new_values, length)
# 17 µs ± 332 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit new1(val_dict, keys, new_values, length)
# 5.92 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit new2(val_dict, keys, new_values, length)
# 3.23 µs ± 84.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And with more keys and values:

val_dict = {'a': 1, 'b': 2, 'c': 3}
keys = range(1000)
new_values = range(1000)
length = len(new_values)
%timeit old_for(val_dict, keys, new_values, length)
# 1.08 ms ± 26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit old_update_comp(val_dict, keys, new_values, length)
# 1.08 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit old_update_gen(val_dict, keys, new_values, length)
# 1.44 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit new1(val_dict, keys, new_values, length)
# 242 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit new2(val_dict, keys, new_values, length)
# 346 µs ± 8.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So for bigger inputs my approaches seem to be much faster (2-5 times) than your approaches.

You could try to improve the approaches using Cython. Unfortunately, Cython doesn't support comprehensions in cdef or cpdef functions, so I only cythonized the other approaches:

%load_ext cython

%%cython

cpdef new1_cy(dict val_dict, tuple keys, new_values, Py_ssize_t length):
    val_dict.update(zip(keys, new_values.tolist()))

cpdef new2_cy(dict val_dict, tuple keys, new_values, Py_ssize_t length):
    for key, val in zip(keys, new_values.tolist()):
        val_dict[key] = val

cpdef new3_cy(dict val_dict, tuple keys, int[:] new_values, Py_ssize_t length):
    cdef Py_ssize_t i
    for i in range(length):
        val_dict[keys[i]] = new_values[i]

This time I made keys a tuple and new_values a NumPy array so that they work with the defined Cython functions:

import numpy as np

val_dict = {'a': 1, 'b': 2, 'c': 3}
keys = tuple(range(4))
new_values = np.arange(4)
length = len(new_values)
%timeit new1(val_dict, keys, new_values, length)
# 7.88 µs ± 317 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit new2(val_dict, keys, new_values, length)
# 4.4 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit new2_cy(val_dict, keys, new_values, length)
# 5.51 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

val_dict = {'a': 1, 'b': 2, 'c': 3}
keys = tuple(range(1000))
new_values = np.arange(1000)
length = len(new_values)
%timeit new1_cy(val_dict, keys, new_values, length)
# 208 µs ± 9.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit new2_cy(val_dict, keys, new_values, length)
# 231 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit new3_cy(val_dict, keys, new_values, length)
# 156 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So if you have a tuple and a NumPy array, you can achieve almost a factor-2 speedup with new3_cy, the function that uses normal indexing and a memoryview, at least if you have lots of key-value pairs that need to be inserted.
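One caveat with new3_cy: the int[:] memoryview requires the array dtype to match the C int type. If your platform's default integer is not a C int (for example int64 on 64-bit Linux), you would have to cast explicitly, roughly like this:

import numpy as np

val_dict = {'a': 1, 'b': 2, 'c': 3}
keys = tuple(range(1000))
# np.intc maps to the C int expected by the int[:] memoryview
new_values = np.arange(1000, dtype=np.intc)
new3_cy(val_dict, keys, new_values, len(new_values))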


Note that I didn't address getting multiple values from the dict, because operator.itemgetter is probably the best way to do that.

  • Thanks a lot! Yeah, I just started with the topic of increasing performance, so there might be a lot of things where I'm a newbie... The manual indexing was just for comparison. I could somehow implement it for a reasonable number of keys (there won't ever be more than 10 keys which have to be updated at a time) with try and so on, but leaving it out is the best. I also had an approach with zip before but forgot about it. I guess using a list for new_values is helping a lot! Thanks! – JE_Muc Aug 25 '17 at 13:45
  • I tried to keep it rather basic, but if something in the answer doesn't make sense to you or needs more elaboration, feel free to comment on it. I'll update the answer then (but it's the weekend right now, so don't expect it to happen in short order :-] ). – MSeifert Aug 25 '17 at 13:55
  • Yeah, I guess I'll go with your zip-and-list answer! I implemented it in my code and it reduced the update-values time in 500k iterations by about 400 ms and added about 190 ms for tolist. That is a speed boost of almost 30%! Thanks a lot! Yeah, it's the weekend here as well, so I most probably will be back on Monday... – JE_Muc Aug 25 '17 at 14:02
  • @Scotty1- I added some Cython versions of the functions. Just cythonized in IPython - but they could be faster. :) – MSeifert Aug 25 '17 at 14:18
  • Oh wow, thanks a lot for that! Since I'm quite new to Cython, that really helps me a LOT! Unfortunately I can only upvote your answer once, but hopefully others will also upvote it! – JE_Muc Aug 25 '17 at 14:21
  • I just stumbled upon the problem that my `new_values` array has in fact a lot more elements than the `new_values` that are passed to the dict, let's say `new_values=np.arange(1,51)`. That means I am masking it with for example `new_vals_idx=np.array((0, 3, 5, -1))` to get `new_values=new_values[new_vals_idx]`. But this is really slow in Cython and I can't get it to work with memoryviews. I tried some code in [here](https://pastebin.com/983kyMjf), but `nocy` seems to be the fastest, while `for_zip` is not working at all. I also tried your `new2` but it is slow. Is there any way to speed up masking? – JE_Muc Aug 29 '17 at 13:21
  • Oh and btw off-topic... I just realized that you are from Heidelberg. Well, I grew up in Neuenheim. :) – JE_Muc Aug 29 '17 at 13:23
  • I think this would be a bit complicated to discuss in the comments. I think that would justify a new question. Feel free to post a comment here with a link to the question :) – MSeifert Aug 29 '17 at 13:23
  • Ok, just opened it [here](https://stackoverflow.com/questions/45942567/performance-of-numpy-array-masking-in-cython) :) – JE_Muc Aug 29 '17 at 15:08