As a follow-up to this question here (thanks MSeifert for your help), I ran into the problem that I have to mask a numpy array new_values with an index array new_vals_idx before passing the masked array on to update val_dict.

I tried to apply the array masking to the solutions MSeifert proposed in his answer to the old post, but the performance is not satisfying.
The arrays and dicts I used for the following examples are:

import numpy as np
val_dict = {'a': 5.0, 'b': 18.8, 'c': -55/2}
for i in range(200):
    val_dict[str(i)] = i
    val_dict[i] = i**2

keys = ('b', 123, '89', 'c')  # dict keys to update
new_values = np.arange(1, 51, 1) / 1.0  # array with new values which has to be masked
new_vals_idx = np.array((0, 3, 5, -1))  # masking array
valarr = np.zeros((new_vals_idx.shape[0]))  # preallocation for masked array
length = new_vals_idx.shape[0]

To make my code snippets easier to compare with my old question, I'll stick to the function naming of MSeifert's answer. These are my attempts to get the best performance out of Python/Cython (the other answers were left out because their performance was too poor):

def old_for(val_dict, keys, new_values, new_vals_idx, length):
    for i in range(length):
        val_dict[keys[i]] = new_values[new_vals_idx[i]]
%timeit old_for(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.6 µs per loop

def old_for_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length):
    valarr = new_values[new_vals_idx]
    for i in range(length):
        val_dict[keys[i]] = valarr[i]
%timeit old_for_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length)
# 100000 loops, best of 3: 2.33 µs per loop
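
Note that the line valarr = new_values[new_vals_idx] rebinds the local name, so the preallocated array passed in is never actually reused. For comparison, filling the buffer in place would look something like this (a sketch, untimed; the function name is mine):

def old_for_w_valarr_inplace(val_dict, keys, new_values, valarr, new_vals_idx, length):
    # np.take with out= writes into the preallocated buffer
    # instead of allocating a new array on every call
    np.take(new_values, new_vals_idx, out=valarr)
    for i in range(length):
        val_dict[keys[i]] = valarr[i]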

def new2_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length):
    valarr = new_values[new_vals_idx].tolist()
    for key, val in zip(keys, valarr):
        val_dict[key] = val
%timeit new2_w_valarr(val_dict, keys, new_values, valarr, new_vals_idx, length)
# 100000 loops, best of 3: 2.01 µs per loop
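
For completeness, the zip loop could also be collapsed into a single dict.update call, which I haven't timed (just a sketch; the function name is mine):

def new2_update(val_dict, keys, new_values, new_vals_idx):
    # dict.update consumes the (key, value) pairs from zip directly
    val_dict.update(zip(keys, new_values[new_vals_idx].tolist()))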

Cython functions:

%load_ext cython
%%cython
import numpy as np
cimport numpy as np
cpdef new3_cy(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef Py_ssize_t i
    cdef double val  # this gives about 10 µs speed boost compared to directly assigning it to val_dict
    for i in range(length):
        val = new_values[new_vals_idx[i]]
        val_dict[keys[i]] = val
%timeit new3_cy(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.38 µs per loop

cpdef new3_cy_mview(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef Py_ssize_t i
    cdef int[:] mview_idx = new_vals_idx
    cdef double [:] mview_vals = new_values
    for i in range(length):
        val_dict[keys[i]] = mview_vals[mview_idx[i]]
%timeit new3_cy_mview(val_dict, keys, new_values, new_vals_idx, length)
# 1000000 loops, best of 3: 1.38 µs per loop

# NOT WORKING:
cpdef new2_cy_mview(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef double [new_vals_idx] masked_vals = new_values
    for key, val in zip(keys, masked_vals.tolist()):
        val_dict[key] = val

cpdef new2_cy_mask(dict val_dict, tuple keys, double[:] new_values, valarr, int[:] new_vals_idx, Py_ssize_t length):
    valarr = new_values[new_vals_idx]
    for key, val in zip(keys, valarr.tolist()):
        val_dict[key] = val

The Cython functions new3_cy and new3_cy_mview do not seem to be considerably faster than old_for. Passing in valarr to avoid constructing the array inside the function (which is going to be called several million times) even seems to slow things down.
Masking with the new_vals_idx array in new2_cy_mask gives me the error 'Invalid index for memoryview specified, type int[:]'. Is there any type like Py_ssize_t for arrays of indices?
Trying to create a masked memoryview in new2_cy_mview gives me the error 'Cannot assign type 'double[:]' to 'double [__pyx_v_new_vals_idx]''. Is there even such a thing as a masked memoryview? I wasn't able to find any information on this topic...
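
As far as I can tell there is no masked-memoryview type. The closest thing I could come up with is gathering element by element into a preallocated buffer, with the index array typed as a Py_ssize_t[:] memoryview (a sketch, untested; the function name is mine, and it assumes new_vals_idx is created with dtype=np.intp so it matches Py_ssize_t):

cpdef gather_into(double[:] new_values, Py_ssize_t[:] idx, double[:] out):
    # manual 'masking': memoryviews do not support fancy indexing,
    # so copy new_values[idx[i]] into out[i] one element at a time
    cdef Py_ssize_t i
    for i in range(idx.shape[0]):
        out[i] = new_values[idx[i]]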

Comparing the timing results with those from my old question, I guess that the array masking is the step taking up most of the time. And as it is most likely already highly optimized in numpy, there is probably not much to gain there. But the slowdown is so big that there must (hopefully) be a better way to do it.
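
A quick way to test that guess is to time the masking step on its own (results omitted, as they depend on the machine):

%timeit new_values[new_vals_idx]           # fancy indexing alone
%timeit new_values[new_vals_idx].tolist()  # plus the list conversion from new2_w_valarr
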
Any help is appreciated! Thanks in advance!

JE_Muc
  • I'm not sure you'll do much better than `new3_cy` - this is fundamentally a python-heavy function (reading from a tuple and writing to a dict), so the 15-20% speedup range is about what you can typically get. – chrisb Aug 29 '17 at 15:53
  • Ok, thanks for that information! I could get rid of the tuples really easily, and of the dicts with a little effort. Just the numpy arrays need to be used. What speedup would you expect and which datatypes would you recommend? – JE_Muc Aug 29 '17 at 16:35
  • If you need strings and/or heterogeneous types and/or hash lookups, it's hard to beat the builtin `dict`. If somehow your problem reduces to using only single, numeric types and numpy arrays, speedups of 10x+ aren't uncommon, just depends. – chrisb Aug 29 '17 at 17:15
  • Ok, then I'll try introducing numeric IDs instead of `dict` lookups with strings. And the `tuple`s? Since immutability is a nice feature for the general user of my code, and due to the high pure-Python speed, I thought it might be a good idea to use `tuple`s. Should I go for numpy arrays instead? I could set them to `array.flags.writeable = False`, but AFAIK that makes them incompatible with Cython... Well, immutability is not necessarily required... – JE_Muc Aug 30 '17 at 09:06
  • It's a trade-off: if your current approach was fast enough, I'd keep using dicts and tuples, simpler and safer. But to maximize performance, yes, you essentially want everything in numpy arrays or equivalent so you can work with the data at the C level. – chrisb Aug 30 '17 at 10:48
  • Ok, I just rewrote parts of my code to use arrays instead of dicts. This slowed down the rewritten functions by about 11% compared to `old_for` and about 16% compared to `new3_cy`. I guess my for loops in these functions are too short, with a maximum of 4 iterations, combined with the problem that my outer code with the huge for loop is still pure Python, and rewriting that would take at least a week... If I'm going to rewrite it: which datatype is best for storing instances of subclasses in a parent class for class composition in Cython? Currently I'm using `dict`, but perhaps `list`? – JE_Muc Aug 30 '17 at 12:33

1 Answer

One thing you can do in the current construction is turn off bounds checking (if it's safe to!). It won't make a huge difference, but it gives some incremental performance.

%%cython
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef new4_cy(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef Py_ssize_t i
    cdef double val  # this gives about 10 µs speed boost compared to directly assigning it to val_dict
    for i in range(length):
        val = new_values[new_vals_idx[i]]
        val_dict[keys[i]] = val

In [36]: %timeit new3_cy(val_dict, keys, new_values, new_vals_idx, length)
1.76 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [37]: %timeit new4_cy(val_dict, keys, new_values, new_vals_idx, length)
1.45 µs ± 31.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
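
Note that new_vals_idx in the question contains a -1, which @cython.wraparound(False) would stop treating as "last element". If negative indices are needed, only the bounds checking can be disabled (a sketch based on new4_cy above; the function name is mine):

%%cython
cimport cython

@cython.boundscheck(False)  # wraparound is left at its default (on)
cpdef new4_cy_wrap(dict val_dict, tuple keys, double[:] new_values, int[:] new_vals_idx, Py_ssize_t length):
    cdef Py_ssize_t i
    cdef double val
    for i in range(length):
        val = new_values[new_vals_idx[i]]
        val_dict[keys[i]] = val
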
chrisb
  • I pretty much need wraparound. Not using it would reduce the modularity of my program for users other than me. Disabling only boundscheck is a minor speedup, but I guess it will be worth it once I've tested everything. – JE_Muc Aug 30 '17 at 09:02