As DavidW has pointed out, your code has some issues with memory management, and it would be better to use a NumPy array directly:
%%cython
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double[:] out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
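For reference, the operation itself (keep arr[i] whenever it is larger than its predecessor, otherwise write 0) can also be expressed with plain vectorized NumPy. A sketch, with numpy_reference as an illustrative name; unlike the loop above it sets the first element to 0 instead of leaving it uninitialized:

import numpy as np

def numpy_reference(arr):
    out = np.zeros_like(arr)
    # keep arr[i] where it exceeds arr[i-1], otherwise 0
    out[1:] = np.where(arr[1:] > arr[:-1], arr[1:], 0.0)
    return out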
It is about as fast as the faulty original version:
import numpy as np
np.random.seed(0)
k = np.random.rand(5*10**7)
%timeit func_memview(k) # 413 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_memview_correct(k) # 412 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The question is how this code can be made faster. The most obvious options are:
- Parallelization (a Cython prange sketch follows this list).
- Using vectorization/SIMD instructions.
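For the first option, a parallelized Cython version could look roughly like this (a sketch: func_memview_par is an illustrative name, and the OpenMP flags shown are gcc/clang-style; MSVC would need /openmp instead). For a memory-bound loop like this one the gains are usually limited:

%%cython --compile-args=-fopenmp --link-args=-fopenmp
from cython cimport boundscheck, wraparound
from cython.parallel import prange
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_par(double[::1] arr):
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)
    # prange releases the GIL and distributes the iterations over OpenMP threads
    for i in prange(1, N, nogil=True):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)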
It is notoriously hard to ensure that the C code generated by Cython gets vectorized; see for example this SO-post. For many compilers it is necessary to use contiguous memory views to improve the situation, i.e.:
%%cython -c=/O3
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct_cont(double[::1] arr):  # <---- HERE
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)  # <---- HERE
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
On my machine it is not really much faster:
%timeit func_memview_correct_cont(k) # 402 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
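One knob that is easy to try is the set of flags handed to the compiler behind the %%cython magic. The flag syntax is compiler-specific; with gcc or clang one would pass something like -O3 and -march=native instead (a sketch, with func_memview_native as an illustrative name; whether it actually helps depends on the compiler):

%%cython -c=-O3 -c=-march=native
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_native(double[::1] arr):
    # same loop as func_memview_correct_cont, rebuilt with
    # gcc/clang-style vectorization-friendly flags
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)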
Other compilers might do better. However, I've often seen gcc and msvc struggle to produce optimal assembly for the kind of code that is typical for filtering (see for example this SO-question). Clang is much better at this, so the easiest solution is probably to use numba:
import numba as nb
import numpy as np

@nb.njit
def nb_func(arr):
    N = arr.shape[0]
    out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out
which outperforms the Cython code by almost a factor of 3:
%timeit nb_func(k) # 151 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
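Both versions compute the same values for all indices from 1 onwards (index 0 is left uninitialized by np.empty in both loops), which can be checked directly, e.g.:

np.testing.assert_array_equal(func_memview_correct(k)[1:], nb_func(k)[1:])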
It is easy to parallelize the numba version using prange, but the win is not that big: the parallelized version runs in 116 ms on my machine.
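A parallel variant could look like this (a sketch; nb_func_par is an illustrative name, and the achievable speedup is limited because the loop is mostly memory-bound):

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def nb_func_par(arr):
    N = arr.shape[0]
    out = np.empty(N)
    # prange distributes the iterations over all available cores
    for i in nb.prange(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out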
To summarize: for this type of task my advice is to use numba. Using Cython is trickier, and the final performance depends on the compiler used in the background.