As DavidW has pointed out, your code has some issues with memory management, and it would be better to use a NumPy array directly:
%%cython
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct(double[:] arr):
    cdef:
        int N = arr.shape[0], i
        double[:] out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
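For reference, the operation itself (keep arr[i] whenever it is larger than its predecessor, otherwise write 0) can also be expressed with plain vectorized NumPy. A sketch, with numpy_reference as an illustrative name; unlike the loop above it sets the first element to 0 instead of leaving it uninitialized:

import numpy as np

def numpy_reference(arr):
    out = np.zeros_like(arr)
    # keep arr[i] where it exceeds arr[i-1], otherwise 0
    out[1:] = np.where(arr[1:] > arr[:-1], arr[1:], 0.0)
    return out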
It is about as fast as the faulty original version:
import numpy as np
np.random.seed(0)
k = np.random.rand(5*10**7)
%timeit func_memview(k) # 413 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_memview_correct(k) # 412 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The question is how this code can be made faster. The most obvious options are:
- Parallelization (a Cython prange sketch follows this list).
- Using vectorization/SIMD instructions.
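For the first option, a parallelized Cython version could look roughly like this (a sketch: func_memview_par is an illustrative name, and the OpenMP flags shown are gcc/clang-style; MSVC would need /openmp instead). For a memory-bound loop like this one the gains are usually limited:

%%cython --compile-args=-fopenmp --link-args=-fopenmp
from cython cimport boundscheck, wraparound
from cython.parallel import prange
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_par(double[::1] arr):
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)
    # prange releases the GIL and distributes the iterations over OpenMP threads
    for i in prange(1, N, nogil=True):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)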
It is notoriously hard to ensure that the C code generated by Cython gets vectorized; see for example this SO-post. For many compilers it is necessary to use contiguous memory views to improve the situation, i.e.:
%%cython -c=/O3
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_correct_cont(double[::1] arr):  # <---- HERE
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)  # <---- HERE
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)
On my machine it is not really much faster:
%timeit func_memview_correct_cont(k) # 402 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
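One knob that is easy to try is the set of flags handed to the compiler behind the %%cython magic. The flag syntax is compiler-specific; with gcc or clang one would pass something like -O3 and -march=native instead (a sketch, with func_memview_native as an illustrative name; whether it actually helps depends on the compiler):

%%cython -c=-O3 -c=-march=native
from cython cimport boundscheck, wraparound
import numpy as np

@boundscheck(False)
@wraparound(False)
def func_memview_native(double[::1] arr):
    # same loop as func_memview_correct_cont, rebuilt with
    # gcc/clang-style vectorization-friendly flags
    cdef:
        int N = arr.shape[0], i
        double[::1] out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return np.asarray(out)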
Other compilers might do better. However, I've often seen gcc and msvc struggle to produce optimal assembly for the kind of code that is typical for filtering (see for example this SO-question). Clang is much better at this, so the easiest solution is probably to use numba:
import numba as nb
import numpy as np

@nb.njit
def nb_func(arr):
    N = arr.shape[0]
    out = np.empty(N)
    for i in range(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out
which outperforms the Cython code by almost a factor of 3:
%timeit nb_func(k) # 151 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
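Both versions compute the same values for all indices from 1 onwards (index 0 is left uninitialized by np.empty in both loops), which can be checked directly, e.g.:

np.testing.assert_array_equal(func_memview_correct(k)[1:], nb_func(k)[1:])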
It is easy to parallelize the numba version using prange, but the win is not that big: the parallelized version runs in 116 ms on my machine.
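A parallel variant could look like this (a sketch; nb_func_par is an illustrative name, and the achievable speedup is limited because the loop is mostly memory-bound):

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def nb_func_par(arr):
    N = arr.shape[0]
    out = np.empty(N)
    # prange distributes the iterations over all available cores
    for i in nb.prange(1, N):
        if arr[i] > arr[i-1]:
            out[i] = arr[i]
        else:
            out[i] = 0.0
    return out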
To summarize: for this type of task my advice is to use numba. Using Cython is trickier, and the final performance depends on the compiler used in the background.