I have some code that only needs to run a calculation on a subset of the rows of a 2D numpy array. E.g. if the array is, say:
[[0.5, 0.5, 0.5],
 [0.9, 0.2, 0.8],
 [0.9, 0.2, 0.8],
 [0.9, 0.2, 0.2]]
I might only want to use the 1st and 3rd rows. If most rows are to be excluded, it is faster to loop over just the wanted rows than to run a vectorized computation over the whole array.
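To illustrate the two approaches I mean (a minimal sketch; the row indices and values here are made up):

import numpy as np

vectors = np.array([[0.5, 0.5, 0.5],
                    [0.9, 0.2, 0.8],
                    [0.9, 0.2, 0.8],
                    [0.9, 0.2, 0.2]], dtype=np.float32)
target_vector = np.array([1.0, 0.0, 1.0], dtype=np.float32)
rows_to_use = [0, 2]  # hypothetical subset: 1st and 3rd rows

# Vectorized: fancy-index the subset, then one matrix-vector product
vectorized = vectors[rows_to_use] @ target_vector

# Looping: dot each selected row individually
looped = [np.dot(vectors[i], target_vector) for i in rows_to_use]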
Cython can speed up the iteration, but some experimenting seems to suggest that using the newer "memoryviews" API results in a large number of extra allocations (memoryview wrapper objects, I presume?), which makes it considerably slower than the old np.ndarray API, and under some circumstances slower than vanilla Python.
Here is a minimal example that demonstrates this:
In file my_cython.pyx:
# cython: profile=True
import numpy as np
cimport numpy as np

def cython_test1(
        np.ndarray vectors,
        np.ndarray target_vector):
    cdef Py_ssize_t i
    for i in range(1000000):
        x = np.dot(target_vector, vectors[i])

def cython_test2(
        np.float32_t[:,:] vectors,
        np.float32_t[:] target_vector):
    cdef Py_ssize_t i
    for i in range(1000000):
        x = np.dot(target_vector, vectors[i])
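For completeness, this is roughly how I build the extension (a typical setup.py sketch; the exact build configuration may differ on your machine):

# setup.py -- minimal build script for the example above
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("my_cython.pyx"),
    include_dirs=[np.get_include()],
)

Built with python setup.py build_ext --inplace.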
Benchmarking from an IPython REPL:
from my_cython import cython_test1, cython_test2
import numpy as np
vectors = np.random.random((1000000, 100)).astype(np.float32)
target_vector = np.random.random(100).astype(np.float32)
%timeit cython_test1(vectors, target_vector)
%timeit cython_test2(vectors, target_vector)
cython_test1: 462 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cython_test2: 1.23 s ± 8.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is a reduced example to show the problem; I know that in this case it clearly makes no sense to run a dot product on every single row of a numpy array sequentially.
Running with the profiler attached:
import pstats, cProfile
cProfile.runctx("cython_test2(vectors, target_vector)", globals(), locals(), "Profile.prof")
s = pstats.Stats("Profile.prof")
s.strip_dirs().sort_stats("time").print_stats()
Result:
 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1 2.070 2.070 3.263 3.263 speedy.pyx:52(cython_test2)
2000000 0.705 0.000 0.867 0.000 stringsource:995(memoryview_fromslice)
4000000 0.155 0.000 0.155 0.000 stringsource:514(__getbuffer__)
2000002 0.090 0.000 0.090 0.000 stringsource:372(__dealloc__)
2000002 0.082 0.000 0.082 0.000 stringsource:345(__cinit__)
2000000 0.081 0.000 0.081 0.000 stringsource:972(__dealloc__)
2000000 0.080 0.000 0.080 0.000 stringsource:555(__get__)
...
Maybe it is creating a new slice view for each 1D vector that is accessed from the 2D matrix? If so, maybe the answer is simply "don't use memoryviews in this particular case". But that seems a bit at odds with the documentation, which says the newer API is faster and suggests it is always preferable.
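For reference, the kind of variant I would expect to avoid the per-row view objects is a hand-rolled dot product that only does scalar indexing into the 2D memoryview (an untested sketch, not benchmarked here; cython_test3 is just a hypothetical name):

def cython_test3(
        np.float32_t[:,:] vectors,
        np.float32_t[:] target_vector):
    cdef Py_ssize_t i, j
    cdef np.float32_t acc
    for i in range(1000000):
        acc = 0.0
        # Scalar indexing into the 2D memoryview, so no 1D slice
        # object should need to be created for each row.
        for j in range(vectors.shape[1]):
            acc += target_vector[j] * vectors[i, j]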