
I have some code that only needs to run a calculation on a subset of the rows of a 2D numpy array. E.g. if the array is, say:

[[0.5, 0.5, 0.5],
 [0.9, 0.2, 0.8],
 [0.9, 0.2, 0.8],
 [0.9, 0.2, 0.2]]

I might only want to use the 1st and 3rd row. If most rows are to be excluded, it is faster to loop through the array than to do a vectorized computation.
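To make that concrete, here is a rough sketch in plain numpy (the variable names and the selected indices are just for illustration, not from my real code):

import numpy as np

vectors = np.random.random((1000000, 100)).astype(np.float32)
target_vector = np.random.random(100).astype(np.float32)
wanted_rows = [0, 2]  # e.g. only the 1st and 3rd row are actually needed

# Vectorized: computes a dot product for every row, even the discarded ones.
all_results = vectors @ target_vector

# Loop: only touches the few rows that are needed.
few_results = [np.dot(vectors[i], target_vector) for i in wanted_rows]

When wanted_rows is tiny compared to the total number of rows, the loop is the situation I mean above.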

Cython can speed up iteration, but some experimenting seems to suggest that using the newer "memoryviews" API results in a large number of extra allocations (I presume memory view wrapper objects?), which makes it considerably slower than the old np.ndarray API, and under some circumstances slower than vanilla Python.

Here is a minimal example that demonstrates the problem:

In the file my_cython.pyx:

# cython: profile=True

import numpy as np
cimport numpy as np

def cython_test1(
    np.ndarray vectors,
    np.ndarray target_vector):
    # Plain np.ndarray arguments (the older API); vectors[i] is a 1D ndarray.

    cdef Py_ssize_t i

    for i in range(1000000):
        x = np.dot(target_vector, vectors[i])

def cython_test2(
    np.float32_t[:,:] vectors,
    np.float32_t[:] target_vector):
    # Typed memoryview arguments (the newer API); vectors[i] creates a new
    # 1D memoryview slice on every iteration.

    cdef Py_ssize_t i

    for i in range(1000000):
        x = np.dot(target_vector, vectors[i])

Benchmarking from an IPython REPL:

from my_cython import cython_test1, cython_test2
import numpy as np 
vectors = np.random.random((1000000, 100)).astype(np.float32)
target_vector = np.random.random(100).astype(np.float32)
%timeit cython_test1(vectors, target_vector)
%timeit cython_test2(vectors, target_vector)

cython_test1: 462 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

cython_test2: 1.23 s ± 8.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is a reduced example to show the problem; I know that in this case it clearly makes no sense to run a dot product against every single row of a numpy array sequentially.

Running with the profiler attached:

import pstats, cProfile 
cProfile.runctx("cython_test2(vectors, target_vector)", globals(), locals(), "Profile.prof") 
s = pstats.Stats("Profile.prof") 
s.strip_dirs().sort_stats("time").print_stats() 

Result:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
1    2.070    2.070    3.263    3.263 speedy.pyx:52(cython_test2)
2000000    0.705    0.000    0.867    0.000 stringsource:995(memoryview_fromslice)
4000000    0.155    0.000    0.155    0.000 stringsource:514(__getbuffer__)
2000002    0.090    0.000    0.090    0.000 stringsource:372(__dealloc__)
2000002    0.082    0.000    0.082    0.000 stringsource:345(__cinit__)
2000000    0.081    0.000    0.081    0.000 stringsource:972(__dealloc__)
2000000    0.080    0.000    0.080    0.000 stringsource:555(__get__)
...

Maybe it is creating a new slice view for each 1D vector that is accessed from the 2D matrix? If so, maybe the answer is "don't use memoryviews in this particular instance". But that seems a bit at odds with the documentation, which says the newer API is faster and suggests it is always preferable.
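If that is what is happening, one workaround I have been considering (just a sketch, not benchmarked; cython_test3 is a made-up name) is to index the 2D memoryview directly and write the reduction out by hand, so the loop never creates a per-row memoryview slice and never calls back into numpy:

# cython: boundscheck=False, wraparound=False

import numpy as np
cimport numpy as np

def cython_test3(
    np.float32_t[:,:] vectors,
    np.float32_t[:] target_vector):

    cdef Py_ssize_t i, j
    cdef np.float32_t acc

    for i in range(vectors.shape[0]):
        acc = 0.0
        # Accumulate the dot product element by element; no 1D memoryview
        # slice is created and no Python object is touched inside the loop.
        for j in range(vectors.shape[1]):
            acc = acc + target_vector[j] * vectors[i, j]

I haven't measured whether this closes the gap, but it should at least avoid the memoryview_fromslice calls that dominate the profile above.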

  • https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html, which introduces `typed memoryviews`, demonstrates their use in iterative code from which Cython can easily create `C` code. Your test uses `np.dot`. – hpaulj Apr 01 '19 at 20:52
  • You can use Cython to speed up numpy to some degree in some cases. `np.dot` is not one of them; see for example https://stackoverflow.com/q/10442365/5769463 – ead Apr 02 '19 at 04:34
  • Maybe try `numba`? – juanpa.arrivillaga Apr 02 '19 at 05:05
  • np.dot is only one part of the calculations in the real example. I didn't expect np.dot to become faster, but the overhead created by memory views makes the overall loop slower. The old non-memoryview API manages to speed things up without extra allocations. For sure I can imagine that in other circumstances, with a higher ratio of C code to numpy code, memory views would be faster. But I was wondering if there was a more definitive answer as to how much overhead memory views add over the old Cython API vs how much faster the new memoryview API makes non-numpy code. – junichiro Apr 02 '19 at 07:04
  • My experience is that the initial check for a memoryview is slower than for the `ndarray` syntax. It isn't quite like-for-like because you aren't checking the internal data type for the `ndarray` version. My guess would be that using Cython but not specifying any types will be fastest here. – DavidW Apr 02 '19 at 07:45
  • There is something wrong with both versions (e.g. pretty much the same code in Numba results in 37 ms per call). I am not that experienced in Cython, but it looks like both versions have a lot of interaction with the Python interpreter in the line `x = np.dot(target_vector, vectors[i])`, which should be avoided entirely. – max9111 Apr 02 '19 at 12:14
  • it looks like numba does something special for numpy calls like np.dot, so I'll definitely look into numba. It also looks much simpler to use! – junichiro Apr 02 '19 at 14:42

0 Answers