
I have a function written with regular numpy ndarrays and another one with a typed memoryview. However, I couldn't get the memoryview version to run faster than the regular version (unlike many of the blog posts, such as the memoryview benchmarks).

Any pointers / suggestions to increase the speed of the memoryview code versus the numpy alternative would be greatly appreciated! ... OR ... if anyone can point out a glaring reason why the memoryview version is not much faster than the regular numpy version.

In the code below there are two functions, both of which take in two vectors bi and xi and return a matrix. The first function, shrink_correl, is the regular numpy version, and the second function, shrink_correl2, is the memoryview alternative (let the file be sh_cor.pyx).

# cython: boundscheck=False
# cython: wraparound=False
# cython: cdivision=True

cimport cython
cimport numpy as np
import numpy as np
from numpy cimport ndarray as ar

# -- ***this is the Regular Cython version*** -
cpdef ar[double, ndim=2, mode='c'] shrink_correl(ar[double, ndim=1, mode='c'] bi, ar[double, ndim=1, mode='c'] xi):
    cdef:
        int n_ = xi.shape[0]
        int n__ = int(n_*(n_-1)/2)
        ar[double, ndim=2, mode='c'] f = np.zeros([n__, n_+1])
        int x__ = 0
        ar[double, ndim=2, mode='c'] f1 = np.zeros([n_, n_+1])
        ar[double, ndim=2, mode='c'] f2 = np.zeros([n__, n_+1])
        ar[double, ndim=1, mode='c'] g = np.zeros(n_+1)
        ar[double, ndim=1, mode='c'] s = np.zeros(n__)
        ar[double, ndim=2, mode='c'] cori_ = np.zeros([n_, n_])
        Py_ssize_t j, k

    with nogil:
        for j in range(0, n_-1):
            for k in range(j+1, n_):
                x__ += 1
                f[x__-1, j] = bi[k]*xi[k]*1000
                f[x__-1, k] = bi[j]*xi[j]*1000
    f1 = np.dot(np.transpose(f), f)      
    with nogil:
        for j in range(0, n_):
            f1[n_, j] = xi[j]*1000
            f1[j, n_] = f1[n_, j]
    f2 = np.dot(f, np.linalg.inv(f1))
    with nogil:
        for j in range(0, n_):
            g[j] = -bi[j]*xi[j]*1000

    s = np.dot(f2, g)

    with nogil:
        for j in range(0, n_):
            cori_[j, j] = 1.0
    x__ = 0

    with nogil:
        for j in range(0, n_-1):
            for k in range(j+1, n_):
                x__ += 1
                cori_[j, k] = s[x__-1]
                cori_[k, j] = cori_[j, k]
    return cori_

# -- ***this is the MemoryView Cython version*** -    
cpdef ar[double, ndim=2, mode='c'] shrink_correl2(double[:] bi, double[:] xi):
    cdef:
        int n_ = xi.shape[0]
        int n__ = int(n_*(n_-1)/2)
        double[:, ::1] f = np.zeros([n__, n_+1])
        int x__ = 0
        double[:, ::1] f1 = np.zeros([n_, n_+1])
        double[:, ::1] f2 = np.zeros([n__, n_+1])
        double[:] g = np.zeros(n_+1)
        double[:] s = np.zeros(n__)
        double[:, ::1] cori_ = np.zeros([n_, n_])
        ar[double, ndim=2, mode='c'] cori__ = np.zeros([n_, n_])
        Py_ssize_t j, k
    with nogil:
        for j in range(0, n_-1):
            for k in range(j+1, n_):
                x__ += 1
                f[x__-1, j] = bi[k]*xi[k]*1000
                f[x__-1, k] = bi[j]*xi[j]*1000
    f1 = np.dot(np.transpose(f), f)      
    with nogil:
        for j in range(0, n_):
            f1[n_, j] = xi[j]*1000
            f1[j, n_] = f1[n_, j]
    f2 = np.dot(f, np.linalg.inv(f1))
    with nogil:
        for j in range(0, n_):
            g[j] = -bi[j]*xi[j]*1000

    s = np.dot(f2, g)

    with nogil:
        for j in range(0, n_):
            cori_[j, j] = 1.0
    x__ = 0

    with nogil:
        for j in range(0, n_-1):
            for k in range(j+1, n_):
                x__ += 1
                cori_[j, k] = s[x__-1]
                cori_[k, j] = cori_[j, k]
    cori__[:, :] = cori_
    return cori__

This is compiled using the following setup.py code

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np
import os

ext_modules = [Extension('sh_cor', ['sh_cor.pyx'], include_dirs=[np.get_include(),
                                                                 os.path.join(np.get_include(), 'numpy')],
                         define_macros=[('NPY_NO_DEPRECATED_API', None)],
                         extra_compile_args=['-O3', '-march=native', '-ffast-math', '-flto'],
                         libraries=['m']
                         )]

setup(
    name="Sh Cor",
    cmdclass={'build_ext': build_ext},
    ext_modules=ext_modules
)

The code used to test the speeds is

import numpy as np
import sh_cor  # this the library created by the setup.py file
import time

b = np.random.random(400)
b = b/np.sum(b)

x = np.random.random(400)-0.5

n = 10 

t0 = time.time()
for i in range(n):
    v1 = sh_cor.shrink_correl(b, x)
t1 = time.time()
print((t1-t0)/n)

t0 = time.time()
for i in range(n):
    v2 = sh_cor.shrink_correl2(b, x)
t1 = time.time()
print((t1-t0)/n)

The output on my PC is:

0.7070999860763549   # regular numpy
0.6726999998092651   # memoryview

Using memoryviews (in the code above) only gives me a ~5% speed boost (unlike the huge speed boosts reported in the blog posts).

  • What's different? Just the specification of the function parameters? When I see that amount of code without explanation my eyes glaze over. – hpaulj Oct 05 '15 at 16:58
  • Yes, pretty much; the bulk of the function is just loops. – uday Oct 05 '15 at 17:39
  • My impression from http://docs.cython.org/src/userguide/memoryviews.html is that you get the most benefit from `memoryviews` when you focus their use on low-level operations. But I haven't studied your code enough to know if that's your case. – hpaulj Oct 05 '15 at 17:56
  • Like hpaulj, my eyes glazed over a little too at the full code... I personally haven't seen a huge speed boost from memoryviews, and I've found they seem to have a slightly higher overhead on calling. I think they're a nicer (and more general) interface, but you shouldn't expect much more than that. – DavidW Oct 05 '15 at 18:40
  • Additionally, the advantages the blog you link to talks about apply when you're creating slices, e.g. `double [:] x = some_2d_array[:,0]`, which is apparently inefficient with `ndarrays`. Since you don't do that here, you probably won't get that benefit. Straightforward element access is likely to be very similar (look at the generated C code to check). – DavidW Oct 06 '15 at 10:30
  • One difference that I see between the two is that in your first function the input `ndarrays`, as well as `s` and `g`, are C-contiguous, whereas in the second function they are not. Use `double[::1]` instead of `double[:]`. Other than that, most of your operations go through the same numpy C-API, so it is certain that you are not going to see any speed-up. If you let Cython spit out the annotated HTML file (`cython -a`), you will essentially see similar C code. – romeric Oct 06 '15 at 12:34
  • @uday You just need to use OpenMP and put your loops in prange statements and you'll use all your cores; let me know if you want me to post an answer with your edited Cython files. In my experience, ndarrays and memoryviews seem to always have similar performance; why else go through the hassle of nogil? You should also rewrite the np.transpose and np.dot sections as Cython functions that run without the GIL to get maximum performance (a sketch follows these comments). – Matt Jun 19 '17 at 16:32
  • @Matt, thanks for your reply. Would it be possible to post that as an answer? That would be much appreciated. Kind regards. – uday Jun 26 '17 at 22:41
  • As others have mentioned, if you are just iterating and doing basic slicing you won't see much improvement. If you had to create multiple views of your data, then you might see a bit more overhead for numpy. In general there is no huge difference. – Imanol Luengo Jun 27 '17 at 16:00
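To make the last two suggestions concrete, here is a minimal sketch (the function name scale_vec is illustrative, not from the question) combining romeric's fully contiguous `double[::1]` typing with Matt's prange suggestion, applied to the g-loop from the question:

from cython.parallel cimport prange
import numpy as np

cpdef double[::1] scale_vec(double[::1] bi, double[::1] xi):
    # Contiguous typed views can be indexed without the GIL;
    # prange spreads the iterations across OpenMP threads.
    cdef Py_ssize_t j, n = xi.shape[0]
    cdef double[::1] g = np.empty(n)
    for j in prange(n, nogil=True):
        g[j] = -bi[j] * xi[j] * 1000
    return g

For the loop to actually run in parallel, the extension must be built with OpenMP enabled, e.g. by adding -fopenmp (for GCC) to both extra_compile_args and extra_link_args in the setup.py above.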

1 Answer


@uday Give me about a week, as I'm away from my computer, but here's where to speed things up to get you started:

1) Instead of grabbing the GIL with np.transpose, create a memoryview of the transpose BEFORE any loops. You already have the variable f declared as a memoryview that doesn't need the GIL, so just create a view on it, e.g. `cdef double[::1, :] f_T = f.T` (note the Fortran-contiguous type: transposing a C-contiguous view yields a Fortran-contiguous one).
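A minimal sketch of point 1 (the function name is illustrative; the shapes follow the question):

import numpy as np

def make_transposed_view(Py_ssize_t n_, Py_ssize_t n__):
    cdef double[:, ::1] f = np.zeros((n__, n_ + 1))
    # .T on a memoryview is a zero-copy transpose: since f is
    # C-contiguous, f.T is Fortran-contiguous, hence double[::1, :].
    cdef double[::1, :] f_T = f.T
    # f_T can now be indexed inside `with nogil:` blocks without
    # calling back into numpy.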

2) This step is a little more tricky, as you need a C/C++-style wrapper for np.dot. In this case, be sure the call to the dgemm function sits inside a `with nogil:` block (indented beneath it) so that the GIL is released: https://gist.github.com/pv/5437087. That example looks to work, although you'll have to save out the included f2pyptr.h file and put it where your project is being built; I also suspect you should add `cimport numpy as np`. If it needs modifications, you can see what I've done in another post: Calling BLAS / LAPACK directly using the SciPy interface and Cython (pointer issue?), which also covers how to add MKL. Then you'll need to add `from cython.parallel cimport prange` at the top, change all the loops from range to prange, make sure all your prange sections are nogil, and cdef-declare every variable before it is operated on. You'll also have to add the OpenMP flag (-fopenmp for GCC) to the compiler arguments in your setup.py, as well as to the link arguments. Ask more questions if you need clarification; this isn't as easy as it should be, but with a little guidance it becomes quite simple. Basically, once your setup.py is modified to include everything, it will work going forward.
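For what it's worth, SciPy 0.16 and later exposes these BLAS routines to Cython directly via scipy.linalg.cython_blas, which avoids saving out f2pyptr.h by hand. A rough sketch, assuming Fortran-ordered inputs (matmul_nogil is an illustrative name, not part of the gist):

from scipy.linalg.cython_blas cimport dgemm

cdef void matmul_nogil(double[::1, :] a, double[::1, :] b,
                       double[::1, :] c) nogil:
    # Computes c = a @ b with BLAS dgemm; all three views must be
    # Fortran-ordered, which matches the f_T view from point 1.
    cdef int m = a.shape[0], k = a.shape[1], n = b.shape[1]
    cdef double alpha = 1.0, beta = 0.0
    cdef char trans = b'N'  # use the matrices as-is, no transposition
    dgemm(&trans, &trans, &m, &n, &k, &alpha,
          &a[0, 0], &m, &b[0, 0], &k, &beta, &c[0, 0], &m)

Because the whole function is declared nogil, it can be called from inside `with nogil:` blocks or prange loops in place of the np.dot calls.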

3) Although this is probably the easiest thing to fix: get rid of that list. Make it a numpy array (or a pandas DataFrame if you need text alongside data). Whenever I've used lists for numeric data, the slowdown has been incredible.
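A toy illustration of point 3 (the variable names are made up):

import numpy as np

values = [0.1, 0.2, 0.3]  # plain Python list: every element access boxes a float object
arr = np.asarray(values, dtype=np.float64)  # contiguous typed buffer,
                                            # which is what a double[::1] view needs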

Matt