Further optimization of simple cython code

Question

I have a function written in cython that computes a certain measure of correlation (distance correlation) via a double for loop:

%%cython -a
import numpy as np

def distances_Matrix(X):
    return np.array([[np.linalg.norm(xi-xj) for xi in X] for xj in X])

def c_dCov(double[:, :] a, double[:, :] b, int n):
    cdef int i
    cdef int j
    cdef double U       =  0
    cdef double W1      =  n/(n-1)
    cdef double W2      =  2/(n-2)
    cdef double[:] a_M  =  np.mean(a,axis=1)
    cdef double    a_   =  np.mean(a)
    cdef double[:] b_M  =  np.mean(b,axis=1)
    cdef double    b_   =  np.mean(b)

    for i in range(n):
        for j in range(n):
            if i != j:
                U = U + (a[i][j] + W1*(-a_M[i]-a_M[j]+a_)) * (b[i][j] +   W1*(-b_M[i]-b_M[j]+b_))
            else:
                U = U - W2*(W1**2)*(a_M[i] - a_) * (b_M[i] - b_)
    return U/(n*(n-3))

def c_dCor(X,Y):
    n     =  len(X)
    a     =  distances_Matrix(X)
    b     =  distances_Matrix(Y)
    V_XX  =  c_dCov(a,a,n) 
    V_YY  =  c_dCov(b,b,n)
    V_XY  =  c_dCov(a,b,n)
    return V_XY/np.sqrt(V_XX*V_YY)

When I compile this fragment of code I get the following report of optimization by the compiler:

Line 23 is still quite yellow, which indicates significant python interactions, how can I make that line further optimized?.

The operations done on that line are quite trivial, just products and sums, since I did specify the types of every array and variable used in that function, why do I get such a bad performance on that line?

Thanks in advance.

I was able to optimize line 23 by adding the `@cython.boundscheck(False)` decorator to `c_dCov `. That alone was enough to turn yellow to white. Though `@cython.boundscheck(False)` isn't in your code snippet, it does look like it's in the optimization report. Did you already try using `boundscheck`, and did it not work? If so, you might just need to upgrade your Cython package. What version are you currently using? — tel, Nov 13 '18 at 04:43
The most obvious thing to optimize is the distances_Matrix function. eg. https://stackoverflow.com/q/50675705/4045774 (Don't forget to add a sqrt to euclidean_distance_square_einsum). — max9111, Nov 13 '18 at 10:03
@max9111 I will now attempt to optimize distances_Matrix. Do you think doing it all with pure python loops in order to cythonize the function is a good approach? Thanks in advance for your insight. — Ettore Majorana, Nov 13 '18 at 15:37
At least for a bit larger problems, there isn't so much optimization potential, apart from a fast BLAS backend (eg. Intel MKL). You can simply take the euclidean_distance_square_einsum function and add a np.sqrt . — max9111, Nov 13 '18 at 15:59

tel · Accepted Answer · 2018-11-13T07:05:52.037

Short answer: disable bounds checking in the c_dCov function by adding the following decorator on the line right before it:

cimport cython
@cython.boundscheck(False)  # Deactivate bounds checking
def c_dCov(double[:, :] a, double[:, :] b, int n):

Alternatively, you can add a compiler directive to the top of your code. Right after your Cython magic line you would put:

%%cython -a
#cython: language_level=3, boundscheck=False

If you had a setup.py file, you could also globally turn bounds checking off there:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name="foo",
    ext_modules=cythonize('foo.pyx', compiler_directives={'boundscheck': False}),
)

Regardless of how it was done, disabling bounds checks was by itself enough to get the following optimization report:

Some other optimizations suggested by the Cython docs are turning off indexing with negative numbers, and declaring that your arrays are guaranteed to have a contiguous layout in memory. With all of those optimizations, the signature of c_dCov would become:

cimport cython
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def c_dCov(double[:, ::1] a, double[:, ::1] b, int n):

but only @cython.boundscheck(False) was needed to get the better optimization report.

Now that I look closer though, even though you don't have those optimizations in your code snippet, you do have the boundscheck(False) and wraparound(False) decorators in the code in your optimization report. Did you already try those and they didn't work? What version of Cython are you running? Maybe you need an upgrade.

Explanation

Every time you access an array by index, a bounds check occurs. This is so that when you have an array arr of shape (5,5) and you try to access arr[19,27], your program will spit out an error instead of letting you access out of bounds data. However, for the sake of speed, some languages don't do bounds check on array access (eg C/C++). Cython lets you optionally turn off bounds checks in order to optimize performance. With Cython, you can either disable bounds checking globally for a whole program with the boundscheck compiler directive, or for a single function with the @cython.boundscheck(False) decorator.

Thanks a lot! Indeed I had those decorators but they didn't seem to have any effect on the optimization, so I removed them in the end. I updated my cython version and now the report looks like yours. — Ettore Majorana, Nov 13 '18 at 15:31
@Ettore Majorana Declaring contigous arrays like c_dCov(double[:, ::1] or disabling boundschecking can have a quite significant effect if you get working SIMD vectorization by this. This also heavily depends on the compiler flags. Nevertheless the creation of the distances matrix should more time consuming than the simple loop you wanted to optimize in the first place. — max9111, Nov 13 '18 at 19:19

Further optimization of simple cython code

1 Answers1

Explanation