
I'm seeing some bizarre behavior in my code. I'm writing code to compute a forward Kalman filter, but my state transition model has many 0s in it, so it would be nice to be able to calculate only certain elements of the covariance matrices.

So to test this out, I wanted to try filling individual array elements from Cython. To my surprise, I found

  1. that writing output to specific array locations (function fill(...)) is very slow compared to just assigning each result to a scalar variable (function nofill(...), which essentially throws the results away), and

  2. that setting C=0.1 or C=31 doesn't affect how long nofill(...) takes to run, but the latter choice of C makes fill(...) run about 2x slower. This is baffling to me. Can anyone explain why I'm seeing this?

Code:

#################  file way_too_slow.pyx
from libc.math cimport sin

#  Setting C=0.1 or 31 doesn't affect the performance of calling nofill(...), but it makes fill(...) slower.  I have no clue why.
cdef double C = 0.1

#  This function just throws away its output.

def nofill(double[::1] x, double[::1] y, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double d

    with nogil:
        for 0 <= i < N:
            d = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)  #  C appears here

#  Same computation, but this version keeps its output.
#  However (see #1 above): MUCH slower than nofill(...).
def fill(double[::1] x, double[::1] y, double[::1] out, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double *p_o = &out[0]
    cdef double d

    with nogil:
        for 0 <= i < N:
            p_o[i] = ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)    # C appears here

The above code is called by the Python program:

####################  run_way_too_slow.py
import numpy as _N
import way_too_slow as _wts
import time as _tm

N = 80000
x = _N.random.randn(N)
y = _N.random.randn(N)
out  = _N.empty(N)

t1 = _tm.time()
_wts.nofill(x, y, N)
t2 = _tm.time()
_wts.fill(x, y, out, N)
t3 = _tm.time()

print "nofill() ET: %.3e" % (t2-t1)
print "fill()   ET: %.3e" % (t3-t2)

print "fill() is slower by factor %.3f" % ((t3-t2)/(t2-t1))

The Cython code was compiled using the following setup.py file:

#################  setup.py
from distutils.core import setup
from distutils.sysconfig import get_python_inc
from distutils.extension import Extension
from Cython.Distutils import build_ext

incdir=[get_python_inc(plat_specific=1)]
libdir = ['/usr/local/lib']

cmdclass = {'build_ext' : build_ext}

ext_modules = Extension("way_too_slow",
                        ["way_too_slow.pyx"],
                        include_dirs=incdir,   #  include_dirs for Mac
                        library_dirs=libdir)

setup(
    name="way_too_slow",
    cmdclass = cmdclass,
    ext_modules = [ext_modules]
)
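
The extension can then be built in place in the usual way, e.g.:

    python setup.py build_ext --inplace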

Here is a typical output of running "run_way_too_slow.py" using C=0.1

>>> exf("run_way_too_slow.py")
nofill() ET: 6.700e-05
fill()   ET: 6.409e-04
fill() is slower by factor 9.566

A typical run with C=31.

>>> exf("run_way_too_slow.py")
nofill() ET: 6.795e-05
fill()   ET: 1.566e-03
fill() is slower by factor 23.046

As we can see:

  1. Assigning into a specific array location is quite slow compared to assigning to a scalar double.

  2. For some reason, the speed of the assignment seems to depend on the value used in the calculation (here, C), which makes no sense to me.

Any insight would be greatly appreciated.


1 Answer


Two things explain your observation:

A: In the first version, nothing actually happens: the C compiler is clever enough to see that the whole loop has no effect at all outside of the function, and it optimizes the loop away.

To force the computation to actually run, you must make the result d visible outside the function, for example via:

cdef double d=0
....
       d+=....
return d

It might still be faster than the write-to-an-array version because of fewer costly memory accesses, but you will now see a slowdown when the value of C is changed.
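
For example, a minimal sketch of such a variant (added to way_too_slow.pyx so that sin and C are in scope; the name nofill_sum is just illustrative) could look like this:

#  Variant of nofill(...): the accumulated result escapes the function,
#  so the C compiler can no longer optimize the loop away.
def nofill_sum(double[::1] x, double[::1] y, long N):
    cdef int i
    cdef double *p_x = &x[0]
    cdef double *p_y = &y[0]
    cdef double d = 0

    with nogil:
        for 0 <= i < N:
            d += ((p_x[i] + p_y[i])*3 + p_x[i] - p_y[i]) + sin(p_x[i]*C)
    return d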

B: sin is a complicated function, and how long it takes to calculate depends on its argument. For very small arguments the argument itself can be returned, but for bigger arguments much longer Taylor series (after argument reduction) must be evaluated. tanh is a similar case: it is also calculated via different approximations/Taylor series depending on the value of the argument, and the important point is that the time needed depends on the argument.
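
A rough way to see this effect yourself (not part of the original code; this just times libm's sin via math.sin, and the exact numbers depend on your platform's libm) would be something like:

#  Rough timing sketch: cost of sin for a small vs. a larger argument.
import timeit

t_small = timeit.timeit("sin(0.1)",  setup="from math import sin", number=1000000)
t_large = timeit.timeit("sin(31.0)", setup="from math import sin", number=1000000)

print("sin(0.1)  per call: %.3e s" % (t_small / 1e6))
print("sin(31.0) per call: %.3e s" % (t_large / 1e6))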

  • Thanks! Part A) was the key to making everything make sense. I didn't realize compilers were so smart, but when I made the loop actually do something, as you suggested, the runtimes became comparable, as expected. Because of what was happening in part A), the value of C had no effect in nofill(...) since that code was probably not being run at all, while for fill(...) the runtime cost of sin was directly reflected. – ken Mar 03 '18 at 10:06