4

I want to speed up the cosine distance calculation scipy.spatial.distance.cosine as much as possible, so I tried plain numpy

import numpy as np

def alt_cosine(x,y):
    return 1 - np.inner(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

Then I tried cython

from libc.math cimport sqrt
import numpy as np

def alt_cosine_2(x,y):
    return 1 - np.inner(x,y)/sqrt(np.dot(x,x)*np.dot(y,y))

and got gradual improvements (tested on numpy arrays of length 50)

>>> cosine() # ... make some timings
5.27526156300155e-05 # mean calculation time for one loop

>>> alt_cosine() 
9.913400815003115e-06

>>> alt_cosine_2()
7.0269494536660205e-06

What is the fastest way to do this? Unfortunately, I was not able to specify variable types for alt_cosine_2. I will use this function with numpy arrays of type np.float32.

mccandar
  • 778
  • 8
  • 16

2 Answers

9

There is a belief that numpy's functionality cannot be sped up with the help of cython or numba. But this is not entirely true: numpy's goal is to offer great performance for a wide range of scenarios, which also means somewhat less than perfect performance for special scenarios.

With a particular scenario at hand, you have a chance to improve on numpy's performance, even if it means rewriting some of numpy's functionality. For example, in this case we can accelerate the function by a factor of 4 using cython and by a factor of 8 using numba.

Let's start with your versions as baseline (see listings at the end of the answer):

>>>%timeit cosine(x,y)   # scipy's
31.9 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

>>>%timeit np_cosine(x,y)  # your numpy-version
4.05 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>>%timeit np_cosine_fhtmitchell(x,y)  # @FHTmitchell's version
4 µs ± 53.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>>%timeit np_cy_cosine(x,y)
2.56 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So I cannot see the improvement of @FHTmitchell's version, but otherwise the results are not much different from your timings.

Your vectors have only 50 elements, so the actual calculation needs somewhere around 200-300 ns: everything else is the overhead of calling functions. One possibility to reduce the overhead is to "inline" these functions by hand with the help of cython:

%%cython 
from libc.math cimport sqrt
import numpy as np
cimport numpy as np

def cy_cosine(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    cdef double xx=0.0
    cdef double yy=0.0
    cdef double xy=0.0
    cdef Py_ssize_t i
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/sqrt(xx*yy)

which leads to:

>>> %timeit cy_cosine(x,y)
921 ns ± 19.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
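As a sanity check (not part of the original answer), the hand-rolled accumulation loop can be verified against the numpy formula with a plain-Python equivalent of the same loop:

```python
import numpy as np

def loop_cosine(x, y):
    # plain-Python equivalent of the cy_cosine loop, for a correctness check
    xx = yy = xy = 0.0
    for a, b in zip(x, y):
        xx += a * a
        yy += b * b
        xy += a * b
    return 1.0 - xy / (xx * yy) ** 0.5

x = np.arange(50, dtype=np.float64)
y = np.arange(50, 100, dtype=np.float64)

reference = 1 - np.inner(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))
print(np.isclose(loop_cosine(x, y), reference))  # True
```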

Not bad! We can try to squeeze out even more performance by letting go of some safety (run-time checks + the ieee-754 standard) by making the following changes:

%%cython  -c=-ffast-math
...

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_cosine_perf(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    ...

which leads to:

>>> %timeit cy_cosine_perf(x,y)
828 ns ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
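Regarding the np.float32 arrays mentioned in the question: one possible way to support both float32 and float64 without duplicating the code is a cython fused type. An untested sketch along the same lines as cy_cosine_perf:

```cython
%%cython -c=-ffast-math
cimport cython
from libc.math cimport sqrt

ctypedef fused floating:
    float
    double

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_cosine_fused(floating[:] x, floating[:] y):
    # accumulate in double precision regardless of the input dtype
    cdef double xx = 0.0, yy = 0.0, xy = 0.0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        xx += x[i] * x[i]
        yy += y[i] * y[i]
        xy += x[i] * y[i]
    return 1.0 - xy / sqrt(xx * yy)
```

Cython generates one specialization per dtype and dispatches on the argument type at call time.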

i.e. another 10%, which means almost a factor of 5 faster than the numpy-version.

There is another tool which offers similar functionality/performance - numba:

import numba as nb
import numpy as np
@nb.jit(nopython=True, fastmath=True)
def nb_cosine(x, y):
    xx,yy,xy=0.0,0.0,0.0
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/np.sqrt(xx*yy)

which leads to:

>>> %timeit nb_cosine(x,y)
495 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

a speed-up of 8 compared to the original numpy version.

There are some reasons why numba can be faster: cython handles the stride of the data at run time, which prevents some optimizations (e.g. vectorization). Numba seems to handle this better.
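One way (an untested sketch, not from the original timings) to give cython the same stride information at compile time is to declare the arguments as contiguous typed memoryviews:

```cython
%%cython -c=-ffast-math
from libc.math cimport sqrt

def cy_cosine_contig(double[::1] x, double[::1] y):
    # double[::1] promises C-contiguous data (unit stride),
    # which lets the C compiler vectorize the loop
    cdef double xx = 0.0, yy = 0.0, xy = 0.0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        xx += x[i] * x[i]
        yy += y[i] * y[i]
        xy += x[i] * y[i]
    return 1.0 - xy / sqrt(xx * yy)
```

The trade-off is that the function then rejects non-contiguous inputs (e.g. slices with a step).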

But here the difference is entirely due to less overhead for numba:

%%cython  -c=-ffast-math
import numpy as np
cimport numpy as np

def cy_empty(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    return x[0]*y[0]

import numba as nb
import numpy as np
@nb.jit(nopython=True, fastmath=True)
def nb_empty(x, y):
    return x[0]*y[0]

%timeit cy_empty(x,y)
753 ns ± 6.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit nb_empty(x,y)
456 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

almost factor 2 less overhead for numba!

As @max9111 has pointed out, numba inlines other jitted functions, but it is also able to call some numpy functions with very little overhead, so the following version (replacing inner with dot):

import numba as nb
import numpy as np

@nb.jit(nopython=True, fastmath=True)
def np_nb_cosine(x,y):
    return 1 - np.dot(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

>>> %timeit np_nb_cosine(x,y)
605 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) 

only about 10% slower.


Please be aware that the above comparison is only valid for vectors with 50 elements. For more elements, the situation is completely different: the numpy-version uses the parallelized mkl (or similar) implementation of the dot-product and will beat our simple attempts easily.

This raises the question: is it really worth optimizing the code for a special input size? Sometimes the answer is "yes" and sometimes the answer is "no".

If possible, I would go with the numba + dot solution, which is very fast for small inputs but also has the full power of the mkl-implementation for larger inputs.


There is also a slight difference: the first versions return an np.float64 object, while the cython and numba versions return a Python float.
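This difference can be illustrated without cython or numba (a plain-Python loop stands in for the compiled versions here):

```python
import numpy as np

x = np.arange(50, dtype=np.float64)
y = np.arange(50, 100, dtype=np.float64)

# the numpy formula returns a numpy scalar
np_result = 1 - np.inner(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))
print(type(np_result).__name__)  # float64

# a scalar accumulation loop returns a plain Python float
xx = yy = xy = 0.0
for a, b in zip(x.tolist(), y.tolist()):
    xx += a * a
    yy += b * b
    xy += a * b
py_result = 1.0 - xy / (xx * yy) ** 0.5
print(type(py_result).__name__)  # float
```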


Listings:

from scipy.spatial.distance import cosine
import numpy as np
x=np.arange(50, dtype=np.float64)
y=np.arange(50,100, dtype=np.float64)

def np_cosine(x,y):
    return 1 - np.inner(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

from numpy import inner, sqrt, dot
def np_cosine_fhtmitchell(x,y):
    return 1 - inner(x,y)/sqrt(dot(x,x)*dot(y,y))

%%cython
from libc.math cimport sqrt
import numpy as np
def np_cy_cosine(x,y):
    return 1 - np.inner(x,y)/sqrt(np.dot(x,x)*np.dot(y,y))

ead
  • 32,758
  • 6
  • 90
  • 153
  • Very good answer! It would be good to also mention, that small njitted functions called from other jitted functions will likely be inlined, avoiding the calling overhead completely. – max9111 Jul 20 '18 at 08:24
2

Of the lazy ways of speeding up this kind of code:

  1. using the numexpr Python module
  2. using the numba Python module
  3. using SciPy equivalents of NumPy functions

Unfortunately, none of these tricks works for you because:

  1. dot and inner are not implemented in numexpr
  2. numba (like cython) would not speed up calls to numpy's functions
  3. dot and inner are not implemented differently in scipy (they are not even available in its namespace).

Perhaps your best bet is to compile numpy against different underlying linear-algebra libraries (e.g. LAPACK, BLAS, OpenBLAS, etc.) and with different compilation options (e.g. multithreading and the like) to see which combination proves the most effective for your use case.
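Before recompiling anything, it is worth checking which backend the installed numpy build already uses; numpy exposes its build configuration:

```python
import numpy as np

# prints the BLAS/LAPACK libraries this numpy build was linked against
# (output varies between builds: OpenBLAS, MKL, ...)
np.show_config()
```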

Good luck!

norok2
  • 25,683
  • 4
  • 73
  • 99