There is a belief that numpy's functionality cannot be sped up with the help of cython or numba. But this is not entirely true: numpy's goal is to offer great performance for a wide range of scenarios, which also means somewhat less than perfect performance for special scenarios.
With a particular scenario at hand, you have a chance to improve on numpy's performance, even if it means rewriting some of numpy's functionality. For example, in this case we can accelerate the function by a factor of 4 using cython and by a factor of 8 using numba.
Let's start with your versions as a baseline (see listings at the end of the answer):
>>>%timeit cosine(x,y) # scipy's
31.9 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>>%timeit np_cosine(x,y) # your numpy-version
4.05 µs ± 19.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>>%timeit np_cosine_fhtmitchell(x,y) # @FHTmitchell's version
4 µs ± 53.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>>%timeit np_cy_cosine(x,y)
2.56 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So I cannot see the improvement from @FHTmitchell's version; otherwise the results are not different from your timings.
Your vectors have only 50 elements, so somewhere around 200-300 ns are needed for the actual calculation: everything else is the overhead of calling functions. One possibility to reduce the overhead is to "inline" these functions by hand with the help of cython:
%%cython
from libc.math cimport sqrt
import numpy as np
cimport numpy as np
def cy_cosine(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    cdef double xx=0.0
    cdef double yy=0.0
    cdef double xy=0.0
    cdef Py_ssize_t i
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/sqrt(xx*yy)
which leads to:
>>> %timeit cy_cosine(x,y)
921 ns ± 19.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Not bad! We can try to squeeze out even more performance by letting go of some safety (run-time checks + strict IEEE-754 compliance) by making the following changes:
%%cython -c=-ffast-math
...
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_cosine_perf(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
...
which leads to:
>>> %timeit cy_cosine_perf(x,y)
828 ns ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
i.e. another 10%, which means it is almost a factor of 5 faster than the numpy version.
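For completeness, here is a sketch of what the full cell could look like (assuming the loop body stays exactly the same as in cy_cosine):
%%cython -c=-ffast-math
from libc.math cimport sqrt
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)  # drop the run-time bounds checks on indexing
@cython.wraparound(False)   # drop support for negative indices
def cy_cosine_perf(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    cdef double xx=0.0
    cdef double yy=0.0
    cdef double xy=0.0
    cdef Py_ssize_t i
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/sqrt(xx*yy)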
There is another tool which offers similar functionality and performance, numba:
import numba as nb
import numpy as np
@nb.jit(nopython=True, fastmath=True)
def nb_cosine(x, y):
    xx,yy,xy=0.0,0.0,0.0
    for i in range(len(x)):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/np.sqrt(xx*yy)
which leads to:
>>> %timeit nb_cosine(x,y)
495 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
a speed-up of 8 compared to the original numpy version.
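One caveat when timing numba: the first call of a jitted function triggers the JIT compilation, which takes much longer than a single evaluation; presumably the timing above was taken after such a warm-up call, e.g.:
nb_cosine(x, y)          # first call: triggers compilation for float64 arrays
%timeit nb_cosine(x, y)  # later calls: run the already compiled machine code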
There are some reasons why numba can be faster: Cython handles the strides of the data at run time, which prevents some optimizations (e.g. vectorization). Numba seems to handle this better.
But here the difference is entirely due to less overhead for numba:
%%cython -c=-ffast-math
import numpy as np
cimport numpy as np
def cy_empty(np.ndarray[np.float64_t] x, np.ndarray[np.float64_t] y):
    return x[0]*y[0]

import numba as nb
import numpy as np
@nb.jit(nopython=True, fastmath=True)
def nb_empty(x, y):
    return x[0]*y[0]
%timeit cy_empty(x,y)
753 ns ± 6.81 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit nb_empty(x,y)
456 ns ± 2.47 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
almost a factor of 2 less overhead for numba!
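As a side note to the stride remark above: one way to give cython the missing information at compile time is to declare the arguments as contiguous typed memoryviews (double[::1]). A minimal sketch (cy_cosine_mv is just an illustrative name, not benchmarked here):
%%cython -c=-ffast-math
from libc.math cimport sqrt
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_cosine_mv(double[::1] x, double[::1] y):  # ::1 promises C-contiguous data
    cdef double xx=0.0, yy=0.0, xy=0.0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        xx+=x[i]*x[i]
        yy+=y[i]*y[i]
        xy+=x[i]*y[i]
    return 1.0-xy/sqrt(xx*yy)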
As @max9111 has pointed out, numba inlines other jitted functions, but it is also able to call some numpy functions with very little overhead, so the following version (replacing inner with dot):
@nb.jit(nopython=True, fastmath=True)
def np_nb_cosine(x,y):
    return 1 - np.dot(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))
>>> %timeit np_nb_cosine(x,y)
605 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
i.e. only about 20% slower than the hand-written numba loop.
Please be aware that the above comparison is only valid for vectors with 50 elements. For more elements, the situation is completely different: the numpy version uses a parallelized mkl (or similar) implementation of the dot product and will beat our simple attempts easily.
This raises the question: is it really worth optimizing the code for a special size of input? Sometimes the answer is "yes" and sometimes the answer is "no".
If possible, I would go with the numba + dot solution, which is very fast for small inputs but also has the full power of the mkl implementation for larger inputs.
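To see where the crossover happens on your machine, a quick sketch like the following could be used (the sizes are arbitrary and the function names refer to the listings below; the first numba call includes compilation, so warm the jitted functions up beforehand if needed):
from timeit import timeit
import numpy as np

for n in (50, 1000, 100000):
    a = np.random.rand(n)
    b = np.random.rand(n)
    # assumes the functions from the listings/snippets above are already defined
    for name, f in (("np_cosine", np_cosine), ("nb_cosine", nb_cosine), ("np_nb_cosine", np_nb_cosine)):
        t = timeit(lambda: f(a, b), number=10000) / 10000
        print("n=%7d %15s: %8.0f ns" % (n, name, t * 1e9))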
There is also a slight difference: the first versions return an np.float64 object, while the cython and numba versions return a Python float.
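This is easy to check in the interpreter:
>>> type(np_cosine(x, y))
<class 'numpy.float64'>
>>> type(cy_cosine(x, y))
<class 'float'>
>>> type(nb_cosine(x, y))
<class 'float'>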
Listings:
from scipy.spatial.distance import cosine
import numpy as np
x=np.arange(50, dtype=np.float64)
y=np.arange(50,100, dtype=np.float64)
def np_cosine(x,y):
    return 1 - np.inner(x,y)/np.sqrt(np.dot(x,x)*np.dot(y,y))

from numpy import inner, sqrt, dot
def np_cosine_fhtmitchell(x,y):
    return 1 - inner(x,y)/sqrt(dot(x,x)*dot(y,y))

%%cython
from libc.math cimport sqrt
import numpy as np
def np_cy_cosine(x,y):
    return 1 - np.inner(x,y)/sqrt(np.dot(x,x)*np.dot(y,y))
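As a final remark: the %%cython cells above use IPython's cython cell magic (available after %load_ext cython). Outside of a notebook, one simple way to compile and use the same code is pyximport; a minimal sketch, assuming the cython code has been saved as cy_cosine.pyx:
import numpy as np
import pyximport
# make the numpy headers available, needed because of "cimport numpy"
pyximport.install(setup_args={"include_dirs": np.get_include()})

from cy_cosine import cy_cosine  # compiles cy_cosine.pyx on first import

x = np.arange(50, dtype=np.float64)
y = np.arange(50, 100, dtype=np.float64)
print(cy_cosine(x, y))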