#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True, language_level=3
cpdef int query(double[::1] q, double[:,::1] data) nogil:
    cdef:
        int n = data.shape[0]
        int dim = data.shape[1]
        int best_i = -1
        double best_ip = -1
        double ip
        int i, j
    # Linear scan: return the index of the row of `data` with the largest inner product with `q`
    for i in range(n):
        ip = 0
        for j in range(dim):
            ip += q[j] * data[i, j]
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i
After compiling, I time the code from Python:
import numpy as np
import ip
n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)
%timeit ip.query(q, X)
This takes roughly 100 ms. Meanwhile, the equivalent NumPy code:
%timeit np.argmax(q @ X.T)
takes just around 50 ms.
This is odd, since the NumPy code seemingly has to allocate the big intermediate array q @ X.T before taking the argmax.
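For scale, that intermediate is a length-n float64 array, so about 8 MB for the sizes above (a quick check using the q and X defined earlier):
print((q @ X.T).nbytes)  # 10**6 float64 values -> 8,000,000 bytes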
I therefore wonder whether there are optimizations I am missing. I have added
extra_compile_args=["-O3", "-march=native"]
to my setup.py, and I also tried changing the function definition to
cpdef int query(np.ndarray[double] q, np.ndarray[double, ndim=2] data):
but it made virtually no difference in performance.
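For reference, the build is driven by a minimal setup.py along these lines (a sketch; the ip.pyx filename is an assumption, and the np.ndarray variant above would additionally need cimport numpy as np plus NumPy's include directory):
from setuptools import setup, Extension
from Cython.Build import cythonize

# Build ip.pyx into the `ip` extension module with the compile flags mentioned above
extensions = [
    Extension(
        "ip",
        ["ip.pyx"],
        extra_compile_args=["-O3", "-march=native"],
    )
]

setup(ext_modules=cythonize(extensions))
I compile it with python setup.py build_ext --inplace before running the timings above.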