I have the following Cython code (compiled into a module ip) that finds the row of data with the largest inner product with a query vector q:
#cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True, language_level=3
cpdef int query(double[::1] q, double[:,::1] data) nogil:
    cdef:
        int n = data.shape[0]
        int dim = data.shape[1]
        int best_i = -1
        double best_ip = -1
        double ip
        int i, j
    for i in range(n):
        ip = 0
        for j in range(dim):
            ip += q[j] * data[i, j]
        if ip > best_ip:
            best_i = i
            best_ip = ip
    return best_i

After compiling, I time the code from Python:

import numpy as np
import ip
n, dim = 10**6, 10**2
X = np.random.randn(n, dim)
q = np.random.randn(dim)
%timeit ip.query(q, X)

This takes roughly 100ms. Meanwhile, the equivalent NumPy code:

%timeit np.argmax(q @ X.T)

takes just around 50ms.

This is odd, since the NumPy code seemingly has to allocate the big array q @ X.T before taking the argmax. So I wonder whether there are some optimizations I am missing?

I have added extra_compile_args=["-O3", "-march=native"] to my setup.py, and I also tried changing the function definition to

cpdef int query(np.ndarray[double] q, np.ndarray[double, ndim=2] data):

but it made virtually no difference in performance.
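
For reference, my setup.py looks roughly like this (a sketch; the source file name ip.pyx is implied by the import above):

from setuptools import setup, Extension
from Cython.Build import cythonize

# Build the ip module with the extra compiler flags mentioned above.
ext = Extension(
    "ip",
    sources=["ip.pyx"],
    extra_compile_args=["-O3", "-march=native"],
)
setup(ext_modules=cythonize(ext))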

Thomas Ahle
    Numpy probably parallelizes some of it – DavidW Jan 14 '21 at 17:46
  • @DavidW I don't think numpy uses (thread) parallelization. It certainly only uses one core when I time it. It might use some vectorization/bit parallelism, but I feel the C compiler should do that automatically in my code too? – Thomas Ahle Jan 14 '21 at 18:31

1 Answer


The operation q @ X.T will be mapped to an implementation of matrix-vector multiplication (dgemv) from either OpenBLAS or MKL (depending on your distribution) under the hood - that means you are up against one of the best-optimized routines out there.
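
Which of the two your NumPy build is linked against can be checked, for example, with:

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries NumPy was built against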

The resulting vector has 1M elements, i.e. about 8MB of memory. 8MB will not always fit into the L3 cache, but even RAM has a bandwidth of about 15GB/s, so writing and reading 8MB takes roughly 2 * 8MB / 15GB/s ≈ 1ms, at most 1-2ms - so avoiding the temporary array would not gain much compared to the overall running time of about 50ms.

The most obvious issue with your code is that it doesn't perform the computation the same way as q @ X.T does. It evaluates each inner product strictly from left to right:

((q[0]*data[i,0]+q[1]*data[i,1])+q[2]*data[i,2])+...

Because of IEEE 754, the compiler is not allowed to reorder these additions and has to execute them in exactly this non-optimal order: each addition has to wait until the previous one has finished. This long dependency chain doesn't use the full potential of modern architectures.

A good dgemv implementation will choose a much better order of operations. A similar issue, but with plain sums, is discussed in this SO post.
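
To illustrate what a better order buys, one can break the dependency chain by hand with several independent partial sums (a sketch of the inner loop only; the unroll factor of 4 is an arbitrary choice, and s0..s3 would be declared as doubles alongside the other cdef locals):

# Replaces the inner `for j` loop of query(); the four partial sums
# are independent, so their additions can overlap in the pipeline.
s0 = s1 = s2 = s3 = 0
for j in range(0, dim - 3, 4):
    s0 += q[j]     * data[i, j]
    s1 += q[j + 1] * data[i, j + 1]
    s2 += q[j + 2] * data[i, j + 2]
    s3 += q[j + 3] * data[i, j + 3]
for j in range(dim - dim % 4, dim):   # remainder if dim is not a multiple of 4
    s0 += q[j] * data[i, j]
ip = (s0 + s1) + (s2 + s3)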

To level the field without rewriting the loop by hand, one could use -ffast-math, which allows the compiler to reorder the operations and thus make better use of the pipelines.
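
For example, one could build a second copy of the extension with the flag added (a sketch; the module name ip_fast, the copied source ip_fast.pyx, and importing its query as query_ffastmath are my naming choices, not from the question):

from setuptools import setup, Extension
from Cython.Build import cythonize

# Same code, compiled once more with -ffast-math on top of -O3/-march=native.
ext = Extension(
    "ip_fast",
    sources=["ip_fast.pyx"],   # a copy of ip.pyx
    extra_compile_args=["-O3", "-march=native", "-ffast-math"],
)
setup(ext_modules=cythonize(ext))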

Here are results on my machine for your benchmark:

%timeit query(q, X)            # 101 ms
%timeit query_ffastmath(q, X)  # 56.3 ms
%timeit np.argmax(q @ X.T)     # 50.2 ms

It is still about 10% slower, but I would be really surprised if the compiler could beat a hand-crafted version created by experts specifically for my processor.

ead
  • Interesting! But adding `-ffast-math` doesn't change anything for me. Also, shouldn't it be part of `-O3`? Maybe I'm not adding the arguments right... – Thomas Ahle Jan 14 '21 at 21:07
  • @ThomasAhle -ffast-math is not part of `-O3`, because even at `-O3` IEEE 754 must not be violated; -ffast-math also has some other effects that aren't part of `-O3`. – ead Jan 14 '21 at 21:12
  • I think 12% difference is pretty good! Thanks! – Thomas Ahle Jan 14 '21 at 21:27