
I modified the most efficient code from (Why is this numba code 6x slower than numpy code?) so that it can handle x1 being (n, m):

import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def euclidean_distance_square_numba_v5(x1, x2):
    res = np.empty((x1.shape[0], x2.shape[0]), dtype=x2.dtype)
    for a_idx in nb.prange(x1.shape[0]):
        for o_idx in range(x2.shape[0]):
            val = 0.
            for i_idx in range(x2.shape[1]):
                tmp = x1[a_idx, i_idx] - x2[o_idx, i_idx]
                val += tmp * tmp
            res[a_idx, o_idx] = val
    return res

However, it is still not more efficient than the faster numpy version:

def euclidean_distance_square_einsum(x1, x2):
    return np.einsum('ij,ij->i', x1, x1)[:, np.newaxis] + np.einsum('ij,ij->i', x2, x2) - 2*np.dot(x1, x2.T)
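The einsum version works because of the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, applied row-wise. A quick sanity check against a brute-force computation (the small random arrays and variable names here are mine, just for illustration):

```python
import numpy as np

def euclidean_distance_square_einsum(x1, x2):
    return (np.einsum('ij,ij->i', x1, x1)[:, np.newaxis]
            + np.einsum('ij,ij->i', x2, x2)
            - 2 * np.dot(x1, x2.T))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((5, 4))
x2 = rng.standard_normal((3, 4))

# brute force: broadcast the pairwise differences and sum the squares
expected = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(axis=-1)
assert np.allclose(euclidean_distance_square_einsum(x1, x2), expected)
```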

With input as

a = np.zeros((1000000,512), dtype=np.float32)
b = np.zeros((100, 512), dtype=np.float32)

The timing I got is 2.47 seconds for the numba code and 0.83 seconds for the numpy code.

  • Just to be clear: you call it as `euclidean_distance_square_numba_v5(a, b)`? or as `(b, a)`? – MSeifert Jun 04 '18 at 07:54
  • @MSeifert (b, a). Will this cause much difference? –  Jun 04 '18 at 08:03
  • It's not numpy's ufuncs doing the hard work here, but highly sophisticated BLAS routines (if the dtype is float32 or float64). If you throw integers at this function, things will look different: in that case numpy calculates the dot product itself, with quite a naive algorithm like the code above. Even if you optimize the code above (e.g. loop tiling) you wouldn't beat the BLAS algorithms. – max9111 Jun 05 '18 at 17:32
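max9111's point about the dtype deciding whether the product is dispatched to BLAS can be observed directly. A hedged sketch (array sizes, names, and the timing harness are mine; the measured ratio will vary by machine and BLAS build):

```python
import numpy as np
import timeit

rng = np.random.default_rng(0)
# same values, two dtypes: float32 goes through BLAS, int64 does not
a_f = rng.integers(0, 10, size=(500, 512)).astype(np.float32)
b_f = rng.integers(0, 10, size=(100, 512)).astype(np.float32)
a_i = a_f.astype(np.int64)
b_i = b_f.astype(np.int64)

t_float = timeit.timeit(lambda: a_f @ b_f.T, number=20)  # BLAS sgemm path
t_int = timeit.timeit(lambda: a_i @ b_i.T, number=20)    # numpy's own loop
print(f"float32: {t_float:.4f}s  int64: {t_int:.4f}s")
```

On a typical BLAS-backed installation the integer product is markedly slower, even though both compute the same numbers.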

1 Answer


Yes, that is expected.

The first thing you must be aware of: the dot product is the workhorse of the numpy version. Here it is for slightly smaller arrays:

>>> def only_dot(x1, x2):
        return - 2*np.dot(x1, x2.T)

>>> a = np.zeros((1000,512), dtype=np.float32)
>>> b = np.zeros((100, 512), dtype=np.float32)

>>> %timeit(euclidean_distance_square_einsum(a,b))
6.08 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit(only_dot(a,b))
5.25 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

i.e. about 85% of the time is spent in it.

When you look at your numba code, it looks like a somewhat strange, more complicated version of matrix-matrix multiplication - one can see, for example, the same three loops.

So basically, you are trying to beat one of the best-optimized algorithms out there. Here, for example, is somebody trying to do it and failing. My installation uses Intel's MKL version, which must be more sophisticated than the default implementation, which can be found here.
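For reference, the "naive algorithm" the comments mention is essentially the same three loops as the numba kernel, just computing a plain matrix product. A toy sketch (function name and test sizes are mine; it is correct but, at realistic sizes, far slower than BLAS):

```python
import numpy as np

def naive_matmul(x1, x2t):
    # the same triple loop as the numba kernel, computing x1 @ x2t
    n, k = x1.shape
    m = x2t.shape[1]
    out = np.zeros((n, m), dtype=x1.dtype)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += x1[i, p] * x2t[p, j]
            out[i, j] = acc
    return out

rng = np.random.default_rng(1)
a = rng.standard_normal((8, 6))
b = rng.standard_normal((6, 5))
assert np.allclose(naive_matmul(a, b), a @ b)
```

BLAS implementations beat this not by a different asymptotic algorithm but through cache blocking, vectorization, and careful register use - which is why loop tiling alone does not close the gap.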

Sometimes, after all the fun one has had, one has to admit that one's own "reinvented wheel" is not as good as the state-of-the-art wheel... but only then can one truly appreciate its performance.

ead
  • This question is not about reinventing the wheel. It is about whether numba code can beat numpy, as numba claims to speed up code. –  Jun 05 '18 at 00:28
  • And there is nothing fancy in LAPACK either, other than that it is Fortran –  Jun 05 '18 at 00:36
  • @user2675516 Please don't take offense at the somewhat flippant "reinventing the wheel". The numpy version reuses matrix-matrix multiplication, one of the best-optimized algorithms out there - what a beauty! At the beginning it was not clear to me that this is basically what the calculation was about... – ead Jun 05 '18 at 03:29
  • @user2675516 You are right about the LAPACK code; I didn't find the Intel version my installation is using and just took the LAPACK version without reading it through. – ead Jun 05 '18 at 03:36