2

For my study, I have to implement pairwise distance L1-distance calculation between two sets of vectors, each represented as a NumPy matrix (vectors are rows). This has to be done using two loops, one loop and no loops. I expected that since NumPy is so great with vectorization, algorithms must rank as two-loops slower than one-loop slower than no-loops.

I've written the functions:

def f_cdist_2(X1, X2):
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)

    for ix1 in range(X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            res[ix1, ix2] = np.abs(X1[ix1, :] - X2[ix2, :]).sum()

    return res


def f_cdist_1(X1, X2):
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)

    for ix1 in range(X1.shape[0]):
        res[ix1, :] = np.abs(np.tile(X1[ix1, :], (X2.shape[0], 1)) - X2).sum(axis=1)

    return res


def f_cdist_0(X1, X2):
    res = np.abs(
            np.tile(X1[:, :, np.newaxis], (1, 1, X2.shape[0])) - \
            np.tile(X2.T[np.newaxis, :, :], (X1.shape[0], 1, 1))
    ).sum(axis=1)

    return res

Then I tested the performance with two random matrices of shapes 128 x 512 and 256 x 512, based on 100 runs I've got the results:

  1. Two loops: 156 msec

  2. One loop: 32 msec

  3. No loops: 135 msec

I also tried cdist from scipy.spatial.distance, and got the best performance of all: 9 msec.

Now, is there a better way to implement no-loops function? I hoped it to perform at least as good as one-loop, but for now I'm at loss as to how to improve it.

UPDATE

Using kwinkunks's implementation of no-loops approach, I've re-run tests (again 100 trials) on matrices 1024 x 1024, results are below:

  1. Two loops: 5.7 sec

  2. One loop: 6.6 sec

  3. No loops: 3.9 sec

  4. scipy.spatial.distance.cdist: 0.6 sec

So on larger matrices, no-loops implementation indeed works better. scipy makes wonders, but if I understand correctly, it is written on C, hence such a great performance.

UPDATE

Tried with 4096 x 1024 matrices of np.float64, same setup:

  1. Two loops: 88 sec

  2. One loop: 66 sec

  3. No loops: Ran out of memory (had ~ 18 Gb of free RAM at the moment)

  4. scipy.spatial.distance.cdist: 13 sec

fanvacoolt
  • 121
  • 3
  • Interesting study! I'm curious if it the result will be any different if you use much bigger matrices. Have you tried with much bigger shapes? – Ricky Kim Apr 25 '19 at 18:08
  • That is a good insight. I think the memory of the broadcast of the large arrays is creating considerable overhead. This is a surprising result! – nick Apr 25 '19 at 18:20
  • Tried just now, updated post with results. No-loops now is the best of three. Will try with even larger shapes, but that's going to take significant time. I'll leave it overnight. – fanvacoolt Apr 25 '19 at 20:01
  • You can do this a lot faster than cdist. eg. https://stackoverflow.com/a/42994680/4045774 or https://stackoverflow.com/a/53380192/4045774 – max9111 Apr 26 '19 at 12:41
  • @max9111, a lot faster than `cdist` from `scipy`? Could you please say how? I browsed through questions you suggested, and implemented distance calculations with `np.einsum` like this: `np.einsum('ijk -> ij', np.abs(X1[:, None, :] - X2[None, :, :]))`, but it performs pretty much like no-loops implementation above. Is there a better way? – fanvacoolt Apr 26 '19 at 16:12
  • Sorry, the links were for euclidian distances. There you can make following simplification (a-b)^2 = a^2 + b^2 - 2ab. – max9111 Apr 29 '19 at 00:11

3 Answers3

4

You can get extra speedup from the vectorized version using Pythran

f_dist.py:

import numpy as np
#pythran export f_dist(float64[:,:], float64[:,:])
def f_dist(X1, X2):
    return np.sum(np.abs(X1[:, None, :] - X2), axis=-1)

On my laptop, the original version runs at:

> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
100 loops, best of 3: 7.05 msec per loop

Once you compile the kernel:

> pythran f_dist.py

You can benchmark it:

> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
1000 loops, best of 3: 1.21 msec per loop

Using SIMD instructions further speeds-up the computation:

> pythran f_dist.py -DUSE_XSIMD -march=native
> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
1000 loops, best of 3: 774 usec per loop

Disclaimer: I'm the core dev of the pythran project.

serge-sans-paille
  • 2,109
  • 11
  • 9
  • Could you also benchmark with #pythran export f_dist(float64[:,::1], float64[:,::1])? At least in Numba or Cython, declaring non-contigous arrays often hurts SIMD-vectorization. – max9111 Apr 29 '19 at 00:04
  • It triggers a compilation error :-) Thanks for the extra test case :-) *In theroy* non contiguous array would prevent vectorization. – serge-sans-paille Apr 29 '19 at 21:20
  • Bug fixed locally, and as expected, there's no SIMD processing of the main loop when using sliced array. – serge-sans-paille Apr 30 '19 at 16:25
0

You can avoid the tiling etc with NumPy's broadcasting:

def f_dist(X1, X2):
    return np.sum(np.abs(X1[:, None, :] - X2), axis=-1)

But, surprisingly (to me anyway) it's not faster than your loop (ca. 90 ms on my machine, compared to 24 ms for your f_cdist_1() function).

That broadcasting trick is often useful. It means you can do things like this:

>>> np.array([1,2,3]) * np.array([10, 20, 30])[:, None]
array([[10, 20, 30],
       [20, 40, 60],
       [30, 60, 90]])
Matt Hall
  • 7,614
  • 1
  • 23
  • 36
  • I knew there was a smarter way than using tiling! With your code no-loops takes 50 ms vs. 25 ms one-loop. Still slower, unfortunately, but much closer now. – fanvacoolt Apr 25 '19 at 18:20
  • I've a feeling it's slower because it has to allocate a large array of shape (128, 256, 512) before summing it. – Matt Hall Apr 25 '19 at 18:28
0

A solution using Numba

  • Parallelized (on very small samples eg. (24x24) the parallelized version will be slower due to the overhead of creating threads)
  • The inner loop is SIMD-vectorized

Exmaple

import numpy as np
import numba as nb

#Debug output for SIMD-vectorization
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
########################################

#Your solution
#You can also use Numba on this, but apart from parallization
#it is often better to write out the inner loop
def f_cdist(X1, X2):
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)

    for ix1 in range(X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            res[ix1, ix2] = np.abs(X1[ix1, :] - X2[ix2, :]).sum()

    return res

@nb.njit(fastmath=True,parallel=True)
def f_cdist_nb(X1, X2):
    #Some safety, becuase there is no bounds-checking
    assert X1.shape[1]==X2.shape[1]
    res = np.empty(shape=(X1.shape[0], X2.shape[0]), dtype=X1.dtype)

    for ix1 in nb.prange(X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            #Writing out the inner loop often leads to better performance
            sum=0.
            for i in range(X1.shape[1]):
                sum+=np.abs(X1[ix1, i] - X2[ix2, i])
            res[ix1, ix2] = sum

    return res

Perfomance

from scipy import spatial
#4096x1024    
X1=np.random.rand(4096,1024)
X2=np.random.rand(4096,1024)

res1=f_cdist_nb(X1,X2)
res2=f_cdist(X1,X2)
res3=spatial.distance.cdist(X1, X2, 'cityblock')

#Check the results
np.allclose(res1,res2)
True
np.allclose(res1,res3)
True

%timeit res1=f_cdist_nb(X1,X2)
1.38 s ± 64.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=f_cdist(X1,X2)
1min 25s ± 483 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
17.6 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#1024x1024
X1=np.random.rand(1024,1024)
X2=np.random.rand(1024,1024)

%timeit res1=f_cdist_nb(X1,X2)
63.5 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
1.09 s ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#512x512
X1=np.random.rand(512,512)
X2=np.random.rand(512,512)

%timeit res1=f_cdist_nb(X1,X2)
4.91 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
130 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit: A hand optimized Numba version

#Unroll and Jam loops
@nb.njit(fastmath=True,parallel=True)
def f_cdist_nb_3(X1, X2):
    assert X1.shape[1]==X2.shape[1]
    res = np.empty(shape=(X1.shape[0], X2.shape[0]), dtype=X1.dtype)

    for ix1 in nb.prange(X1.shape[0]//4):
        for ix2 in range(X2.shape[0]//4):
            sum_1,sum_2,sum_3,sum_4,sum_5,sum_6   =0.,0.,0.,0.,0.,0.
            sum_7,sum_8,sum_9,sum_10,sum_11,sum_12=0.,0.,0.,0.,0.,0.
            sum_13,sum_14,sum_15,sum_16=0.,0.,0.,0.

            for i in range(X1.shape[1]):
                sum_1+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+0, i])
                sum_2+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+1, i])
                sum_3+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+2, i])
                sum_4+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+3, i])
                sum_5+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+0, i])
                sum_6+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+1, i])
                sum_7+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+2, i])
                sum_8+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+3, i])
                sum_9+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+0, i])
                sum_10+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+1, i])
                sum_11+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+2, i])
                sum_12+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+3, i])
                sum_13+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+0, i])
                sum_14+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+1, i])
                sum_15+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+2, i])
                sum_16+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+3, i])

            res[ix1*4+0, ix2*4+0] = sum_1
            res[ix1*4+0, ix2*4+1] = sum_2
            res[ix1*4+0, ix2*4+2] = sum_3
            res[ix1*4+0, ix2*4+3] = sum_4
            res[ix1*4+1, ix2*4+0] = sum_5
            res[ix1*4+1, ix2*4+1] = sum_6
            res[ix1*4+1, ix2*4+2] = sum_7
            res[ix1*4+1, ix2*4+3] = sum_8
            res[ix1*4+2, ix2*4+0] = sum_9
            res[ix1*4+2, ix2*4+1] = sum_10
            res[ix1*4+2, ix2*4+2] = sum_11
            res[ix1*4+2, ix2*4+3] = sum_12
            res[ix1*4+3, ix2*4+0] = sum_13
            res[ix1*4+3, ix2*4+1] = sum_14
            res[ix1*4+3, ix2*4+2] = sum_15
            res[ix1*4+3, ix2*4+3] = sum_16

    #Rest of the loop
    for ix1 in range(X1.shape[0]//4*4,X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            sum_1=0.
            for i in range(X1.shape[1]):
                sum_1+=np.abs(X1[ix1, i] - X2[ix2, i])
            res[ix1, ix2] = sum_1

    for ix1 in range(X1.shape[0]):
        for ix2 in range(X2.shape[0]//4*4,X2.shape[0]):
            sum_1=0.
            for i in range(X1.shape[1]):
                sum_1+=np.abs(X1[ix1, i] - X2[ix2, i])
            res[ix1, ix2] = sum_1
    return res

Timings

#4096x1024    
X1=np.random.rand(4096,1024)
X2=np.random.rand(4096,1024)

res1=f_cdist_nb(X1,X2)
res2=f_cdist_nb_3(X1,X2)
res3=spatial.distance.cdist(X1, X2, 'cityblock')

#Check the results
print(np.allclose(res1,res2))
print(np.allclose(res1,res3))

%timeit res1=f_cdist_nb(X1,X2)
1.6 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=f_cdist_nb_3(X1,X2)
497 ms ± 50.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
17.7 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
max9111
  • 6,272
  • 1
  • 16
  • 33