I have two large sets of 2D points and I need to calculate a distance matrix.
I need it to be fast and in Python, so naturally I used NumPy. I recently learned about NumPy broadcasting and used it, so that rather than looping in Python, NumPy does the work in C.
I really thought broadcasting was all I needed, until I saw other methods outperform vanilla broadcasting. I have two ways of calculating the distance matrix, and I don't understand why one is better than the other.
I looked at https://github.com/numpy/numpy/issues/14761 and my results seem contradictory.
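(For context, the broadcasting trick in question: with obs of shape (N, 2) and interp of shape (M, 2), inserting an axis makes the subtraction broadcast to an (N, M, 2) array of pairwise differences. A toy sketch:)

import numpy as np

obs = np.random.random((5, 2))      # shape (N, 2)
interp = np.random.random((3, 2))   # shape (M, 2)
d = obs[:, np.newaxis, :] - interp  # (5, 1, 2) - (3, 2) broadcasts to (5, 3, 2)
print(d.shape)                      # (5, 3, 2)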
Below are two ways to calculate the distance matrix.
Cells 3-5 and Cells 8-9 both calculate the distance matrix, but Cells 3 and 4, which use subtract.outer, are far faster than Cell 8, which uses vanilla broadcasting, and Cell 5, which uses hypot, is far faster than Cell 9, which does it the simple way. I did not try a pure Python loop, assuming it would never finish.
I want to know:
1. Is there a faster way to calculate the distance matrix (maybe scikit-learn or SciPy)? See the sketch after this list.
2. Why are hypot and subtract.outer so much faster?
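(For concreteness, question 1 refers to library routes like these; cdist and pairwise_distances are real APIs, but I have not benchmarked them here:)

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances

obs = np.random.random((1000, 2))
interp = np.random.random((800, 2))

h_scipy = cdist(obs, interp)                # Euclidean metric by default
h_sklearn = pairwise_distances(obs, interp) # also Euclidean by default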
I have also attached a snippet to run the whole thing for convenience, and I change the seed on each run to prevent cache reuse.
### Cell 1
import numpy as np
np.random.seed(858442)
### Cell 2
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))
CPU times: user 2.02 ms, sys: 1.4 ms, total: 3.42 ms
Wall time: 1.84 ms
### Cell 3
%%time
d0 = np.subtract.outer(obs[:,0], interp[:,0])
CPU times: user 2.46 s, sys: 1.97 s, total: 4.42 s
Wall time: 4.42 s
### Cell 4
%%time
d1 = np.subtract.outer(obs[:,1], interp[:,1])
CPU times: user 3.1 s, sys: 2.7 s, total: 5.8 s
Wall time: 8.34 s
### Cell 5
%%time
h = np.hypot(d0, d1)
CPU times: user 12.7 s, sys: 24.6 s, total: 37.3 s
Wall time: 1min 6s
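(Cells 3-5 combined are equivalent to this single expression; same computation, just written in one place:)

h = np.hypot(
    np.subtract.outer(obs[:, 0], interp[:, 0]),
    np.subtract.outer(obs[:, 1], interp[:, 1])
)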
### Cell 6
np.random.seed(773228)
### Cell 7
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))
CPU times: user 1.84 ms, sys: 1.56 ms, total: 3.4 ms
Wall time: 2.03 ms
### Cell 8
%%time
d = obs[:, np.newaxis, :] - interp
d0, d1 = d[:, :, 0], d[:, :, 1]
CPU times: user 22.7 s, sys: 8.24 s, total: 30.9 s
Wall time: 33.2 s
### Cell 9
%%time
h = np.sqrt(d0**2 + d1**2)
CPU times: user 29.1 s, sys: 2min 12s, total: 2min 41s
Wall time: 6min 10s
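(A quick sanity check, separate from the timings above, that the two pipelines really compute the same matrix; small shapes chosen here just so it runs fast:)

import numpy as np

obs_s = np.random.random((100, 2))
interp_s = np.random.random((80, 2))

# pipeline 1: subtract.outer + hypot
a = np.hypot(np.subtract.outer(obs_s[:, 0], interp_s[:, 0]),
             np.subtract.outer(obs_s[:, 1], interp_s[:, 1]))

# pipeline 2: broadcasting + sqrt of summed squares
diff = obs_s[:, np.newaxis, :] - interp_s
b = np.sqrt(diff[:, :, 0]**2 + diff[:, :, 1]**2)

assert np.allclose(a, b)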
Update, thanks to Jérôme Richard:
- Stack Overflow never disappoints.
- There is a faster way, using Numba.
- Numba has a just-in-time compiler that converts a Python snippet into fast machine code. The first call is a little slower than subsequent calls because of compilation, but even on the first call, parallel njit beats hypot + subtract.outer by a 9x margin for a (49000, 12000) matrix.

Performance of the various methods (script below):
- Make sure to use a different seed each time you run the script.
import sys
import time

import numba as nb
import numpy as np

np.random.seed(int(sys.argv[1]))
d0 = np.random.random((49000, 2))
d1 = np.random.random((12000, 2))

# Eager compilation: passing the signature compiles ahead of the first call
@nb.njit((nb.float64[:, :], nb.float64[:, :]))
def f1(d0, d1):
    print('Numba without parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):  # prange behaves like range without parallel=True
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

@nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
def f2(d0, d1):
    print('Numba with parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):  # outer loop is distributed across threads
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

def f3(d0, d1):
    print('hypot + subtract.outer')
    return np.hypot(
        np.subtract.outer(d0[:, 0], d1[:, 0]),
        np.subtract.outer(d0[:, 1], d1[:, 1])
    )

if __name__ == '__main__':
    s1 = time.time()
    eval(f'{sys.argv[2]}(d0, d1)')  # pick the function to benchmark from the command line
    print(time.time() - s1)
(base) xx@xx:~/xx$ python3 test.py 523432 f3
hypot + subtract.outer
9.79756784439087
(base) xx@xx:~/xx$ python3 test.py 213622 f2
Numba with parallel
0.3393140316009521
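(One more check worth bolting onto the script above, verifying that the parallel Numba kernel and the NumPy pipeline agree; check is my name for it, not part of the original:)

def check(d0, d1):
    # compare the parallel Numba kernel against hypot + subtract.outer
    assert np.allclose(f2(d0, d1), f3(d0, d1))
    print('results match')

# run as: python3 test.py 99183 check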