
Consider this Python code, where I try to compute the Euclidean distance from a vector to every row of a matrix. It's very slow compared to the best Julia version I can find, which uses Tullio.jl.

The Python version takes about 30 s, but the Julia version takes only 75 ms.

I'm sure I'm not doing this the best way in Python. Are there faster solutions? Numba and NumPy solutions welcome.

import numpy as np

# generate
a = np.random.rand(4000000, 128)

b = np.random.rand(128)

print(a.shape)
print(b.shape)



def lin_norm_ever(a, b):
    return np.apply_along_axis(lambda x: np.linalg.norm(x - b), 1, a)


import time
t = time.time()
res = lin_norm_ever(a, b)
print(res.shape)
elapsed = time.time() - t
print(elapsed)
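For reference, a fully vectorized baseline (my addition, not part of the original question) that avoids the per-row Python callback inside `np.apply_along_axis`:

```python
import numpy as np

def dist_broadcast(mat, vec):
    # Broadcasting subtracts vec from every row of mat at once;
    # norm(..., axis=1) then reduces each row to a single distance.
    return np.linalg.norm(mat - vec, axis=1)
```

Note that `mat - vec` materializes a temporary array the same size as `mat` (roughly 4 GB for a 4,000,000 x 128 float64 matrix), so this trades memory for speed.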

The Julia version:

using Tullio
using BenchmarkTools  # provides @benchmark
function comp_tullio(a, c)
    dist = zeros(Float32, size(a, 2))
    @tullio dist[i] = (c[j] - a[j,i])^2

    dist
end
@time comp_tullio(a, c)

@benchmark comp_tullio(a, c) # 75ms on my computer
xiaodai
  • Additionally, if you use approach #5 from the linked Q&A's answer, use `np.sqrt` on the distances. With approach #2, `cdist(matrix, np.atleast_2d(search_vec)).ravel()`. – Divakar Aug 21 '20 at 06:15
  • Have you checked out the linked Q&A? Approach #5 and approach #2 answer the question. If not, what am I missing? Would love to re-open if you could justify that those don't work for you. – Divakar Aug 21 '20 at 07:08
  • On my (very new, 8 core) computer `comp_tullio` takes >400ms. How many threads are you running? – DNF Aug 21 '20 at 09:32
  • I've got 6 threads on an i7 from just over a year ago – xiaodai Aug 21 '20 at 09:32
  • I am on Julia 1.5 on Windows 10. – xiaodai Aug 21 '20 at 09:33
  • I changed `dist` from `Float32` to `Float64`, and pre-allocated with `undef` instead of zeros. The time dropped to 135ms. – DNF Aug 21 '20 at 09:34
  • Oops. Initializing with `undef` is not a good idea. But an ordinary threaded loop matches Tullio for me on this example. The tullio syntax is nicer, though. – DNF Aug 21 '20 at 09:46

1 Answer


I would use Numba in this example for best performance. I also added two approaches from Divakar's linked answer for comparison.

Code

import numpy as np
import numba as nb
from scipy.spatial.distance import cdist

@nb.njit(fastmath=True, parallel=True, cache=True)
def dist_1(mat, vec):
    # Parallel loop over rows; each thread accumulates one squared
    # distance and takes the square root at the end.
    res = np.empty(mat.shape[0], dtype=mat.dtype)
    for i in nb.prange(mat.shape[0]):
        acc = 0
        for j in range(mat.shape[1]):
            acc += (mat[i, j] - vec[j]) ** 2
        res[i] = np.sqrt(acc)
    return res

# from https://stackoverflow.com/a/52364284/4045774
def dist_2(mat, vec):
    return cdist(mat, np.atleast_2d(vec)).ravel()

# from https://stackoverflow.com/a/52364284/4045774
def dist_3(mat, vec):
    M = mat.dot(vec)
    d = np.einsum('ij,ij->i', mat, mat) + np.inner(vec, vec) - 2 * M
    # Clamp tiny negative values caused by floating-point cancellation
    # before taking the square root.
    return np.sqrt(np.maximum(d, 0))
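`dist_3` relies on the expansion `||x - v||^2 = ||x||^2 + ||v||^2 - 2 x.v`. A quick sanity check (my addition, not part of the original answer) that the expanded form matches a direct computation:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random((100, 8))
v = rng.random(8)

# Direct per-row Euclidean distance.
direct = np.linalg.norm(m - v, axis=1)

# Expanded form: row norms + vector norm - 2 * dot products.
expanded = np.sqrt(np.einsum('ij,ij->i', m, m) + np.inner(v, v) - 2 * m.dot(v))

assert np.allclose(direct, expanded)
```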

Timings

#Float64
a = np.random.rand(4000000, 128)
b = np.random.rand(128)
%timeit dist_1(a,b)
#122 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_2(a,b)
#484 ms ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_3(a,b)
#432 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Float32
a = np.random.rand(4000000, 128).astype(np.float32)
b = np.random.rand(128).astype(np.float32)
%timeit dist_1(a,b)
#68.6 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit dist_2(a,b)
#2.2 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#looks like there is a costly type-casting to float64
%timeit dist_3(a,b)
#228 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
max9111