I have a numpy array that has 10,000 vectors with 3,000 elements in each. I want to return the top 10 indices of the closest pairs with the distance between them. So if row 5 and row 7 have the closest euclidean distance of 0.005, and row 8 and row 10 have the second closest euclidean distance of 0.0052 then I want to return [(8,10,.0052),(5,7,.005)]. The traditional for loop method is very slow. Is there an alternative quicker approach for a way to get euclidean neighbors of large features vectors (stored as np array)?

I'm doing the following:

l = []
for i in range(M.shape[0]):
    for j in range(i):  # j < i covers each unordered pair exactly once
        l.append((i, j, euc(M[i], M[j])))

Here euc is a function that computes the Euclidean distance between two row vectors of the matrix using scipy. I then sort l and pull out the 10 smallest distances.

Mike El Jackson
  • Did you see [this](http://stackoverflow.com/questions/22720864/efficiently-calculating-a-euclidean-distance-matrix-using-numpy) and [this](http://stackoverflow.com/questions/22390418/pairwise-displacement-vectors-among-set-of-points)? – Paul Panzer Jan 30 '17 at 03:01
  • Possible duplicate of [How can the euclidean distance be calculated with numpy?](http://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) – DYZ Jan 30 '17 at 03:11
  • I know how to calculate the euclidean distance and have already done so, but am looking for the fastest way to compute it between every pair of rows in an np array, then sort and return the top 10 – Mike El Jackson Jan 30 '17 at 03:13
  • This is more of a speed question than how to calculate the euclidean distance. I know I can use loops or scipy – Mike El Jackson Jan 30 '17 at 03:21
  • Show your *traditional loop method* - maybe it can be improved. – wwii Jan 30 '17 at 04:20
  • I wrote it out above – Mike El Jackson Jan 30 '17 at 04:24

2 Answers

def topTen(M):
    # All pairs (i, j) with i < j
    i, j = np.triu_indices(M.shape[0], 1)
    # Squared distance of every pair, avoiding an intermediate sqrt
    dist_sq = np.einsum('ij,ij->i', M[i] - M[j], M[i] - M[j])
    # Indices of the 10 smallest squared distances, unordered
    min_i = np.argpartition(dist_sq, 10)[:10]
    # Order those 10 from closest to farthest
    min_o = np.argsort(dist_sq[min_i])
    return np.vstack((i[min_i][min_o], j[min_i][min_o], dist_sq[min_i][min_o]**.5)).T

This should be pretty fast: the sort and the square root (the expensive steps, aside from the looping itself) are only applied to the top 10 candidates, and everything else is vectorized.
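Using the small array from the comments, a parameterized variant makes it easier to see how to change the number of returned pairs and how to read the output as (i, j, distance) tuples. (`top_k_pairs` and its `k` argument are my own sketch of the same idea, not part of the original answer; it also guards against asking for more pairs than exist.)

```python
import numpy as np

def top_k_pairs(M, k=10):
    """Return the k closest (i, j, distance) row pairs of M, closest first."""
    i, j = np.triu_indices(M.shape[0], 1)          # all pairs with i < j
    d2 = np.einsum('ij,ij->i', M[i] - M[j], M[i] - M[j])
    k = min(k, d2.size)                            # can't return more pairs than exist
    part = np.argpartition(d2, k - 1)[:k]          # k smallest, unordered
    order = np.argsort(d2[part])                   # sort just those k
    sel = part[order]
    return [(int(i[p]), int(j[p]), float(np.sqrt(d2[p]))) for p in sel]

M = np.array([[1, 2, 3], [2, 3, 4], [1, 6, 8], [1, 6, 9], [2, 3, 5]])
print(top_k_pairs(M, 3))
```

For this M, rows 1 and 4 and rows 2 and 3 are each at distance 1.0, and rows 0 and 1 at distance √3 ≈ 1.732; each result row is one pair, closest first.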

Daniel F
  • I'm not really understanding the output of this but it is fast – Mike El Jackson Jan 30 '17 at 16:28
  • Say I had, M = np.array([[1,2,3],[2,3,4],[1,6,8],[1,6,9],[2,3,5]]). How would I interpret these results say if I wanted to change it it to top 8 or top 3 or such? – Mike El Jackson Jan 30 '17 at 16:34
  • OP wants top ten of the *closest*, i.e. the ten smallest distances. – wwii Jan 31 '17 at 02:59
  • @MikeElJackson - the *```-10```* values in the ```argpartition``` call determine how many are returned. – wwii Jan 31 '17 at 03:07
  • right, but the matrix result is confusing me what is what in it? – Mike El Jackson Jan 31 '17 at 06:46
  • doh. Right,@wwii I think this is the 10 furthest away right now. I'll change it. ++also the `.T` to make the output match. – Daniel F Jan 31 '17 at 07:21
  • aaand it wasn't in order. Fixed that too! – Daniel F Jan 31 '17 at 07:37
  • @MikeElJackson - before author added the ```.T``` , ```result[0]``` and ```result[1]``` were the indices of the two vectors being compared, and ```result[2]``` were the actual distances. So ```result[:, 0]``` would be the indices of the two vectors and their distance. Now I imagine that ```result[0, :]``` gives you the two vector indices and their distance - based on @DanielForsman's remarks that is probably the info for the closest pair. – wwii Jan 31 '17 at 19:59

I'll post this as an answer, but I admit it is not a real solution to the question, because it only works for smaller arrays. The problem is that to be really fast and avoid loops you need to compute all the pairwise distances at once, and that implies a memory cost on the order of the square of the input. Here that means 10,000 rows × 10,000 rows × 3,000 elems/row × 4 bytes/elem (say we're using float32) ≈ 1 TB (!) of memory for the tiled intermediates (and maybe twice that, since you probably need a couple of arrays that size). So while it is possible, it is not practical at this kind of size. The following code shows how you could implement it (with sizes divided by 100).

import numpy as np

# Row length
n = 30
# Number of rows
m = 100
# Number of top elements
k = 10

# Input data
data = np.random.random((m, n))
# Tile the data in two different dimensions
data1 = np.tile(data[:, :, np.newaxis], (1, 1, m))
data2 = np.tile(data.T[np.newaxis, :, :], (m, 1, 1))
# Compute pairwise squared distances
dist = np.sum(np.square(data1 - data2), axis=1)
# Fill lower half with inf to avoid repeat and self-matching
dist[np.tril_indices(m)] = np.inf
# Find smallest distance for each row
i = np.arange(m)
j = np.argmin(dist, axis=1)
dmin = dist[i, j]
# Pick the top K smallest distances
idx = np.stack((i, j), axis=1)
isort = dmin.argsort()

# Top K indices pairs (K x 2 matrix)
top_idx = idx[isort[:k], :]
# Top K smallest distances
top_dist = np.sqrt(dmin[isort[:k]])
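To keep the same all-at-once vectorization within a memory budget at the original scale, the distance matrix can be processed in blocks of rows, using the expansion ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b so each block only materializes `chunk × m` distances at a time. (`top_k_chunked` and its `chunk` parameter are my own sketch, not part of either answer.)

```python
import numpy as np

def top_k_chunked(data, k=10, chunk=512):
    """k closest (i, j, distance) row pairs, computing the distance
    matrix in blocks of `chunk` rows to bound peak memory."""
    m = data.shape[0]
    sq = np.einsum('ij,ij->i', data, data)   # per-row squared norms
    best = []                                # candidate (d2, i, j) triples
    for start in range(0, m, chunk):
        end = min(start + chunk, m)
        block = data[start:end]
        # ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b, one block of rows at a time
        d2 = sq[start:end, None] + sq[None, :] - 2.0 * block @ data.T
        # keep only pairs with global i < j (no repeats, no self-matching)
        ii = np.arange(start, end)[:, None]
        jj = np.arange(m)[None, :]
        d2[jj <= ii] = np.inf
        # k smallest candidates within this block
        flat = np.argpartition(d2.ravel(), min(k, d2.size - 1))[:k]
        for f in flat:
            r, c = divmod(int(f), m)
            if np.isfinite(d2[r, c]):
                best.append((d2[r, c], start + r, c))
    best.sort()                              # exact top k across all blocks
    return [(int(i), int(j), float(np.sqrt(max(d, 0.0))))
            for d, i, j in best[:k]]

data = np.random.random((100, 30))
print(top_k_chunked(data, k=10, chunk=32))
```

The `max(d, 0.0)` clamp guards against tiny negative values from floating-point cancellation in the expansion; peak memory is roughly `chunk × m` floats per block instead of the full tiled cube.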
jdehesa