
I have a NumPy array `X` of size 1000 x 1000 where each element is a real number. For every element in each row, I want to find the 5 elements in that row closest to it, where the distance metric is simply abs(x-y). I have tried:

from sklearn.neighbors import NearestNeighbors

for i in range(X.shape[0]):
    knn = NearestNeighbors(n_neighbors=5)
    knn.fit(X[i])
    for j in range(X.shape[1]):
        d = knn.kneighbors(X[i, j], return_distance=False)

However, this does not work for me, and I am not sure how efficient it would be anyway. Is there a way around this? I have seen a lot of methods for comparing vectors, but none for comparing single elements. I know I could loop over every element and find the k smallest differences, but this would be computationally expensive. Could a k-d tree work for this? I have tried a method similar to

Finding index of nearest point in numpy arrays of x and y coordinates

However, I cannot get this to work. Is there some numpy function I don't know about that could accomplish this?

Mike El Jackson
  • What do you mean by 'closest'? Closest by value? And what is a 'point'? – DYZ Mar 16 '17 at 05:05
  • so say row r = [1,10,11,15,18,16,19,18]. For each element in r I want to find the k other elements in r that have values closest to the element we are looking at. So we look at 1 and find the k closest numbers to it. Then we look at 10 and find the k closest numbers to it then .... then 18 and find the k closest numbers to it. – Mike El Jackson Mar 16 '17 at 05:07
  • Yes it is a prediction matrix so rows are people and columns are items – Mike El Jackson Mar 16 '17 at 05:10
  • So, for each row you want to get a 1000x5 array as the result? – DYZ Mar 16 '17 at 05:14
  • @MikeElJackson Consider your `r` above, and I want to find k=2 for 18, the last value. It will find [18, 19]. Are the values enough, or do you need to know the positions as well (i.e. [4, 6], for r[4] == 18, r[6] == 19)? – kennytm Mar 16 '17 at 05:14
  • Since your metric is so simple can't you just `sort` (or `argsort`) your rows? This immediately reduces the number of candidate nearest neighbours to 2k per point where k is the 5 of your example. – Paul Panzer Mar 16 '17 at 05:30

3 Answers


Construct a k-d tree with scipy.spatial.cKDTree for each row of your data.

import numpy as np
import scipy.spatial


def nearest_neighbors(arr, k):
    # query for the 2nd through (k+1)-th neighbors; the 1st is always the point itself
    k_lst = list(range(2, k + 2))  # e.g. [2, 3, 4, 5, 6] for k = 5
    neighbors = []

    for row in arr:
        # stack the data so each element is in its own row
        data = np.vstack(row)
        # construct a k-d tree over the row's values
        tree = scipy.spatial.cKDTree(data)
        # find the k nearest neighbors of each element, skipping the trivial
        # zero-distance self-match
        dd, ii = tree.query(data, k=k_lst)
        # apply an index filter on data to get the nearest neighbor elements
        closest = data[ii].reshape(-1, k)
        neighbors.append(closest)
    return np.stack(neighbors)


N = 1000
k = 5
A = np.random.random((N, N))
nearest_neighbors(A, k)
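
As a quick usage sketch (assuming the function above as written), you can check it on the small row from the comments; note that the result holds neighbor values rather than positions, with shape (n_rows, n_cols, k):

r = np.array([[1, 10, 11, 15, 18, 16, 19, 18]], dtype=float)
print(nearest_neighbors(r, 2).shape)    # (1, 8, 2)
print(nearest_neighbors(r, 2)[0, 4])    # neighbors of the 18 at r[0, 4]: expected [18. 19.]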
Crispin
  • Good job, just beat me to it. I'll just add that this is ~6-7 times faster compared to a method that does `np.argpartition` on the vector of differences between the element and the row in a loop (~3 seconds versus ~18 seconds). I think later versions of scipy have an `n_jobs` parameter for the `tree.query` function for parallel processing across CPU cores. My version doesn't have that parameter, but that may give a performance boost as well. – Lucas Currah Mar 16 '17 at 08:12
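
For context, here is a minimal sketch of the slower baseline described in the comment above (a Python loop doing np.argpartition on each element's absolute differences to its row); the function name and output layout are illustrative assumptions, not code from that benchmark:

import numpy as np

def nearest_neighbors_loop(arr, k):
    # brute force: for each element, partial-sort its |differences| to the whole row
    n_rows, n_cols = arr.shape
    out = np.empty((n_rows, n_cols, k))
    for i in range(n_rows):
        row = arr[i]
        for j in range(n_cols):
            d = np.abs(row - row[j])
            # the k+1 smallest distances include the zero self-distance
            idx = np.argpartition(d, k + 1)[:k + 1]
            idx = idx[idx != j][:k]   # drop the element itself (neighbors are unordered)
            out[i, j] = row[idx]
    return out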

I'm not really sure what form you want the final results in, but this definitely gets you what you need.

np.random.seed([3,1415])
X = np.random.rand(1000, 1000)

Grab upper triangle indices to track every combination of points per row

x1, x2 = np.triu_indices(X.shape[1], 1)

Generate an array of all pairwise distances within each row

d = np.abs(X[:, x1] - X[:, x2])

Find the positions of the 5 closest pairs for every row

tpos = np.argpartition(d, 5)[:, :5]

Then x1[tpos] gives the row-wise positions of the first point in each of the 5 closest pairs, while x2[tpos] gives the positions of the second point in each pair.
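
To also pull out the values of those closest pairs, a short usage sketch building on the arrays above (the `rows` selector is just an illustrative helper name):

rows = np.arange(X.shape[0])[:, None]             # broadcastable row selector, shape (1000, 1)
pair_vals = np.stack((X[rows, x1[tpos]],          # values of the first point in each pair
                      X[rows, x2[tpos]]), axis=-1)
# pair_vals has shape (1000, 5, 2): for each row, the 5 closest pairs as value pairs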

piRSquared

Here is an argsorting solution that strives to take advantage of the simple metric:

import numpy as np
from numpy.lib import stride_tricks


def nn(A, k):
    # argsort each row, then pad both ends with far-away sentinel indices so
    # every sorted position has a full window of 2*k + 1 candidates
    out = np.zeros((A.shape[0], A.shape[1] + 2*k), dtype=int)
    out[:, k:-k] = np.argsort(A, axis=-1)
    out[:, :k] = out[:, -k-1, None]
    out[:, -k:] = out[:, k, None]
    # sliding windows over the padded index array: the 2*k + 1 candidate
    # neighbors (in sorted order) around each element
    strd = stride_tricks.as_strided(
        out, strides=out.strides + (out.strides[-1],), shape=A.shape + (2*k+1,))
    # absolute differences between each candidate and its window's center value
    delta = A[np.arange(A.shape[0])[:, None, None], strd]
    delta -= delta[..., k, None]
    delta = np.abs(delta)
    # partition so the zero self-distance lands at slot 0, then keep the k next smallest
    s = np.argpartition(delta, (0, k), axis=-1)[..., 1:k+1]
    # translate window slots back to original column indices and scatter them
    # into result rows ordered by the original (unsorted) column positions
    inds = tuple(np.ogrid[:strd.shape[0], :strd.shape[1], :0][:2])
    res = np.empty(A.shape + (k,), dtype=int)
    res[np.arange(strd.shape[0])[:, None, None], out[:, k:-k, None],
        np.arange(k)[None, None, :]] = strd[inds + (s,)]
    return res

from timeit import timeit

N = 1000
k = 5
r = 10

A = np.random.random((N, N))
res = nn(A, k)
# crude test: mean |difference| between each element and its reported neighbors
print(np.abs(A[np.arange(N)[:, None, None], res] - A[..., None]).mean())
# timings
print(timeit(lambda: nn(A, k), number=r) / r)

Output:

# 0.00150537172454
# 0.4567880852999224
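
Since nn returns column indices rather than values, the same fancy indexing used in the crude test recovers the neighbor values if that is what you need:

# values of the k nearest neighbors of every element, shape (N, N, k)
neighbor_vals = A[np.arange(N)[:, None, None], res]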
Paul Panzer