
I am new to vectorization... and this seems like a hairy problem to get working with numpy rather than for loops.

I have a set of training data and a list of queries. I need to calculate the distance between each query and every point in the training data, then sort to find the k nearest neighbors. I can implement this fine with for loops, but speed is important. Additionally, the training data is a longer list than the incoming queries, and each training row carries the value to predict as its last element... I will show:

 xtrain = [[0.5,0.3,0.1232141],...] #for a large number of items.

 xquery = [[0.1,0.2],[0.3,0.4],...] #for a small number of items. 

I need the Euclidean distance between each query point and each training point... so:

 def distance(p1, p2):
     sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
     return np.sqrt(sum_of_squares)

Then I need to sort the training points by distance, take the k nearest, and average their last values...

so basically, I need a function that uses xquery and xtrain to produce an array that looks like the following:

xdist = [[[distance, last_value], ... (k pairs)] for each query point]

The traditional for loops would look like:

def distance(p1, p2):
    sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
    return np.sqrt(sum_of_squares)

qX = data[train_rows:train_rows+5, 0:-1]  # query points: every column except the last
k = 4

# for each query point, sort every training point by distance and keep the k nearest (distance, last_value) pairs
k_nearest_neighbors = [np.array(sorted([(distance(qX[i], trainX[j]), trainX[j][-1])
                                        for j in range(len(trainX))],
                                       key=lambda t: t[0]))[:k]
                       for i in range(len(qX))]
predictions = [np.average([j[1] for j in i]) for i in k_nearest_neighbors]

I kept it compact in the k_nearest_neighbors step; I realize it isn't clear... but I think vectorizing from there is easier.

Anyhow, I have no idea how to do this with slices... it just seems like it should be possible...

Chris
  • It looks like you are really just looking to find things that are most similar? Would it be acceptable to use `distance^2` meaning you simply eliminate the `np.sqrt` line? If `dist(a-b) < dist(a-c)` then `dist^2 (a-b) < dist^2 (a-c)` should also hold true (since distances are positive). That could save some time. Another suggestion might be the 2-norm http://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy __usually__ linear algebra algorithms are optimized and __should__ run better than euc dist formula, but I don't know for sure. It's worth a shot – andrew Nov 16 '15 at 22:30
  • oh yeah, good idea, thanks man – Chris Nov 16 '15 at 22:36

1 Answer


It is definitely possible to do this via numpy broadcasting. It looks like this:

D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)  # (n_query, n_train) matrix of squared distances
ind = np.argpartition(D, k, axis=1)[:, :k]                   # indices of the k nearest training points per query
predictions = trainX[ind, 2].mean(1)                         # average the last column of those k neighbors
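Note that D holds squared distances; since squaring is monotonic for non-negative values, that is enough to pick the nearest neighbors without ever calling np.sqrt, and np.argpartition finds the k smallest entries per row without fully sorting. If you also want the [[distance, last_value], ...] array described in the question, one way to build it is sketched below (this uses np.take_along_axis, available in NumPy 1.15+, and assumes the D and ind arrays from above; argpartition does not order the k neighbors, so we sort them here):

k_sq = np.take_along_axis(D, ind, axis=1)          # squared distances of the k candidates, shape (n_query, k)
order = np.argsort(k_sq, axis=1)                   # sort the k candidates by distance
ind_sorted = np.take_along_axis(ind, order, axis=1)
xdist = np.dstack([np.sqrt(np.take_along_axis(k_sq, order, axis=1)),  # true Euclidean distances
                   trainX[ind_sorted, 2]])         # shape (n_query, k, 2): (distance, last_value) pairs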

To confirm that this works we can define functions which implement your for-loop method and my broadcasting method, and compare the results:

def with_for_loop(qX, trainX, k):
    def distance(p1,p2):
        sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
        return np.sqrt(sum_of_squares)

    k_nearest_neighbors = [np.array(sorted([(distance(qX[i],trainX[j]),trainX[j][-1])
                                            for j in range(len(trainX))],key=lambda t: t[0]))[:k]
                           for i in range(len(qX))]
    return [np.average([j[1] for j in i])
            for i in k_nearest_neighbors]

def with_broadcasting(qX, trainX, k):
    D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)
    ind = np.argpartition(D, k, axis=1)[:, :k]
    return trainX[ind, 2].mean(1)

# Test the results:
np.random.seed(0)
trainX = np.random.rand(100, 3)
qX = np.random.rand(50, 2)

np.allclose(with_for_loop(qX, trainX, 4),
            with_broadcasting(qX, trainX, 4))
# True

Keep in mind that as your data grows, it will be much more efficient to find the nearest neighbors using a tree-based method like scipy.spatial.cKDTree:

from scipy.spatial import cKDTree

def with_kd_tree(qX, trainX, k):
    # build a KD-tree on the two feature columns, then query the k nearest neighbors of each query point
    dist, ind = cKDTree(trainX[:, :2]).query(qX, k)
    return trainX[ind, 2].mean(1)

np.allclose(with_broadcasting(qX, trainX, 4),
            with_kd_tree(qX, trainX, 4))
# True
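As a side note, cKDTree.query returns the neighbor distances (sorted, nearest first) along with the indices, so the [[distance, last_value], ...] array from the question falls out directly. A quick sketch, reusing the same trainX and qX as above:

dist, ind = cKDTree(trainX[:, :2]).query(qX, 4)
xdist = np.dstack([dist, trainX[ind, 2]])   # shape (n_query, k, 2): (distance, last_value) pairs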

Timing the execution, we can see the substantial performance improvement of these methods with a larger data set:

np.random.seed(0)
trainX = np.random.rand(1000, 3)
qX = np.random.rand(1000, 2)

%timeit with_for_loop(qX, trainX, 4)
1 loops, best of 3: 7.16 s per loop

%timeit with_broadcasting(qX, trainX, 4)
10 loops, best of 3: 57.7 ms per loop

%timeit with_kd_tree(qX, trainX, 4)
1000 loops, best of 3: 1.61 ms per loop
jakevdp