
I am new to vectorization... and this seems like a hairy problem to get working with numpy rather than for loops.

I have a set of training data and a list of queries. I need to calculate the distance between each query and every point in the training data, then sort to find the k nearest neighbors. I can implement this fine with for loops, but speed is important. Additionally, the training data is a longer list than the incoming queries, and each training row carries the value to predict as its last element... I will show:

 xtrain = [[0.5,0.3,0.1232141],...] #for a large number of items.

 xquery = [[0.1,0.2],[0.3,0.4],...] #for a small number of items. 

I need the Euclidean distance between each query point and each training point... so:

 def distance(p1, p2):
     sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
     return np.sqrt(sum_of_squares)

Then I need to sort the training points by distance, take the k nearest, and average their last values...

so basically, I need a function that uses xquery and xtrain to produce an array that looks like the following:

xdist = [[[distance, last_value], ... (k pairs)] for each query point]

The traditional for loops would look like:

def distance(p1, p2):
    sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
    return np.sqrt(sum_of_squares)

qX = data[train_rows:train_rows+5, 0:-1]  # query points: every column except the last
k = 4

# for each query point, sort every training point by distance and keep the k nearest (distance, last_value) pairs
k_nearest_neighbors = [np.array(sorted([(distance(qX[i], trainX[j]), trainX[j][-1])
                                        for j in range(len(trainX))],
                                       key=lambda t: t[0]))[:k]
                       for i in range(len(qX))]
predictions = [np.average([j[1] for j in i]) for i in k_nearest_neighbors]

I kept it compact in the k_nearest_neighbors step; I realize it isn't clear... but I think vectorizing from there is easier.

Anyhow, I have no idea how to do this with slices... it just seems like it should be possible...

Chris
  • It looks like you are really just looking to find things that are most similar? Would it be acceptable to use `distance^2` meaning you simply eliminate the `np.sqrt` line? If `dist(a-b) < dist(a-c)` then `dist^2 (a-b) < dist^2 (a-c)` should also hold true (since distances are positive). That could save some time. Another suggestion might be the 2-norm http://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy __usually__ linear algebra algorithms are optimized and __should__ run better than euc dist formula, but I don't know for sure. It's worth a shot – andrew Nov 16 '15 at 22:30
  • oh yeah, good idea, thanks man – Chris Nov 16 '15 at 22:36

1 Answer


It is definitely possible to do this via numpy broadcasting. It looks like this:

D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)  # (n_query, n_train) matrix of squared distances
ind = np.argpartition(D, k, axis=1)[:, :k]                   # indices of the k nearest training points per query
predictions = trainX[ind, 2].mean(1)                         # average the last column of those k neighbors
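Note that D holds squared distances; since squaring is monotonic for non-negative values, that is enough to pick the nearest neighbors without ever calling np.sqrt, and np.argpartition finds the k smallest entries per row without fully sorting. If you also want the [[distance, last_value], ...] array described in the question, one way to build it is sketched below (this uses np.take_along_axis, available in NumPy 1.15+, and assumes the D and ind arrays from above; argpartition does not order the k neighbors, so we sort them here):

k_sq = np.take_along_axis(D, ind, axis=1)          # squared distances of the k candidates, shape (n_query, k)
order = np.argsort(k_sq, axis=1)                   # sort the k candidates by distance
ind_sorted = np.take_along_axis(ind, order, axis=1)
xdist = np.dstack([np.sqrt(np.take_along_axis(k_sq, order, axis=1)),  # true Euclidean distances
                   trainX[ind_sorted, 2]])         # shape (n_query, k, 2): (distance, last_value) pairs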

To confirm that this works we can define functions which implement your for-loop method and my broadcasting method, and compare the results:

def with_for_loop(qX, trainX, k):
    def distance(p1,p2):
        sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
        return np.sqrt(sum_of_squares)

    k_nearest_neighbors = [np.array(sorted([(distance(qX[i],trainX[j]),trainX[j][-1])
                                            for j in range(len(trainX))],key=lambda t: t[0]))[:k]
                           for i in range(len(qX))]
    return [np.average([j[1] for j in i])
            for i in k_nearest_neighbors]

def with_broadcasting(qX, trainX, k):
    D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)
    ind = np.argpartition(D, k, axis=1)[:, :k]
    return trainX[ind, 2].mean(1)

# Test the results:
np.random.seed(0)
trainX = np.random.rand(100, 3)
qX = np.random.rand(50, 2)

np.allclose(with_for_loop(qX, trainX, 4),
            with_broadcasting(qX, trainX, 4))
# True

Keep in mind that as your data grows, it will be much more efficient to find the nearest neighbors using a tree-based method like scipy.spatial.cKDTree:

from scipy.spatial import cKDTree

def with_kd_tree(qX, trainX, k):
    # build a KD-tree on the two feature columns, then query the k nearest neighbors of each query point
    dist, ind = cKDTree(trainX[:, :2]).query(qX, k)
    return trainX[ind, 2].mean(1)

np.allclose(with_broadcasting(qX, trainX, 4),
            with_kd_tree(qX, trainX, 4))
# True
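As a side note, cKDTree.query returns the neighbor distances (sorted, nearest first) along with the indices, so the [[distance, last_value], ...] array from the question falls out directly. A quick sketch, reusing the same trainX and qX as above:

dist, ind = cKDTree(trainX[:, :2]).query(qX, 4)
xdist = np.dstack([dist, trainX[ind, 2]])   # shape (n_query, k, 2): (distance, last_value) pairs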

Timing the execution, we can see the substantial performance improvement of these methods with a larger data set:

np.random.seed(0)
trainX = np.random.rand(1000, 3)
qX = np.random.rand(1000, 2)

%timeit with_for_loop(qX, trainX, 4)
1 loops, best of 3: 7.16 s per loop

%timeit with_broadcasting(qX, trainX, 4)
10 loops, best of 3: 57.7 ms per loop

%timeit with_kd_tree(qX, trainX, 4)
1000 loops, best of 3: 1.61 ms per loop
jakevdp