I am new to vectorization... and this seems like a hairy problem to get right with numpy rather than for loops.
I have a set of training data and a list of queries. I need to calculate the distance between each query and every point in the training data, then sort to find the k nearest neighbors. I can implement this fine with for loops, but speed is important. Additionally, each training point is one element longer than the query points, because its last element is the value I eventually average... I will show:
xtrain = [[0.5, 0.3, 0.1232141], ...]  # for a large number of items
xquery = [[0.1, 0.2], [0.3, 0.4], ...]  # for a small number of items
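For the numpy version I'm assuming everything gets converted to arrays first, with the last column of the training data split off as the value I eventually average (the names here are just mine):

import numpy as np

trainX = np.asarray(xtrain)       # shape (n_train, n_features + 1)
queries = np.asarray(xquery)      # shape (n_query, n_features)
train_features = trainX[:, :-1]   # the coordinates only
train_values = trainX[:, -1]      # the value stored with each training point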
I need the Euclidean distance between each query and each training point... so:
def distance(p1, p2):
    sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
    return np.sqrt(sum_of_squares)
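From what I understand of broadcasting, the whole query-by-training distance matrix might come out in one shot, with no Python-level loop; this is my untested guess at it, using the array names from above:

diff = queries[:, np.newaxis, :] - train_features  # broadcasts to shape (n_query, n_train, n_features)
dists = np.sqrt((diff ** 2).sum(axis=-1))          # shape (n_query, n_train)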
Then I need to sort the training points by distance, take the k nearest, and average the values stored in those k training rows...
So basically, I need a function that uses xquery and xtrain to produce an array that looks like the following:
xdist = [[(distance, last_value), ... (k times)] for each query]
The traditional for loops would look like:
def distance(p1, p2):
    sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
    return np.sqrt(sum_of_squares)
qX = data[train_rows:train_rows+5, 0:-1]  # data, train_rows, and trainX come from my loading code (not shown)
k = 4
k_nearest_neighbors = [np.array(sorted([(distance(qX[i], trainX[j]), trainX[j][-1])
                                        for j in range(len(trainX))],
                                       key=lambda t: t[0]))[:k]
                       for i in range(len(qX))]
predictions = [np.average([j[1] for j in i]) for i in k_nearest_neighbors]
I kept the k_nearest_neighbors step compact; I realize it isn't clear... but I think vectorizing from there is easier.
Anyhow, I have no idea how to do this with slices... it just seems like it should be possible...
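The closest I've come to a guess for the sort-and-average step builds on the dists sketch above; np.argsort and the fancy indexing are assumptions on my part, so I don't know if this is the idiomatic way:

k = 4
order = np.argsort(dists, axis=1)                 # per-query ordering of training points by distance
nearest = order[:, :k]                            # indices of the k closest, shape (n_query, k)
predictions = train_values[nearest].mean(axis=1)  # average the stored values per query

Is something like that on the right track?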