I'm working with a medium-sized dataset (shape=(14013L, 46L)).
I want to smooth each sample using its k nearest neighbours (kNN).
I'm training my model with:
NearestNeighbors(n_neighbors, algorithm='ball_tree',
                 metric=sklearn.metrics.pairwise.cosine_distances)
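And the smoothing function is as follows:
import numpy as np

def smooth(x, nbrs, data, alpha):
    """
    input:
        alpha: the smoothing factor
        nbrs: trained NearestNeighbors from sklearn
        data: the original data
              (since NearestNeighbors returns only the index and not the samples)
        x: what we want to smooth (a single 1-D sample)
    output:
        smoothed x with its nearest neighbours
    """
    # kneighbors expects a 2-D array, so query with a single row.
    distances, indices = nbrs.kneighbors(x.reshape(1, -1))
    # Turn each cosine distance d into a similarity weight |1 - d|.
    distances = [abs(1 - d) for d in distances[0]]
    norm = sum(distances)
    if norm == 0:
        print("No neighbours were found.")
        return x
    # The neighbours share a total weight of (1 - alpha).
    distances = [(1 - alpha) * d / norm for d in distances]
    # Look up the neighbour samples from their indices.
    neighbours = [data[i] for i in indices[0]]
    other = np.array([neighbours[i] * distances[i] for i in range(len(distances))])
    # x itself keeps weight alpha.
    z = x * alpha
    z = z.reshape((1, z.shape[0]))
    # Sum the weighted rows (neighbours plus x) into one vector.
    smoothed = sum(np.concatenate((other, z), axis=0))
    return smoothed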
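To make the weighting explicit, here is a minimal sketch of the arithmetic the smoothing performs for a single sample, written on plain Python lists so it runs without NumPy or scikit-learn. The function name `weighted_smooth` and the toy numbers are made up for illustration; in the real code the distances and neighbours come from `nbrs.kneighbors`.

```python
def weighted_smooth(x, neighbours, distances, alpha):
    """Blend x with its neighbours (hypothetical helper, illustration only).

    x:          list of floats, the sample to smooth
    neighbours: list of neighbour samples (each a list like x)
    distances:  cosine distance from x to each neighbour
    alpha:      weight kept on x itself
    """
    # Turn each distance d into a similarity weight |1 - d|.
    weights = [abs(1.0 - d) for d in distances]
    norm = sum(weights)
    if norm == 0:
        # norm is zero only when every |1 - d| is zero,
        # i.e. every returned distance is exactly 1.
        return list(x)
    # The neighbour weights share a total mass of (1 - alpha).
    weights = [(1.0 - alpha) * w / norm for w in weights]
    # Start from alpha * x, then add each weighted neighbour.
    smoothed = [alpha * xi for xi in x]
    for w, nb in zip(weights, neighbours):
        for j, value in enumerate(nb):
            smoothed[j] += w * value
    return smoothed


# Toy example: two neighbours at cosine distances 0.0 and 0.5.
result = weighted_smooth([1.0, 0.0],
                         [[1.0, 0.0], [0.0, 1.0]],
                         [0.0, 0.5],
                         alpha=0.5)
# Degenerate case: all distances are 1, so all weights vanish.
unchanged = weighted_smooth([1.0, 0.0], [[0.0, 1.0]], [1.0], alpha=0.5)
```

In the toy example the neighbour weights are 1.0 and 0.5 before normalisation, so the closer neighbour contributes twice as much as the farther one, and the weights on x and its neighbours sum to 1.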
My questions:
- How is it possible that no neighbours were found? (I experienced it on my dataset, hence the "if norm == 0" check.)
- The fitting ("training") takes 18 seconds, but smoothing ~1000 samples takes more than 20 minutes. I'm willing to accept less accurate results if the smoothing becomes faster. Is that possible? Which parameters should I change to achieve it?