I am trying to find the nearest neighbours for each element of a new array of points in a different dataset, in a way that is fast and not memory expensive. My bigger concern is adapting the code for more neighbours rather than for more dimensions.
Based on https://glowingpython.blogspot.com/2012/04/k-nearest-neighbor-search.html?showComment=1355311029556#c8236097544823362777 I have written a k nearest neighbour search, but it is very memory intensive. In my real problem I have 1 mln values to search in and 100k points that need to be matched, so the 1 mln x 100k array is estimated at ~600 GiB.
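A quick back-of-the-envelope check of that tiled matrix (assuming float64; the exact figure depends on the dtype):

n_in, n_for = 1_000_000, 100_000
print(n_in * n_for * 8 / 2**30, "GiB")  # ~745 GiB at 8 bytes per value, same order as the estimate above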
Is there a better way?
I have tried using bisect (based on "from list of integers, get number closest to a given value"), but I would have to loop 100k times, which will take some time, especially since I have to make many such searches.
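For reference, the per-point lookup I tried looks roughly like this (a sketch of the bisect approach from that answer; closest_sorted is my own naming):

import bisect

def closest_sorted(sorted_vals, x):
    # the nearest value is either at the insertion point or just before it
    i = bisect.bisect_left(sorted_vals, x)
    if i == 0:
        return sorted_vals[0]
    if i == len(sorted_vals):
        return sorted_vals[-1]
    before, after = sorted_vals[i - 1], sorted_vals[i]
    return before if x - before <= after - x else after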
Good code for small datasets - it finds the K nearest neighbours and is easily adaptable to many dimensions (by looping over dimensions):
import numpy as np
import pandas as pd

def knn_search(search_for, search_in, K = 1,
               return_col = ["ID"],
               col = 'A'):
    a_search_in = np.array(search_in[col])
    a_search_for = np.array(search_for[col])
    # build the full pairwise difference matrix, shape
    # (len(search_for), len(search_in)) -- this is the memory hog
    a = np.tile(a_search_for, [a_search_in.shape[0], 1]).T
    b = np.tile(a_search_in, [a_search_for.shape[0], 1])
    t_diff = a - b
    diff = np.square(t_diff)
    # for every query row, column indexes sorted by distance
    idx = np.argsort(diff, axis=1)
    # return the indexes of the K nearest neighbours
    if search_for.shape[0] == 1:
        return idx[0, :K]
    elif K == 1:
        return search_in.iloc[idx[:, 0], :][return_col]
    else:
        tmp = pd.DataFrame()
        for i in range(min(K, search_in.shape[0])):
            tmp = pd.concat([tmp.reset_index(drop=True),
                             search_in.iloc[idx[:, i], :][return_col].reset_index(drop=True)],
                            axis=1)
        return tmp
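One way to keep this approach but cap the memory would be to process search_for in chunks, so the tiled matrix never exceeds chunk_size x len(search_in). A sketch (knn_search_chunked is a hypothetical wrapper and assumes every chunk has more than one row):

def knn_search_chunked(search_for, search_in, K = 1,
                       return_col = ["ID"], col = 'A',
                       chunk_size = 1000):
    parts = []
    for start in range(0, search_for.shape[0], chunk_size):
        chunk = search_for.iloc[start:start + chunk_size, :]
        # each call only ever tiles a chunk_size x len(search_in) matrix
        parts.append(knn_search(chunk, search_in, K = K,
                                return_col = return_col, col = col))
    return pd.concat(parts).reset_index(drop=True)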
Good code for 1 dimension and 1 neighbour:
def knn_search_1K_1D(search_for, search_in,
                     return_col = ["ID"],
                     col = 'A'):
    sort_search_in = search_in.sort_values(col).reset_index()
    # searchsorted gives the insertion point; the nearest value is either
    # there or one position to the left, so compare both candidates
    idx = np.searchsorted(sort_search_in[col], search_for[col])
    lo = np.clip(idx - 1, 0, len(sort_search_in) - 1)
    hi = np.clip(idx, 0, len(sort_search_in) - 1)
    lo_diff = np.abs(search_for[col].values - sort_search_in[col].values[lo])
    hi_diff = np.abs(sort_search_in[col].values[hi] - search_for[col].values)
    idx_pop = np.where(lo_diff <= hi_diff, lo, hi)
    t = sort_search_in.iloc[idx_pop, :][return_col]
    search_for_nn = pd.concat([search_for.reset_index(drop=True),
                               t.add_prefix('nn_').reset_index(drop=True)],
                              axis=1)
    return search_for_nn
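Since the data is sorted, the K nearest neighbours of any query must lie within K positions of its insertion point, so the same searchsorted idea might extend to K > 1 by checking only 2K candidates per query. A sketch, untested at full scale (knn_search_sorted_window is my own naming):

def knn_search_sorted_window(search_for, search_in, K = 1, col = 'A'):
    s = search_in.sort_values(col).reset_index(drop=True)
    vals = s[col].values
    ins = np.searchsorted(vals, search_for[col].values)
    out = []
    for x, i in zip(search_for[col].values, ins):
        # the K nearest values sit inside this window of at most 2K candidates
        cand = np.arange(max(i - K, 0), min(i + K, len(vals)))
        order = np.argsort(np.abs(vals[cand] - x))[:K]
        out.append(cand[order])
    # positional indexes into the sorted frame, one array of K per query
    return out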
Current working solution for K nearest neighbours > 1 in 1 dimension, but it takes more than an hour to calculate in the real case scenario mentioned above:
def knn_search_nK_1D(search_for, search_in, K = 1,
                     return_col = 'A',
                     col = 'A'):
    # return_col is assumed to be a single numeric column, because the
    # values of the K nearest neighbours are averaged with nanmean
    t = []
    # looping over the query points one by one
    for i in range(search_for.shape[0]):
        y = search_in[col]
        x = search_for.iloc[i, :][col]
        # full argsort of all |y - x| distances just to take the K smallest
        nn = np.nanmean(search_in.iloc[np.argsort(np.abs(np.subtract(y, x)))[0:K], :][return_col])
        t.append(nn)
    search_for_nn = search_for.copy()
    search_for_nn['nn_' + return_col] = t
    return search_for_nn
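Inside that loop, np.argpartition could replace the full np.argsort, since only the K smallest distances are needed, not a total order. A sketch of a drop-in replacement for the two lines in the loop body (x, K, col and return_col as above):

# same K nearest for one query point, without fully sorting 1 mln distances
d = np.abs(search_in[col].values - x)
nn = np.nanmean(search_in.iloc[np.argpartition(d, K)[:K], :][return_col])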
Example data:
search_for = pd.DataFrame({'ID': ["F", "G"],
                           'A': [-1, 9]})
search_in = pd.DataFrame({'ID': ["A", "B", "C", "D", "E"],
                          'A': [1, 2, 3, 4, 5]})
t = knn_search(search_for = search_for,
               search_in = search_in,
               K = 1,
               return_col = ['ID'],
               col = 'A')
print(t)
# ID
#0 A
#4 E
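For comparison, a minimal sketch of the same K = 1 lookup with scipy.spatial.cKDTree (assuming SciPy is acceptable), which queries neighbours without materialising the full pairwise matrix:

from scipy.spatial import cKDTree

# build the tree once over the points (reshaped to (n, 1) for 1-D data)
tree = cKDTree(search_in[['A']].values)
dist, idx = tree.query(search_for[['A']].values, k=1)
print(search_in.iloc[idx][['ID']])
#   ID
# 0  A
# 4  E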