
I am trying to find the nearest neighbours, in a different dataset, for each element of an array of points, in a way that is fast and not memory expensive. My bigger concern is adapting the code to more neighbours rather than to more dimensions.

Based on https://glowingpython.blogspot.com/2012/04/k-nearest-neighbor-search.html?showComment=1355311029556#c8236097544823362777 I have written a k nearest neighbour search, but it is very memory intensive. In my real problem I have 1 mln values to search in and 100k points that need to be matched, so the 1 mln x 10k array is estimated to be 600 GiB.

Is there a better way?

I have tried using bisect (based on "from list of integers, get number closest to a given value"), but I would have to loop 100k times, which will take some time, especially as I have to make many searches.
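
The bisect approach I mean is roughly this (a minimal sketch; the helper name is just illustrative), and it has to be called once per query point:

from bisect import bisect_left

def take_closest(sorted_values, x):
    # return the value in an ascending-sorted list that is closest to x
    i = bisect_left(sorted_values, x)
    if i == 0:
        return sorted_values[0]
    if i == len(sorted_values):
        return sorted_values[-1]
    before, after = sorted_values[i - 1], sorted_values[i]
    return after if after - x < x - before else before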

Good code for small datasets, able to find the K nearest neighbours and easily adaptable to many dimensions (looping by dimension):

import numpy as np
import pandas as pd

def knn_search(search_for, search_in, K=1,
               return_col=["ID"],
               col='A'):

    a_search_in = np.array(search_in[col])
    a_search_for = np.array(search_for[col])

    # full pairwise difference matrix: len(search_for) x len(search_in)
    # (this is the memory-hungry part)
    a = np.tile(a_search_for, [a_search_in.shape[0], 1]).T
    b = np.tile(a_search_in, [a_search_for.shape[0], 1])
    t_diff = a - b

    # squared distances
    diff = np.square(t_diff)

    # sort each row so the nearest neighbours come first
    idx = np.argsort(diff)

    # return the indexes of the K nearest neighbours
    if search_for.shape[0] == 1:
        return idx[:K]
    elif K == 1:
        return search_in.iloc[np.concatenate(idx[:, :K]), :][return_col]
    else:
        tmp = pd.DataFrame()
        for i in range(min(K, search_in.shape[0])):
            tmp = pd.concat([tmp.reset_index(drop=True),
                             search_in.iloc[idx[:, i], :][return_col].reset_index(drop=True)],
                            axis=1)
        return tmp

Good code for 1 dimension and 1 neighbour:

def knn_search_1K_1D(search_for, search_in,
                     return_col=["ID"],
                     col='A'):
    # sort the reference points once, then binary-search the query values
    sort_search_in = search_in.sort_values(col).reset_index()
    idx = np.searchsorted(sort_search_in[col], search_for[col])
    # clip indices that fall past the last element
    idx_pop = np.where(idx > len(sort_search_in) - 1, len(sort_search_in) - 1, idx)

    t = sort_search_in.iloc[idx_pop, :][return_col]
    search_for_nn = pd.concat([search_for.reset_index(drop=True),
                               t.add_prefix('nn_').reset_index(drop=True)],
                              axis=1)
    return search_for_nn
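
One detail worth noting about this searchsorted variant: np.searchsorted only returns the insertion point, so the true nearest value can also be the element just before that index. A minimal two-sided check (the helper name is illustrative, and it assumes at least two reference values) would look roughly like this:

import numpy as np

def nearest_sorted_index(sorted_vals, queries):
    # positional index of the closest element of sorted_vals for each query value
    sorted_vals = np.asarray(sorted_vals)
    queries = np.asarray(queries)
    idx = np.searchsorted(sorted_vals, queries)
    idx = np.clip(idx, 1, len(sorted_vals) - 1)   # keep idx-1 and idx in range
    left = sorted_vals[idx - 1]
    right = sorted_vals[idx]
    return np.where(np.abs(queries - left) <= np.abs(right - queries), idx - 1, idx)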

Current working solution for K nearest neighbours > 1 and 1 dimension, but it takes more than an hour to calculate in the real case scenario mentioned above:

def knn_search_nK_1D(search_for, search_in, K=1,
                     return_col="ID",
                     col='A'):
    # note: this variant averages the K nearest values, so return_col must be numeric
    t = []
    # loop over the query points one by one
    for i in range(search_for.shape[0]):
        y = search_in[col]
        x = search_for.iloc[i, :][col]
        nn = np.nanmean(search_in.iloc[np.argsort(np.abs(np.subtract(y, x)))[0:K], :][return_col])
        t.append(nn)
    search_for_nn = search_for.copy()
    search_for_nn['nn_' + return_col] = t
    return search_for_nn
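
One way to speed up that loop without materialising the full distance matrix would be to process the query points in chunks and use np.argpartition instead of a full sort. This is only a sketch (the function and chunk_size names are illustrative, and the chunk size has to be tuned to the memory budget):

import numpy as np

def knn_1d_chunked(search_for_vals, search_in_vals, K=1, chunk_size=1000):
    # positional indices of the K nearest values in search_in_vals for every query point
    search_for_vals = np.asarray(search_for_vals, dtype=float)
    search_in_vals = np.asarray(search_in_vals, dtype=float)
    out = []
    for start in range(0, len(search_for_vals), chunk_size):
        chunk = search_for_vals[start:start + chunk_size]
        # distance matrix only for this chunk: chunk_size x len(search_in_vals)
        diff = np.abs(chunk[:, None] - search_in_vals[None, :])
        # argpartition finds the K smallest per row without a full sort
        # (the K columns are not ordered among themselves)
        out.append(np.argpartition(diff, K - 1, axis=1)[:, :K])
    return np.vstack(out)

Peak memory here is roughly chunk_size * len(search_in_vals) * 8 bytes, so chunk_size trades speed against memory.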

Example data:

search_for = pd.DataFrame({'ID': ["F", "G"],
                          'A' : [-1,  9]})

search_in = pd.DataFrame({'ID': ["A", "B", "C", "D", "E"],
                          'A' : [1,    2,   3,   4,   5 ]})



t = knn_search(search_for = search_for , 
               search_in  = search_in,
               K = 1, 
               return_col = ['ID'],
               col = 'A')
print(t)
#  ID
#0  A
#4  E

1 Answer


Do you want to have your own implementation? If so, you could use a k-d tree within KNN; it's much more efficient. Otherwise, you could use a KNN library with GPU support, such as knn_cuda.
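
For example, a k-d tree can be built on the dataset you search in and then queried with the points from the other dataset. A minimal sketch with SciPy's cKDTree, reusing the question's search_in / search_for frames:

import numpy as np
from scipy.spatial import cKDTree

K = 1  # number of neighbours
# build the tree on the dataset you search in, query it with the other one
tree = cKDTree(search_in[['A']].to_numpy())
dist, idx = tree.query(search_for[['A']].to_numpy(), k=K)

# idx holds positional indices into search_in for the K nearest neighbours
nearest_ids = search_in['ID'].to_numpy()[idx]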


Update

You could try cuml.
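
A rough sketch, assuming cuML's scikit-learn-like NearestNeighbors API and a RAPIDS/CUDA-capable environment (the data frames are the ones from the question):

import numpy as np
from cuml.neighbors import NearestNeighbors  # requires a RAPIDS/GPU environment

K = 1  # number of neighbours
nn = NearestNeighbors(n_neighbors=K)
nn.fit(search_in[['A']].to_numpy(dtype=np.float32))        # reference points
distances, indices = nn.kneighbors(search_for[['A']].to_numpy(dtype=np.float32))  # query points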

4.Pi.n
  • I do not need my own implementation. As far as I know, I cannot use k-d trees, as I am looking for points from dataset A in dataset B, and k-d trees are good for looking for neighbours inside one dataset. I will have a look at the other mentioned library tomorrow – AAAA Sep 09 '20 at 21:16
  • `knn_cuda` would be great, but it has large-scale issues and does not work with as many points as I need – AAAA Sep 11 '20 at 13:22