Based on the code from "calculating average distance of nearest neighbours in pandas dataframe", how can I adjust it so that it also returns the second and third nearest neighbors in new columns?

Ideally, the number of neighbors to return would be an adjustable parameter.

Sample code:

import numpy as np 
from sklearn.neighbors import NearestNeighbors
import pandas as pd

def nn(x):
    nbrs = NearestNeighbors(
        n_neighbors=2, 
        algorithm='auto', 
        metric='euclidean'
    ).fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56] 
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})

# For each time group, this holds the distances to and the indices of the nearest neighbors
nns = df.drop(columns='car').groupby('time').apply(lambda x: nn(x.to_numpy()))

groups = df.groupby('time')
nn_rows = []
for i, nn_set in enumerate(nns):
    group = groups.get_group(i)
    for j, tup in enumerate(zip(nn_set[0], nn_set[1])):
        nn_rows.append({'time': i,
                    'car': group.iloc[j]['car'],
                    'nearest_neighbour': group.iloc[tup[1][1]]['car'],
                    'euclidean_distance': tup[0][1]})

nn_df = pd.DataFrame(nn_rows).set_index('time')

Resulting dataframe:

>>> nn_df
time car euclidean_distance nearest_neighbour           
0    1   1.414214           3
0    2   1.000000           3
0    3   1.000000           2
1    1   10.049876          3
1    3   10.049876          1
2    4   53.037722          5
2    5   53.037722          4

How can I get the 2nd, 3rd, ..., Nth nearest neighbors and insert them into new columns?

1 Answer

See the scikit-learn documentation for the NearestNeighbors class.

I think your problem can be solved using the n_neighbors parameter, which sets how many nearest neighbors kneighbors returns (both their indices and their distances).

The commonly used value is 2, which finds the single nearest neighbor other than the point itself. When you query the same points the model was fitted on, the first result is the point itself, at distance 0 (barring duplicate points).

To find the second and third nearest neighbors as well, set n_neighbors to 4. This returns the point itself, followed by the next n_neighbors - 1 = 3 nearest neighbors.

# Argument
n_neighbors = 4

# Indices
[point_itself, neighbor_1, neighbor_2, neighbor_3]

# Distances
[ 0, distance_1, distance_2, distance_3]
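
To make that layout concrete, here is a minimal sketch on four made-up points on a line (the `points` array is an assumption for illustration, not data from the question). It relies on there being no duplicate points, so that each point is its own first result.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four illustrative points on a line, so the neighbor order is easy to check by eye.
points = np.array([[0.0], [1.0], [3.0], [7.0]])

nbrs = NearestNeighbors(n_neighbors=4, metric='euclidean').fit(points)
distances, indices = nbrs.kneighbors(points)

# Column 0 is the query point itself (distance 0);
# columns 1, 2 and 3 are the first, second and third nearest neighbors.
first_nearest = indices[:, 1]
second_nearest = indices[:, 2]
third_nearest = indices[:, 3]
```

Each row of `indices` and `distances` belongs to one query point, so the n-th nearest neighbor of every point is simply column `n`.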
skillsmuggler
  • Thanks a lot! But I would need to adjust the code somewhere around "'nearest_neighbour': group.iloc[tup[1][1]]['car']," to get the values of neighbors 2 and 3, wouldn't I? How would I do this? –  Dec 19 '19 at 21:49
  • Yes, you have to index the neighbor from the results. It's simply `group.iloc[tup[1][nth_neighbor]]['car']`. Also keep in mind that the number of neighbors available depends on the sample size: for a group of `n` samples, `n_neighbors` can be at most `n`, and that count includes the point itself. – skillsmuggler Dec 20 '19 at 04:54
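
Putting the answer and the comments together, one possible way to make the number of neighbors adjustable is sketched below. The function name `add_neighbors`, the parameter `k`, and the numbered column names are hypothetical choices, not part of the original code. The sketch uses only `x` and `y` as features (`time` is constant within a group, so the distances are unchanged) and caps `n_neighbors` at the group size, filling the missing ranks with NaN.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def add_neighbors(df, k=3):
    """Return one row per car with its 1st..k-th nearest neighbor in the same time step."""
    rows = []
    for t, group in df.groupby('time'):
        coords = group[['x', 'y']].to_numpy()
        cars = group['car'].to_numpy()
        # k + 1 because the first result is the point itself; never ask for
        # more neighbors than there are samples in the group.
        n = min(k + 1, len(group))
        nbrs = NearestNeighbors(n_neighbors=n, metric='euclidean').fit(coords)
        distances, indices = nbrs.kneighbors(coords)
        for j in range(len(group)):
            row = {'time': t, 'car': cars[j]}
            for rank in range(1, k + 1):
                found = rank < n  # False when the group is too small for this rank
                row[f'nearest_neighbour_{rank}'] = cars[indices[j, rank]] if found else np.nan
                row[f'euclidean_distance_{rank}'] = distances[j, rank] if found else np.nan
            rows.append(row)
    return pd.DataFrame(rows).set_index('time')

# Sample data from the question.
df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})

nn_df = add_neighbors(df, k=2)
```

With this sample, only the group at time 0 has three cars, so the second-neighbor columns are NaN for times 1 and 2. Because of those NaN values, pandas stores the affected neighbor columns as floats.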