calculating average distance of nearest neighbours in pandas dataframe

Question

I have a set of objects (cars) and their positions over time. I would like to get the distance between each car and its nearest neighbour, and calculate the average of these distances for each time point. An example dataframe is as follows:

 time = [0, 0, 0, 1, 1, 2, 2]
 x = [216, 218, 217, 280, 290, 130, 132]
 y = [13, 12, 12, 110, 109, 3, 56]
 car = [1, 2, 3, 1, 3, 4, 5]
 df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
 df

         x       y      car
 time
  0     216     13       1
  0     218     12       2
  0     217     12       3
  1     280     110      1
  1     290     109      3
  2     130     3        4
  2     132     56       5

For each time point, I would like to know the nearest car neighbour for each car. Example:

df2

          car    nearest_neighbour    euclidean_distance  
 time
  0       1            3                    1.41
  0       2            3                    1.00
  0       3            1                    1.41
  1       1            3                    10.05
  1       3            1                    10.05
  2       4            5                    53.04
  2       5            4                    53.04

I know I can calculate the pairwise distances between cars from "How to apply euclidean distance function to a groupby object in pandas dataframe?", but how do I get the nearest neighbour for each car?

After that, it seems simple enough to get an average of the distances for each frame using groupby, but it's the second step that really throws me off. Help appreciated!

Possible duplicate of [How to apply euclidean distance function to a groupby object in pandas dataframe?](https://stackoverflow.com/questions/51064346/how-to-apply-euclidean-distance-function-to-a-groupby-object-in-pandas-dataframe) — Haleemur Ali, Jul 12 '18 at 12:49
Hi, I used the same example, but I'm trying to ask a different question here. — UserR6, Jul 12 '18 at 12:54
ah, it wasn't clear to me what the difference is between this question and the other. the final desired output looks exactly the same. please edit your question. removing the close vote. — Haleemur Ali, Jul 12 '18 at 12:57

Bacon · Accepted Answer · 2018-07-16T17:34:32.027

It might be a bit overkill, but you could use NearestNeighbors from scikit-learn.

An example:

import numpy as np 
from sklearn.neighbors import NearestNeighbors
import pandas as pd

def nn(x):
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56] 
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})

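# For each time point: distances to, and within-group positions of, each car's two nearest points.
# Column 0 is normally the car itself (distance 0); column 1 is its nearest neighbour.
# (df.drop('car', 1) and as_matrix() no longer work in current pandas, so select x and y and use to_numpy().)
nns = df.groupby('time')[['x', 'y']].apply(lambda g: nn(g.to_numpy()))

groups = df.groupby('time')
nn_rows = []
# Iterate over the actual time labels rather than assuming they are 0, 1, 2, ...
for t, (distances, indices) in nns.items():
    group = groups.get_group(t)
    for j in range(len(group)):
        nn_rows.append({'time': t,
                        'car': group.iloc[j]['car'],
                        'nearest_neighbour': group.iloc[indices[j][1]]['car'],
                        'euclidean_distance': distances[j][1]})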

nn_df = pd.DataFrame(nn_rows).set_index('time')

Result:

      car  euclidean_distance  nearest_neighbour
time                                            
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4

(Note that at time 0, car 3's nearest neighbor is car 2. sqrt((217-216)**2 + 1) is about 1.4142135623730951 while sqrt((218-217)**2 + 0) = 1)
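The arithmetic in the note can be checked directly. This is only a small sketch using the time 0 positions from the example data; the `pos` dictionary and the `dist` helper are names introduced here for illustration.

```python
import math

# Positions at time 0, copied from the example data: car id -> (x, y).
pos = {1: (216, 13), 2: (218, 12), 3: (217, 12)}

def dist(a, b):
    """Euclidean distance between cars a and b at time 0."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

d31 = dist(3, 1)  # sqrt(1**2 + 1**2)
d32 = dist(3, 2)  # sqrt(1**2 + 0**2)
print(d31, d32)
```

Since `d32` is smaller than `d31`, car 2 is the nearest neighbour of car 3 at time 0.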

Haleemur Ali · Answer 2 · 2018-07-13T14:29:23.233

Use cdist from scipy.spatial.distance to get a matrix of the distances from each car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.

example (for time == 0):

import numpy as np
from scipy.spatial.distance import cdist

X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0.        , 2.23606798, 1.41421356],
       [2.23606798, 0.        , 1.        ],
       [1.41421356, 1.        , 0.        ]])

Use np.argsort to get the indexes that would sort the distance-matrix. The first column is just the row number because the diagonal elements are 0.

idx = np.argsort(dist)
idx
array([[0, 2, 1],
       [1, 2, 0],
       [2, 1, 0]], dtype=int64)

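Then pick out the nearest cars and their distances using idx. Column 0 of idx is the row number and column 1 is the position of each car's nearest neighbour:

dist[idx[:,0], idx[:,1]]
array([1.41421356, 1.        , 1.        ])

df[df.time==0].car.values[idx[:,1]]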
array([3, 3, 2], dtype=int64)
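The argsort approach relies on each car sorting first in its own row. If two cars ever share exactly the same position, that tie might not resolve in the car's own favour. One possible alternative, sketched here as an assumption rather than as part of the original answer, is to set the diagonal to infinity and take the row-wise argmin. The arrays `X` and `cars` below simply restate the time 0 data.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Time 0 positions and car ids, copied from the example data.
X = np.array([[216, 13], [218, 12], [217, 12]], dtype=float)
cars = np.array([1, 2, 3])

dist = cdist(X, X)
np.fill_diagonal(dist, np.inf)   # a car is never its own neighbour
nearest = dist.argmin(axis=1)    # position of the closest other car in each row
nearest_dist = dist[np.arange(len(X)), nearest]

print(cars[nearest])   # nearest neighbour of cars 1, 2, 3
print(nearest_dist)    # the corresponding distances
```

For this data the result matches the argsort version: neighbours 3, 3, 2 with distances of about 1.414, 1 and 1.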

Combine the above logic into a function that returns the required dataframe:

 def closest(df):
     X = df[['x', 'y']]
     dist = cdist(X, X)
     v = np.argsort(dist)
     return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                      nearest_neighbour=df.car.values[v[:, 1]])

Then use it with groupby, and finally drop the index, because groupby-apply adds an additional index level:

df.groupby('time').apply(closest).reset_index(drop=True)

   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4

By the way, your sample output is wrong for time 0: car 3's nearest neighbour is car 2, at distance 1.00. My answer and Bacon's answer both show the correct result.
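The question also asks for the average nearest-neighbour distance at each time point. A minimal sketch of that last step follows, assuming the same example data; the helper `nearest_distances` and the series name `mean_nearest_distance` are names introduced here, not taken from either answer.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({'time': [0, 0, 0, 1, 1, 2, 2],
                   'x': [216, 218, 217, 280, 290, 130, 132],
                   'y': [13, 12, 12, 110, 109, 3, 56],
                   'car': [1, 2, 3, 1, 3, 4, 5]})

def nearest_distances(group):
    """Distance from each car in one time point to its closest other car."""
    xy = group[['x', 'y']].to_numpy(dtype=float)
    dist = cdist(xy, xy)
    np.fill_diagonal(dist, np.inf)  # exclude each car's distance to itself
    return dist.min(axis=1)

# Iterate over the groups explicitly, so the sketch does not depend on
# how groupby.apply treats the grouping column in a given pandas version.
mean_nn = pd.Series(
    {t: nearest_distances(g).mean() for t, g in df.groupby('time')},
    name='mean_nearest_distance',
)
mean_nn.index.name = 'time'
print(mean_nn)
```

Note that a time point containing a single car has no neighbour, so this sketch would report an infinite distance for it; such groups may need to be filtered out first.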
