- I have a Pandas DataFrame of 2 million entries
- Each entry is a point in a 100-dimensional space
- I want to compute the Euclidean distance between the last N points and all the others to find the closest neighbors (to simplify, let's say find the single closest neighbor for each of the last 5 points)
- I have written the code below for a small dataset, but it is fairly slow, and I'm looking for ideas for improvement (especially speed improvement!)
The logic is the following:
- Split the DataFrame into a target part (the points for which we want to find the closest neighbor) and a compare part (all the other points, among which we look for the neighbor)
- Iterate through the targets
- Compute the squared Euclidean distance between each df_compare point and the target
- Sort df_compare by distance, take the closest point, and save its Name in the target DataFrame
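To make the distance and selection steps concrete, here is a minimal NumPy sketch for a single target. It uses three made-up 2-D points rather than the real data, and the names `compare`, `target` and `sq_dist` are only for illustration.

```python
import numpy as np

# Made-up illustrative points (not taken from the real dataset)
compare = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
target = np.array([1.0, 2.0])

# Squared Euclidean distance of every compare point to the target
sq_dist = ((compare - target) ** 2).sum(axis=1)

# Position of the compare point with the smallest squared distance
closest = int(np.argmin(sq_dist))
print(sq_dist, closest)
```

Since the square root is monotonic, ranking by squared distance gives the same nearest neighbor as ranking by the true Euclidean distance, so the square root can be skipped.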
import pandas as pd
import numpy as np
data = {'Name': ['Ly','Gr','Er','Ca','Cy','Sc','Cr','Cn','Le','Cs','An','Ta','Sa','Ly','Az','Sx','Ud','Lr','Si','Au','Co','Ck','Mj','wa'],
        'dim0': [33,-9,18,-50,39,-23,-19,89,-74,81,8,23,-63,-62,-14,45,39,-46,74,19,7,97,-29,71,],
        'dim1': [-7,75,77,-93,-89,4,-96,-64,41,-27,-87,23,-69,-77,-92,18,21,27,-76,-57,-44,20,15,-76,],
        'dim2': [-31,54,-14,-93,72,-14,65,44,-88,19,48,-51,-25,36,-46,98,8,0,53,-47,-29,95,65,-3,],
        'dim3': [-12,-86,10,93,-79,-55,-6,-79,-12,66,-81,-14,44,84,9,-19,-69,29,-50,-59,35,-28,90,-73,],
}
df = pd.DataFrame(data)
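# Targets: the last 5 points (copy so that adding a column does not touch a view of df)
df_target = df.tail(5).copy()
df_target['closest_neighbour'] = None  # will hold the Name of the closest point
# Compare set: every point except the targets
df_compare = df.drop(df.tail(5).index)
for i, target_row in df_target.iterrows():
    # Accumulate the squared Euclidean distance one dimension at a time
    df_compare['distance'] = 0
    for dim in df_target.columns:
        if dim.startswith('dim'):
            df_compare['distance'] = df_compare['distance'] + (target_row[dim] - df_compare[dim])**2
    # Sort by distance and keep the closest point
    df_compare.sort_values(by=['distance'], ascending=True, inplace=True)
    closest_neighbor = df_compare.head(1)
    df_target.loc[df_target.index==i, 'closest_neighbour'] = closest_neighbor['Name'].iloc[0]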
print(df_target)
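For comparison, here is a hedged sketch of the same logic without the Python-level loops. It computes all target-to-compare squared distances at once through the expansion ||t - c||^2 = ||t||^2 + ||c||^2 - 2 t.c, and uses `argmin` instead of a full sort. The function name `closest_neighbours` and the random `demo` DataFrame are hypothetical and only mirror the column layout used in this post. It has not been timed on the 2 million row data, and the expansion can introduce small floating-point rounding differences for near-ties.

```python
import numpy as np
import pandas as pd

def closest_neighbours(df, n_targets=5):
    """Return the last n_targets rows with the Name of their nearest compare row."""
    dim_cols = [col for col in df.columns if col.startswith('dim')]
    df_target = df.tail(n_targets).copy()
    df_compare = df.iloc[:-n_targets]

    t = df_target[dim_cols].to_numpy(dtype=float)   # shape (n_targets, n_dims)
    c = df_compare[dim_cols].to_numpy(dtype=float)  # shape (n_compare, n_dims)

    # Squared distances for all (target, compare) pairs in one shot
    sq_dist = ((t ** 2).sum(axis=1)[:, None]
               + (c ** 2).sum(axis=1)[None, :]
               - 2.0 * (t @ c.T))

    # Column position of the smallest distance in each target row
    nearest = sq_dist.argmin(axis=1)
    df_target['closest_neighbour'] = df_compare['Name'].to_numpy()[nearest]
    return df_target

# Small random example (hypothetical data, same column layout as above)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(-100, 100, size=(24, 4)),
                    columns=[f'dim{i}' for i in range(4)])
demo.insert(0, 'Name', [f'p{i}' for i in range(24)])
result = closest_neighbours(demo)
print(result)
```

With 5 targets and 2 million compare points, the distance matrix has 10 million entries (about 80 MB as float64), so it should fit in memory. For a much larger N, the targets would probably need to be processed in chunks.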
Any suggestions for improving the logic or the code are welcome! Cheers