Adding a column to a DataFrame from two pandas DataFrames, currently using two loops with a conditional: is there a faster way?

Question

I am currently looping through GPS coordinates in a dataframe. I am using this loop to look into another dataframe with GPS coordinates of specific locations and update the original dataframe with the closest location. This works fine but it is VERY slow. Is there a faster way?

Here is sample data:

imports:

from shapely.geometry import Point
import pandas as pd
from geopy import distance

Create sample df1

gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points

Create sample df2

locations = {'location':['San Diego', 'Austin', 'Washington DC'],
        'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)

Two loops and update:

lst = [] #create empty list to populate new df column
for index , row in df_gps.iterrows(): # iterate over first dataframe rows
    point = row['points'] # pull out GPS point
    closest_distance = 999999 # create container for distance
    closest_location = None #create container for closest location
    for index1 , row1 in df_locations.iterrows(): # iterate over second dataframe
        name = row1['location'] # assign name of location
        point2 = row1['gps'] # assign coordinates of location
        distances = distance.distance((point.x , point.y) , (point2.x , point2.y)).miles # calculate distance
        if distances < closest_distance: # check to see if distance is closer
            closest_distance = distances # if distance is closer assign it
            closest_location = name # if distance is closer assign name
    lst.append(closest_location) # append closest city
df_gps['closest_city'] = lst # add new column with closest cities

I'd really like to do this in the fastest way possible. I have read about vectorization in pandas and thought about writing a function and then using apply, as mentioned in "How to iterate over rows in a DataFrame in Pandas". However, I need two loops and a conditional in my code, so the pattern breaks down. Thank you for the help.

Are your dataframes really pandas DataFrames, or GeoPandas GeoDataFrames with a geometry column? – Corralien, Sep 07 '21 at 21:25
Just pandas, just like the sample data. The real question is how to optimize the loops. – kdbaseball8, Sep 07 '21 at 21:33

score 1 · Accepted Answer · answered Sep 07 '21 at 21:52

You can use KDTree from SciPy:

from scipy.spatial import KDTree

# Extract lat/lon from your dataframes
points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)

distances, indices = KDTree(cities).query(points)

df_gps['closest_city'] = df_locations.iloc[indices]['location'].values
df_gps['distance'] = distances  # planar (Euclidean) distance in degrees, not miles

You can use np.where to filter out distances that are too far away.
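
As a rough sketch of that filtering step: the snippet below blanks out the city name wherever the distance exceeds a cutoff. The `MAX_DISTANCE` threshold and the sample values are hypothetical and only there to make the example self-contained; the cutoff must be in the same unit as the `distance` column.

```python
import numpy as np
import pandas as pd

# Hypothetical cutoff, expressed in the same unit as the 'distance' column
MAX_DISTANCE = 500.0

# Made-up stand-in for the df_gps columns produced by the KDTree query
df = pd.DataFrame({
    'closest_city': ['San Diego', 'Austin', 'Washington DC'],
    'distance': [810.2, 35.5, 120.0],
})

# Keep the city name when it is close enough, otherwise mark it as missing
df['closest_city'] = np.where(df['distance'] <= MAX_DISTANCE,
                              df['closest_city'], None)
print(df)
```

Rows beyond the cutoff end up with a missing value in `closest_city`, so they can be dropped later with `dropna` if needed.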

For performance, check my answer for a similar problem with 25k rows for df_gps and 200k for df_locations.

answered Sep 07 '21 at 21:52

Corralien


Corralien, thank you. Perfect response, and thank you for the link to the article, which I wish I had found before posting. – kdbaseball8 Sep 07 '21 at 22:06
Curious, though: how does the KDTree deal with Earth geometry? If I understand the method right, it places the coordinates on a 2-dimensional plane and then selects the nearest neighbor rather than doing any true geospatial distance calculation. That is why the distance returned is not in miles or a similar unit. Have you noticed projection errors with this method? – kdbaseball8 Sep 07 '21 at 22:12
Although the solution works in some cases it will not in all cases. This page helps https://kanoki.org/2019/12/27/how-to-calculate-distance-in-python-and-pandas-using-scipy-spatial-and-distance-functions/ – kdbaseball8 Sep 07 '21 at 22:21
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html might help those out there too. – kdbaseball8 Sep 07 '21 at 22:27
Can you check this post too: https://stackoverflow.com/a/67780643/15239951. Don't hesitate to upvote :) – Corralien Sep 07 '21 at 22:34
Thanks, I'll take a look and try to post a complete solution here too. – kdbaseball8 Sep 07 '21 at 22:38
The goal is to find the nearest neighbors, so the distance between two points is still usable even if the projection is not right. The distortion should be minimal. – Corralien Sep 07 '21 at 22:40
If you want to use sklearn, choose BallTree instead of DistanceMetric. – Corralien Sep 07 '21 at 22:44
I'd think so as long as the points are near each other, but thinking it through, straight-line distance is much different from having to traverse a globe. – kdbaseball8 Sep 07 '21 at 22:45
unless we think the world is flat ;) – kdbaseball8 Sep 07 '21 at 22:47
It depends on your projection. That's why I asked you if you used GeoPandas. I guess you use `WGS84`? – Corralien Sep 07 '21 at 22:48
https://stackoverflow.com/questions/57780614/how-to-calculate-minimum-distance-using-lat-lon-data-in-python – kdbaseball8 Sep 07 '21 at 22:52
BallTree adds the haversine distance capability so I think that should work. Thank you again for all your help. – kdbaseball8 Sep 07 '21 at 22:56
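
To make the projection concern from the comments concrete, here is a minimal sketch comparing planar distance on raw degrees (what a KDTree built on latitude/longitude pairs measures) with haversine great-circle distance. The two candidate points are hypothetical and deliberately placed at high latitude, where a degree of longitude covers far less ground than a degree of latitude; the Earth radius is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def planar_deg(p, q):
    """Euclidean distance on raw degrees, ignoring the Earth's curvature."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical points: one 10 degrees east, one 7 degrees north of the query
query = (60.0, 0.0)
candidates = {'east': (60.0, 10.0), 'north': (67.0, 0.0)}

nearest_planar = min(candidates, key=lambda k: planar_deg(query, candidates[k]))
nearest_sphere = min(candidates, key=lambda k: haversine_km(query, candidates[k]))
print(nearest_planar, nearest_sphere)
```

In this constructed case the planar ranking picks the northern point while the haversine ranking picks the eastern one, so the two methods can disagree when candidates are far apart or far from the equator. For nearby points at moderate latitudes the rankings will usually match.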

kdbaseball8 · Answer 2 · 2021-09-08T15:27:00.600

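Based on Corralien's insight, here is the final answer in code:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# The haversine metric expects (lat, lon) pairs in radians
points = df_gps['points'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)

# Pass the metric by name; DistanceMetric lives in sklearn.metrics in recent
# scikit-learn releases, so the string is the more portable choice
tree = BallTree(cities, metric='haversine')
dists, idx = tree.query(points)  # nearest city (k=1) for every point

# Distances come back in radians; multiply by the Earth's radius (about 3956 miles)
df_gps['dist'] = dists.flatten() * 3956
df_gps['closest_city'] = df_locations.iloc[idx.flatten()]['location'].values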
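
As a cross-check on the tree-based result, the two loops from the question can also be replaced by NumPy broadcasting plus `argmin`, assuming the full points-by-cities distance matrix fits in memory. This is only a sketch and not part of either answer: `haversine_matrix` is a hypothetical helper, and the coordinates are the sample data from the question written as plain (lat, lon) tuples.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956  # same approximate radius as in the answer above

def haversine_matrix(a_deg, b_deg, radius=EARTH_RADIUS_MILES):
    """Pairwise great-circle distances between (n, 2) and (m, 2) arrays
    of (lat, lon) pairs in degrees. Returns an (n, m) array."""
    a = np.radians(np.asarray(a_deg, dtype=float))[:, None, :]  # (n, 1, 2)
    b = np.radians(np.asarray(b_deg, dtype=float))[None, :, :]  # (1, m, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(h))

# Sample data from the question as (lat, lon) tuples
gps = [(37.773972, -122.431297), (35.4675602, -97.5164276), (42.35843, -71.05977)]
city_names = np.array(['San Diego', 'Austin', 'Washington DC'])
city_coords = [(32.715738, -117.161084), (30.267153, -97.7430608), (38.89511, -77.03637)]

dist = haversine_matrix(gps, city_coords)           # shape (3, 3), in miles
nearest = dist.argmin(axis=1)                       # index of closest city per point
closest_city = city_names[nearest]
closest_dist = dist[np.arange(len(gps)), nearest]   # distance to that city
print(closest_city, closest_dist)
```

The matrix grows as n times m, so for inputs on the scale mentioned in the accepted answer (25k by 200k rows) the tree-based query is the more practical option.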

edited Sep 08 '21 at 15:27

answered Sep 07 '21 at 23:28

kdbaseball8


Nice work. I updated your post to include modules from `sklearn` to be reproducible by other users. +1 x 2 – Corralien Sep 08 '21 at 06:46
No issues, I can accept yours. – kdbaseball8 Sep 16 '21 at 14:39
