I am currently looping through GPS coordinates in a dataframe. I am using this loop to look into another dataframe with GPS coordinates of specific locations and update the original dataframe with the closest location. This works fine but it is VERY slow. Is there a faster way?
Here is sample data:
imports:
from shapely.geometry import Point
import pandas as pd
from geopy import distance
Create sample df1 (df_gps):
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points
Create sample df2 (df_locations):
locations = {'location': ['San Diego', 'Austin', 'Washington DC'],
             'gps': [Point(32.715738, -117.161084), Point(30.267153, -97.7430608), Point(38.89511, -77.03637)]}
df_locations = pd.DataFrame(locations)
Two loops and update:
lst = []  # create empty list to populate new df column
for index, row in df_gps.iterrows():  # iterate over first dataframe rows
    point = row['points']  # pull out GPS point
    closest_distance = 999999  # create container for distance
    closest_location = None  # create container for closest location
    for index1, row1 in df_locations.iterrows():  # iterate over second dataframe
        name = row1['location']  # assign name of location
        point2 = row1['gps']  # assign coordinates of location
        distances = distance.distance((point.x, point.y), (point2.x, point2.y)).miles  # calculate distance
        if distances < closest_distance:  # check to see if distance is closer
            closest_distance = distances  # if distance is closer assign it
            closest_location = name  # if distance is closer assign name
    lst.append(closest_location)  # append closest city
df_gps['closest_city'] = lst  # add new column with closest cities
I'd really like to do this in the fastest way possible. I have read about vectorization in pandas and considered writing a function and using apply, as described in "How to iterate over rows in a DataFrame in Pandas". However, my code needs two nested loops and a conditional, so I don't see how to fit it to that pattern. Thank you for the help.
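One direction that might remove both loops is to compute every point-to-location distance at once with NumPy broadcasting and then take the row-wise minimum. The block below is only a minimal sketch under stated assumptions, not a drop-in replacement. It swaps geopy's geodesic distance for the spherical haversine formula, so distances can differ slightly and near-ties could resolve differently. The helper `haversine_miles` and the `lat`/`lon` column names are hypothetical, and the coordinates are stored as plain floats rather than shapely Points.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; assumption of the spherical model


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees.

    Hypothetical helper (not part of geopy); inputs broadcast like NumPy arrays.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Same sample coordinates as in the question, as plain latitude/longitude columns.
df_gps = pd.DataFrame({
    'lat': [37.773972, 35.4675602, 42.35843],
    'lon': [-122.431297, -97.5164276, -71.05977],
})
df_locations = pd.DataFrame({
    'location': ['San Diego', 'Austin', 'Washington DC'],
    'lat': [32.715738, 30.267153, 38.89511],
    'lon': [-117.161084, -97.7430608, -77.03637],
})

# Shape (n_points, 1) against shape (n_locations,) broadcasts to a full
# (n_points, n_locations) distance matrix in a single call.
dist = haversine_miles(
    df_gps['lat'].to_numpy()[:, None],
    df_gps['lon'].to_numpy()[:, None],
    df_locations['lat'].to_numpy(),
    df_locations['lon'].to_numpy(),
)

# Column index of the nearest location for each GPS point, mapped back to its name.
nearest = dist.argmin(axis=1)
df_gps['closest_city'] = df_locations['location'].to_numpy()[nearest]
df_gps['distance_miles'] = dist[np.arange(len(df_gps)), nearest]
```

With the shapely-based frames from the question, the latitude and longitude arrays could be extracted first, for example with `df_gps['points'].map(lambda p: p.x)` for latitude, since the Points there are built as (lat, lon). The distance matrix holds n_points × n_locations floats, so very large inputs would probably need to be processed in chunks or with a spatial index instead.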