2

I have a latitude and longitude stored in a pandas dataframe (df) with filler spots as NaN for stop_id, stoplat, stoplon, and in another dataframe areadf, which contains more lats/lons and an arbitrary id; this is the information that is to be populated into df.

I'm trying to connect the two so that the stop columns in df contain information about the stop closest to that lat/lon point, or leave it as NaN if there is no stop within a radius R of the point.

Right now my code is as follows, but it takes a reaaaaallly long time (>40 minutes for what I'm running at the moment, before changing area to a df and using itertuples; not sure of what magnitude of difference this will make?) as there are thousands of lat/lon points and stops for each set of data, which is a problem because I need to run this on multiple files. I'm looking for suggestions to make it run faster. I've already made some very minor improvements (e.g. moving to a dataframe, using itertuples instead of iterrows, defining lats and lons outside of the loop to avoid having to retrieve it from df on every loop) but I'm out of ideas for speeding it up. getDistance uses the Haversine formula as defined to get the distance between the stop sign and the given lat,lon point.

import pandas as pd
from math import cos, asin, sqrt

R=5
lats = df['lat']
lons = df['lon']
for stop in areadf.itertuples():
    for index in df.index:
        if getDistance(lats[index],lons[index],
                       stop[1],stop[2]) < R:
            df.at[index,'stop_id'] = stop[0] # id
            df.at[index,'stoplat'] = stop[1] # lat
            df.at[index,'stoplon'] = stop[2] # lon

def getDistance(lat1,lon1,lat2,lon2):
    p = 0.017453292519943295     #Pi/180
    a = (0.5 - cos((lat2 - lat1) * p)/2 + cos(lat1 * p) * 
         cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2)
    return 12742 * asin(sqrt(a)) * 100

Sample data:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  NaN        NaN        NaN
43.694324  -79.334555  NaN        NaN        NaN

areadf
stop_id    stoplat    stoplon
0          43.657675  -79.380145
1          45.435143  -90.543253

Desired:

df
lat        lon         stop_id    stoplat    stoplon
43.657676  -79.380146  0          43.657675  -79.380145
43.694324  -79.334555  NaN        NaN        NaN
amper
  • 31
  • 1
  • 5
  • you could use pypy instead of cython, pypy compiles to c to speed up for loops in python – rachid el kedmiri Oct 04 '17 at 19:41
  • 3
    1. Dont iterate over the dataframes like that, take advantage of pandas 2. use Euclidean distance as a first pass, and pull out a few closest points as it's cheaper than Haversine 3. Subset your data into lat/lon grids, where nothing in grid x and its surrounding 8 cells is within R of anything in grid y, and run on the subset stops vs points. – jeremycg Oct 04 '17 at 19:46
  • @jeremycg Do you have any functions that you suggest I look into for better taking advantage of pandas? Thank you for your reply! – amper Oct 04 '17 at 20:02
  • 2
    Here is a video from this years pycon, where the presenter optimized nearly this exact function in pandas - https://www.youtube.com/watch?v=HN5d490_KKk and the code is here - https://github.com/sversh/pycon2017-optimizing-pandas – jeremycg Oct 04 '17 at 20:04
  • 1
    @jeremycg That was really helpful and interesting, thank you :) – amper Oct 05 '17 at 14:59

1 Answers1

1

One way would be to use the numpy haversine function from here, just slightly modified so that you can account for the radius you want.

The just iterate through your df with apply and find the closest value within a given radius

def haversine_np(lon1, lat1, lon2, lat2,R):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.    
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    if km.min() <= R:
        return km.argmin()
    else:
        return -1

df['dex'] = df[['lat','lon']].apply(lambda row: haversine_np(row[1],row[0],areadf.stoplon.values,areadf.stoplat.values,1),axis=1)

Then merge the two dataframes.

df.merge(areadf,how='left',left_on='dex',right_index=True).drop('dex',axis=1)

         lat        lon  stop_id    stoplat    stoplon
0  43.657676 -79.380146      0.0  43.657675 -79.380145
1  43.694324 -79.334555      NaN        NaN        NaN

NOTE: If you choose to follow this method, you must be sure that both dataframes indexes are reset or that they are sequentially ordered from 0 to total len of df. So be sure to reset the indexes before you run this.

df.reset_index(drop=True,inplace=True)
areadf.reset_index(drop=True,inplace=True) 
DJK
  • 8,924
  • 4
  • 24
  • 40
  • 1
    This worked great for speeding up the algorithm! From hours to seconds, and also a few seconds faster than the approach I had implemented using the pycon optimization mentioned above. Thank you very much! – amper Oct 05 '17 at 15:02