
I have two DataFrames: df1 holds locations of places and df2 holds locations of stations. I am trying to find a more efficient way to apply a distance function that finds which stations are within a certain range and returns the station's name. If the distance function is a latitude difference of +/- 1, this is my expected outcome:

# df1
   Lat  Long 
0   30    31    
1   37    48    
2   54    62    
3   67    63     

# df2
   Station_Lat  Station_Long Station
0           30            32     ABC    
1           43            48     DEF    
2           84            87     GHI    
3           67            62     JKL    

# ....Some Code that compares df1 and df2....

# result
   Lat  Long  Station_Lat  Station_Long Station
    30    31           30            32     ABC
    67    63           67            62     JKL

I have a solution that uses a cartesian product/Cross Join to apply a function on a single DataFrame. This solution works, but I have millions of rows in a true dataset which makes a cartesian product very slow.

import pandas as pd
df1 = pd.DataFrame({'Lat' : [30, 37, 54, 67],
                    'Long' : [31, 48, 62, 63]})

df2 = pd.DataFrame({'Station_Lat' : [30, 43, 84, 67],
                    'Station_Long' : [32, 48, 87, 62],
                    'Station':['ABC', 'DEF','GHI','JKL']})

# creating a 'key' for a cartesian product
df1['key'] = 1
df2['key'] = 1

# Creating the cartesian Join
df3 = pd.merge(df1, df2, on='key')

# some distance function that returns True or False
# assuming the distance function I want is +/- 1 of two values
def some_distance_func(x,y):
    return x-y >= -1 and x-y <= 1

# applying the function to a column using vectorized approach
# https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c
df3['t_or_f'] =  list(map(some_distance_func,df3['Lat'],df3['Station_Lat']))

# result
print(df3.loc[df3['t_or_f']][['Lat','Long','Station_Lat','Station_Long','Station']].reset_index(drop=True))

I have also tried a looping approach with iterrows(), but that is slower than the cross join method. Is there a more pythonic/efficient way to achieve what I am looking for?

MattR

3 Answers


You can use the pd.cut function to assign the latitudes to the proper intervals, then simply merge the two DataFrames to obtain the result:

bins = [(i-1,i+1) for i in df1['Lat']]
bins = [item for subbins in bins for item in subbins]

df1['Interval'] = pd.cut(df1['Lat'], bins=bins)
df2['Interval'] = pd.cut(df2['Station_Lat'], bins=bins)

pd.merge(df1,df2)

This solution is slightly faster than yours: 10.2 ms ± 201 µs per loop vs. 12.2 ms ± 1.34 ms per loop.

treskov
  • I've been using this answer but for anyone stumbling upon this question you may receive an error `ValueError: bins must increase monotonically.` - it looks like this is a bug in `pandas`. It was *supposed* to be fixed in 0.25, but has not. – MattR Dec 27 '19 at 21:29
  • The way to fix the above error is to assign `bins` a third time to a sorted set. `bins=sorted(set(bins))`. in `pd.cut` the bins must be unique and increase. so sorting and turning it into a `set` fixes the issue! – MattR Dec 27 '19 at 21:42
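Combining this answer with the `bins = sorted(set(bins))` fix from the comments, a self-contained sketch (using the example frames from the question) might look like:

```python
import pandas as pd

df1 = pd.DataFrame({'Lat': [30, 37, 54, 67],
                    'Long': [31, 48, 62, 63]})
df2 = pd.DataFrame({'Station_Lat': [30, 43, 84, 67],
                    'Station_Long': [32, 48, 87, 62],
                    'Station': ['ABC', 'DEF', 'GHI', 'JKL']})

# one (lat-1, lat+1) interval per place, flattened into a list of bin edges
bins = [(i - 1, i + 1) for i in df1['Lat']]
bins = [edge for pair in bins for edge in pair]
# pd.cut requires unique, monotonically increasing bins (the fix above)
bins = sorted(set(bins))

df1['Interval'] = pd.cut(df1['Lat'], bins=bins)
df2['Interval'] = pd.cut(df2['Station_Lat'], bins=bins)

# rows whose latitudes fall into the same interval are joined
result = pd.merge(df1, df2)
print(result[['Lat', 'Long', 'Station_Lat', 'Station_Long', 'Station']])
```

On this sample data the merge returns the ABC and JKL rows, matching the expected outcome in the question.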

Maybe this is faster. First, sort df2:

df2= df2.sort_values("Station_Lat")

After sorting, you can use `searchsorted`:

df1["idx"]=df2.Station_Lat.searchsorted(df1.Lat)

"idx" is the 'nearest' station lat. index, or idx+1 is this. Maybe you need duplicate the last row in df2 (see the "searchsorted doc) to avoid over indexing it. The use "apply" with this custom function:

def dist(row):
    if abs(row.Lat - df2.loc[row.idx].Station_Lat) <= 1:
        return df2.loc[row.idx].Station
    elif abs(row.Lat - df2.loc[row.idx + 1].Station_Lat) <= 1:
        return df2.loc[row.idx + 1].Station
    return False

df1.apply(dist,axis=1)                                                                                               

0      ABC
1    False
2    False
3      JKL
dtype: object

Edit: Because dist() assumes that df2.index is ordered and monotonically increasing (see row.idx+1), the first code line must be corrected:

df2= df2.sort_values("Station_Lat").reset_index(drop=True)

And dist() is somewhat faster this way (though it still doesn't beat the Cartesian-product method):

def dist(row):
    idx = row.idx
    lat1, lat2 = df2.loc[idx:idx + 1, 'Station_Lat']
    if abs(row.Lat - lat1) <= 1:
        return df2.loc[idx, 'Station']
    elif abs(row.Lat - lat2) <= 1:
        return df2.loc[idx + 1, 'Station']
    return False
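Assembled into one self-contained snippet, with the sort/reset fix applied and a small clamp on idx standing in for the "duplicate the last row" guard mentioned above (the clamp is my addition, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'Lat': [30, 37, 54, 67],
                    'Long': [31, 48, 62, 63]})
df2 = pd.DataFrame({'Station_Lat': [30, 43, 84, 67],
                    'Station_Long': [32, 48, 87, 62],
                    'Station': ['ABC', 'DEF', 'GHI', 'JKL']})

# sort and rebuild a monotonic integer index so idx+1 look-ups are valid
df2 = df2.sort_values('Station_Lat').reset_index(drop=True)

# candidate index of the nearest station latitude for every place
df1['idx'] = df2['Station_Lat'].searchsorted(df1['Lat'])

def dist(row):
    # clamp so idx+1 never runs past the last station
    # (guards the over-indexing case instead of duplicating the last row)
    idx = min(row.idx, len(df2) - 2)
    lat1, lat2 = df2.loc[idx:idx + 1, 'Station_Lat']
    if abs(row.Lat - lat1) <= 1:
        return df2.loc[idx, 'Station']
    elif abs(row.Lat - lat2) <= 1:
        return df2.loc[idx + 1, 'Station']
    return False

matches = df1.apply(dist, axis=1)
print(matches)
```

On the sample data this reproduces the ABC / False / False / JKL series shown earlier.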
kantal
  • I really like this answer, didn't know that `searchsorted` was something I could leverage! On these sample dataframes, my `list(map(some_distance_func,df3['Lat'],df3['Station_Lat']))` is still faster. mine was `60.3 µs ± 20.5 µs per loop` and this answer is `2.83 ms ± 575 µs per loop`. But I may be able to make this more efficient! – MattR Dec 26 '19 at 16:11

How about a lambda?

df3[df3.apply(lambda x, col1='Lat', col2='Station_Lat': x[col1]-x[col2] >= -1 and x[col1]-x[col2] <= 1, axis=1)]['Station']

Output:

0     ABC
15    JKL

Edit: Here's a second solution. (Note: This also uses abs() since >=-1 and <= 1 seems redundant.)

for i in df1.index:
    for j in df2.index:
        if abs(df1.loc[i, 'Lat'] - df2.loc[j, 'Station_Lat']) <=1:
            print(df2.loc[j, 'Station'])

Or, in list comprehension form:

df2.loc[[j for i in df1.index for j in df2.index if abs(df1.loc[i, 'Lat'] - df2.loc[j, 'Station_Lat']) <=1], 'Station']

Output:

ABC
JKL
Mark Moretto
  • Thanks for this answer, unfortunately this uses the DataFrame which is created by the cross join/cartesian product. That's what I'm trying to avoid. – MattR Dec 26 '19 at 16:02
  • @MattR Cool, cool. I posted another solution. No df3 needed. You can also try type hints, like `def some_distance_func(x: int, y: int) -> int: return x-y >= -1 and x-y <= 1` – Mark Moretto Dec 26 '19 at 16:45