
I am working on a university project where I have two pandas dataframes:

      # Libraries
      import pandas as pd
      from geopy import distance

      # Dataframes

      df1 = pd.DataFrame({'id': [1,2,3],                   
                          'lat':[-23.48, -22.94, -23.22],
                          'long':[-46.36, -45.40, -45.80]})

      df2 = pd.DataFrame({'id': [100,200,300],
                          'lat':[-28.48, -22.94, -23.22],
                          'long':[-46.36, -46.40, -45.80]})

I need to calculate the geographic distance (from latitude and longitude) between every coordinate in df1 and every coordinate in df2, so I used geopy. If the distance for any combination is less than a threshold of 100 meters, I must assign the value 1 to the 'nearby' column of df1. I wrote the following code:

      threshold = 100  # meters

      df1['nearby'] = 0

      for i in range(0, len(df1)):
          for j in range(0, len(df2)):

              coord_geo_1 = (df1['lat'].iloc[i], df1['long'].iloc[i])
              coord_geo_2 = (df2['lat'].iloc[j], df2['long'].iloc[j])

              var_distance = (distance.distance(coord_geo_1, coord_geo_2).km) * 1000 

              if var_distance < threshold:
                  df1['nearby'].iloc[i] = 1

Although a warning appears, the code works. However, I would like to find a way to avoid the for() loops. Is that possible?

       # Output:

       id   lat       long  nearby
        1   -23.48  -46.36    0
        2   -22.94  -45.40    0
        3   -23.22  -45.80    1
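
For reference, the warning mentioned above is pandas' SettingWithCopyWarning, triggered by the chained assignment `df1['nearby'].iloc[i] = 1`. A minimal tweak that avoids it (assuming the default integer index) is to assign through a single `.loc` call:

      # assign in one step through .loc instead of chaining ['nearby'].iloc[i]
      if var_distance < threshold:
          df1.loc[i, 'nearby'] = 1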

2 Answers


If you can use the library scikit-learn, the function haversine_distances calculates the distance between two sets of coordinates, so you get:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# distance threshold in meters, adjust as needed
threshold = 100  # meters

# mean Earth radius in meters, used to scale the haversine output
earth_radius = 6371000  # meters

df1['nearby'] = (
    # get the distance between all points of each DF
    haversine_distances(
        # note that you need to convert to radians with *np.pi/180
        X=df1[['lat','long']].to_numpy()*np.pi/180, 
        Y=df2[['lat','long']].to_numpy()*np.pi/180)
    # get the distance in meter
    *earth_radius
    # compare to your threshold
    < threshold
    # you want to check if any point from df2 is near df1
    ).any(axis=1).astype(int)

print(df1)

#    id    lat   long  nearby
# 0   1 -23.48 -46.36       0
# 1   2 -22.94 -45.40       0
# 2   3 -23.22 -45.80       1

EDIT: OP asked for a version with distance from geopy, so here is a way.

df1['nearby'] = (np.array(
    # pairwise geodesic distances in km between every row of df1 and df2
    [[distance.distance(coord1, coord2).km
      for coord2 in df2[['lat','long']].to_numpy()]
     for coord1 in df1[['lat','long']].to_numpy()]
    ) * 1000 < threshold  # convert km to meters and compare to the threshold
).any(axis=1).astype(int)
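
As a side check (my own addition, not part of the original answer), note that the two versions use slightly different Earth models: haversine_distances assumes a perfect sphere, while geopy's distance.distance computes a geodesic on the WGS-84 ellipsoid by default, so results can differ by a fraction of a percent near the 100 m cutoff:

# hypothetical pair of nearby points, just to compare the two distance models
p1 = (-23.22, -45.80)
p2 = (-23.221, -45.801)

geodesic_m = distance.distance(p1, p2).km * 1000
haversine_m = haversine_distances(
    np.array([p1]) * np.pi / 180,
    np.array([p2]) * np.pi / 180)[0, 0] * earth_radius

print(geodesic_m, haversine_m)  # close, but not identical
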
Ben.T
  • Is there a way to use this logic using only the geopy library? It is a requirement of the course to use this library. – valentim.kodak Feb 01 '22 at 16:13
  • I made the change you suggested and it worked perfectly. Thanks! – valentim.kodak Feb 01 '22 at 19:29
  • @valentim.kodak glad it works, see the edited post, I moved the `*1000` – Ben.T Feb 01 '22 at 19:52
  • The code you included in the answer showed an error. The correct names would be 'coord1 and coord2' instead of 'coord_geo_1 and coord_geo_2', right? – valentim.kodak Feb 01 '22 at 20:03
  • @valentim.kodak yes you are right, wrong copy-paste – Ben.T Feb 01 '22 at 20:04
  • @Ben.T thanks for this elegant solution! Could you please advise how to add those rows from df2 to df1 where `distance > threshold`? – Julia Koncha May 24 '22 at 20:29
  • @JuliaKoncha IIUC, you can probably do something like `pd.concat([df1, df2.loc[(haversine_distances(...)*earth_radius > threshold).all(axis=0)]])` where the `...` are like in the answer above – Ben.T May 26 '22 at 13:53
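
To make that comment concrete, here is a hedged sketch (my own expansion, assuming the df1, df2, threshold and earth_radius defined above) of appending the far-away rows of df2 to df1:

# pairwise haversine distances in meters, shape (len(df1), len(df2))
dist_m = haversine_distances(
    df1[['lat', 'long']].to_numpy() * np.pi / 180,
    df2[['lat', 'long']].to_numpy() * np.pi / 180) * earth_radius

# rows of df2 that are farther than the threshold from every row of df1
far_rows = df2.loc[(dist_m > threshold).all(axis=0)]

# stack them under df1 (columns present in only one frame become NaN)
combined = pd.concat([df1, far_rows], ignore_index=True)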

You can cross-merge the two dfs to get the distance between each id in df1 and each id in df2:

dfm = pd.merge(df1, df2, how = 'cross', suffixes = ['','_2'])
dfm['dist'] = dfm.apply(lambda r: distance.distance((r['lat'],r['long']),(r['lat_2'],r['long_2'])).km * 1000 , axis=1)

dfm looks like this:

      id     lat    long    id_2    lat_2    long_2      dist
--  ----  ------  ------  ------  -------  --------  --------
 0     1  -23.48  -46.36     100   -28.48    -46.36  553941
 1     1  -23.48  -46.36     200   -22.94    -46.4    59943.4
 2     1  -23.48  -46.36     300   -23.22    -45.8    64095.5
 3     2  -22.94  -45.4      100   -28.48    -46.36  621251
 4     2  -22.94  -45.4      200   -22.94    -46.4   102568
 5     2  -22.94  -45.4      300   -23.22    -45.8    51393.4
 6     3  -23.22  -45.8      100   -28.48    -46.36  585430
 7     3  -23.22  -45.8      200   -22.94    -46.4    68854.7
 8     3  -23.22  -45.8      300   -23.22    -45.8        0

You can test the 'dist' column against the threshold directly, but if the requirement is to aggregate by id from df1, then you can do, for example:

res = df1.merge(dfm.groupby('id').apply(lambda g:any(g['dist'] < threshold)*1).rename('nearby'), on = 'id')

res now looks like this:

      id     lat    long    nearby
--  ----  ------  ------  --------
 0     1  -23.48  -46.36         0
 1     2  -22.94  -45.4          0
 2     3  -23.22  -45.8          1
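
As a side note, the same aggregation can be written without apply by taking the per-id minimum distance (my own variant, not from the original answer):

# 1 where the closest point of df2 is within the threshold, 0 otherwise
nearby = (dfm.groupby('id')['dist'].min() < threshold).astype(int).rename('nearby')
res = df1.merge(nearby, on='id')
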
piterbarg
  • The code works perfectly! The idea of storing the distances between coordinate combinations is great. I would like to know if it would be possible to keep in dfm only the combination of 'ids' with the smallest distance? – valentim.kodak Feb 01 '22 at 13:41
  • @valentim.kodak the smallest among all pairs is `dfm.nsmallest(1,'dist')`. If the smallest by `id` then `dfm.groupby('id').apply(lambda g: g.nsmallest(1,'dist'))` – piterbarg Feb 01 '22 at 13:48
  • When I run on the full dataset the following error appears: KeyError: 'cross' on line: dfm = pd.merge(df1, df2, how = 'cross', suffixes = ['','_2']) – valentim.kodak Feb 01 '22 at 14:28
  • @valentim.kodak so can you confirm that it runs on your example dataframes but not on your full dataframes? (this is pretty weird) – piterbarg Feb 01 '22 at 15:11
  • In the example dataframes it works perfectly, but in the original dataframe the error appears: KeyError: 'cross' – valentim.kodak Feb 01 '22 at 15:52
  • could you do `df.head(10).to_dict()` for the original dataframes and add output to your question. – piterbarg Feb 01 '22 at 16:25
  • Yes, I will test with a part of the original dataframe. Thanks! – valentim.kodak Feb 01 '22 at 19:56
  • I did a new run. I used 'df1' with size 10 and 'df2' with size 3. I'm running the original version inside Databricks. The following error appears: MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False on line dfm = pd.merge(df1, df2, how = 'cross', suffixes = ['','_2']) – valentim.kodak Feb 01 '22 at 20:14
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241628/discussion-between-valentim-kodak-and-piterbarg). – valentim.kodak Feb 01 '22 at 21:01
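
For context on the KeyError in the comments above: how='cross' was only added in pandas 1.2, so an older pandas (as some managed runtimes ship) raises exactly that error. A workaround sketch, assuming the same df1 and df2, is to emulate the cross join with a temporary constant key:

# emulate how='cross' on pandas < 1.2 by merging on a constant key column
dfm = (df1.assign(_key=1)
          .merge(df2.assign(_key=1), on='_key', suffixes=['', '_2'])
          .drop(columns='_key'))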