I have a problem similar to the one in "How to get the distance between two geographic coordinates of two different dataframes?". I have two dataframes:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})

df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})

My question is: using the solution suggested by Ben.T there, how can I add rows from df2 to df1 if a point from df2 is not near any point in df1? I think it could be based on this matrix of distances:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# distance threshold in meters; change as needed
threshold = 100  # meters

# mean Earth radius
earth_radius = 6371000  # meters

distance_matrix = (
    # get the distance between all points of each DF
    haversine_distances(
        # note that you need to convert degrees to radians with *np.pi/180
        X=df1[['lat','long']].to_numpy()*np.pi/180,
        Y=df2[['lat','long']].to_numpy()*np.pi/180)
    # get the distance in meters
    *earth_radius
    # compare to your threshold
    < threshold
    # **here I want to add rows from df2 to df1 if point from df2 is NOT near df1**
    )

For example, the desired output would look like this:

   id   lat       long  
    1   -23.48  -46.36    
    2   -22.94  -45.40    
    3   -23.22  -45.80    
    4   -28.48  -46.36
    5   -22.94  -46.40
Julia Koncha
  • Your distance matrix is a `(len(df1) x len(df2))` matrix of True/False values indicating whether each pair of points is closer than the threshold. What do you want to do with this information? What should we do if two points are close? What if none are? Can you show us what you hope the resulting dataframe will look like? – Michael Delgado May 24 '22 at 21:36
  • @MichaelDelgado I updated the question with possible output. So I want to take all points from df2 which are not near df1 and add those rows to df1. If points are close I just ignore that one from df2 and only keep the point in df1 – Julia Koncha May 24 '22 at 21:44
  • oh! you want to just filter df2 to exclude points which are close to *any* point in df1, and then append the filtered data to df1 – Michael Delgado May 24 '22 at 21:46

1 Answer

The distance matrix gives you a (len(df1), len(df2)) boolean array, where entry (i, j) is True when point i of df1 is within the threshold of point j of df2, i.e. the two points are "close". You can find whether any point in df1 is close enough to each point in df2 by reducing the matrix with any along axis 0:

In [33]: df2_has_close_point_in_df1 = distance_matrix.any(axis=0)

In [34]: df2_has_close_point_in_df1
Out[34]: array([False, False,  True])
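
To make the choice of axis concrete, here is a minimal sketch on a hand-written boolean matrix. The values are made up for illustration and the names is_close, df2_has_close and df1_has_close are hypothetical; only the layout (rows are df1 points, columns are df2 points) matches the matrix above.

```python
import numpy as np

# Hypothetical "is close" matrix: 2 rows (df1 points) x 3 columns (df2 points).
is_close = np.array([
    [False, True, False],
    [False, True, True],
])

# Reducing along axis 0 collapses the rows: one value per df2 point (column).
df2_has_close = is_close.any(axis=0)

# Reducing along axis 1 collapses the columns: one value per df1 point (row).
df1_has_close = is_close.any(axis=1)

print(df2_has_close.shape, df1_has_close.shape)
```

Because the mask is used to filter df2, it needs one entry per column, which is why axis=0 is the one to use here.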

You can then use this as a mask to filter df2. Use the bitwise negation operator ~ to invert the True/False values, so that you keep only the rows in df2 which are not close to any point in df1:

In [35]: df2.iloc[~df2_has_close_point_in_df1]
Out[35]:
    id    lat   long
0  100 -28.48 -46.36
1  200 -22.94 -46.40

This can now be joined with df1 to get a combined dataset:

In [36]: combined = pd.concat([df1, df2.iloc[~df2_has_close_point_in_df1]], axis=0)

In [37]: combined
Out[37]:
    id    lat   long
0    1 -23.48 -46.36
1    2 -22.94 -45.40
2    3 -23.22 -45.80
0  100 -28.48 -46.36
1  200 -22.94 -46.40
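
For reference, the steps above could be wrapped into one helper, sketched below under a few assumptions. The function names haversine_matrix_m and append_far_points are hypothetical, and the haversine formula is written out in plain NumPy instead of calling sklearn's haversine_distances, only so that the sketch has no scikit-learn dependency. It also passes ignore_index=True to pd.concat, which is a choice not made above: it replaces the repeated 0, 1 index labels with a fresh 0..n-1 index.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters, as in the question


def haversine_matrix_m(a_deg, b_deg):
    """Pairwise great-circle distances in meters.

    a_deg has shape (n, 2) and b_deg has shape (m, 2), both holding
    [lat, long] in degrees. The result has shape (n, m).
    """
    a = np.radians(a_deg)[:, None, :]  # shape (n, 1, 2)
    b = np.radians(b_deg)[None, :, :]  # shape (1, m, 2)
    dlat = b[..., 0] - a[..., 0]
    dlon = b[..., 1] - a[..., 1]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(a[..., 0]) * np.cos(b[..., 0]) * np.sin(dlon / 2) ** 2)
    # Clip to guard against rounding slightly outside [0, 1] before arcsin.
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


def append_far_points(df1, df2, threshold_m=100):
    """Append to df1 the rows of df2 that are not within threshold_m of any df1 point."""
    dist = haversine_matrix_m(df1[['lat', 'long']].to_numpy(),
                              df2[['lat', 'long']].to_numpy())
    near_any = (dist < threshold_m).any(axis=0)  # one flag per df2 row
    return pd.concat([df1, df2[~near_any]], ignore_index=True)


df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})

combined = append_far_points(df1, df2)
print(combined)
```

Note that the appended rows keep their original id values from df2 (100 and 200); if consecutive ids are wanted, the id column would have to be reassigned after concatenation.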
Michael Delgado