
I have two dataframes like below

df1

   visit_counts     SG_lat     SG_long
0        3222.0  33.136230  -91.942026
1        6243.0  33.241981  -92.668384
2        5225.0  33.276830  -93.212498
3        6107.0  33.461784  -94.039191
4        3712.0  33.567683  -92.836858

df2

   num_transactions    lat_dgr    long_dgr
0             45433  35.293364  -93.716224
1             41172  35.293364  -93.716224
2             41909  35.293364  -93.716224
3             37979  35.293364  -93.716224
4             43546  35.293364  -93.716224

I want to inner join these dataframes where the geo distance between the two coordinates is less than 100 m, as in the pseudocode below:

## pseudo code
coords_1 = (df1.SG_lat, df1.SG_long)
coords_2 = (df2.lat_dgr, df2.long_dgr)
geopy.distance.vincenty(coords_1, coords_2).m < 100
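
For reference, geopy has deprecated vincenty in favor of geodesic, which is a drop-in replacement here. A minimal runnable version of this check for one pair of points, using the first row of each frame above:

from geopy.distance import geodesic

coords_1 = (33.13623, -91.942026)   # (lat, lon) from df1, row 0
coords_2 = (35.293364, -93.716224)  # (lat, lon) from df2, row 0
print(geodesic(coords_1, coords_2).m < 100)  # False: these points are hundreds of km apart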

In SQL we can do this with a WHERE condition, like below:

ST_DISTANCE(ST_GEOGPOINT(long_dgr,lat_dgr), ST_GEOGPOINT( sg_long,sg_lat)) <= 100

pandas' merge function doesn't allow a where condition. Is there any other way to join these two dataframes? I don't have any other key columns to join on and then filter with loc.

krishna koti
  • do you want to compute all possible distances, or just those of common row/index? – anon01 Jun 19 '20 at 19:36
  • @anon01 I want to compute all possible distances and then join values where the distance is less than 100, i.e. I am looking for an inner join where distance < 100 – krishna koti Jun 19 '20 at 20:10

1 Answer


If you wish to compute the distance for all row combinations across the two tables, you can: 1) create a cartesian product of the tables, 2) compute the distance for each row, 3) filter on your threshold. This is memory-inefficient, as it expands out all row combinations, but it is at least straightforward to compute:

import pandas as pd
from geopy.distance import geodesic

# create a dummy key to join all rows from df1 to df2:
df1["dummy_key"] = 0
df2["dummy_key"] = 0

# create cartesian product table
df3 = pd.merge(left=df1, right=df2, on="dummy_key").drop(columns=["dummy_key"])

# apply geodesic (the replacement for the deprecated geopy.distance.vincenty)
# to get the distance in meters for each row
dist = df3.apply(
    lambda row: geodesic((row["SG_lat"], row["SG_long"]),
                         (row["lat_dgr"], row["long_dgr"])).m,
    axis=1,
)

# filter for rows that you desire:
df3 = df3[dist < 100]
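
Side note: on pandas 1.2 and newer, merge supports how="cross", so the dummy key is no longer needed. A minimal equivalent sketch:

import pandas as pd

# pandas >= 1.2: build the cartesian product directly, no helper column
df3 = pd.merge(left=df1, right=df2, how="cross")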
anon01
  • this would compute the distance for rows with a common index, but actually I am combining based on distance in order to find the common rows. The tables have some other columns, and the latitude/longitude values are not sorted in any order. So I am looking for an inner join based on distance. – krishna koti Jun 19 '20 at 20:08
  • @krishnakoti makes sense, I'll update. How big are your tables, how much memory do you have? – anon01 Jun 19 '20 at 20:45
  • the tables are of medium size; both contain around 2k rows. – krishna koti Jun 19 '20 at 20:50
  • @krishnakoti revised answer above. Here is a similar question/solution that discusses some other approaches (one of them, a spatial-index radius query, is sketched after these comments): https://stackoverflow.com/questions/23508351/how-to-do-a-conditional-join-in-python-pandas – anon01 Jun 19 '20 at 21:25
  • FWIW, it looks like this has been recently discussed in conversations about the pandas API: https://github.com/pandas-dev/pandas/issues/7480 – anon01 Jun 19 '20 at 21:27
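
For larger tables, where the full cross product would not fit in memory, a radius query against a spatial index avoids materializing every pair. Below is a minimal sketch (an illustrative alternative, not code from the answer above) using scikit-learn's BallTree with the haversine metric; haversine assumes a spherical Earth, a negligible error at a 100 m tolerance:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; haversine distances are returned in radians

# build a tree on df2's coordinates (haversine expects [lat, lon] in radians)
tree = BallTree(np.radians(df2[["lat_dgr", "long_dgr"]].to_numpy()), metric="haversine")

# for each df1 row, find all df2 rows within 100 m (radius converted to radians)
matches = tree.query_radius(np.radians(df1[["SG_lat", "SG_long"]].to_numpy()),
                            r=100 / EARTH_RADIUS_M)

# matches[i] holds the integer positions of df2 rows within 100 m of df1 row i;
# expand the match lists into an inner-join result
pairs = [(i, j) for i, js in enumerate(matches) for j in js]
joined = pd.concat(
    [df1.iloc[[i for i, _ in pairs]].reset_index(drop=True),
     df2.iloc[[j for _, j in pairs]].reset_index(drop=True)],
    axis=1,
)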