0

I want calculate the distance from each building in df1 to each city in df2. The issue is that I have ~30 rows in df1 and ~30,000 rows in df2.

The desired output is set up in output_df.

How do I go about this? Is it possible using Geopy or would that take too long?

df1 = pd.DataFrame({'Building': ['One World Trade Center', 'Central Park Tower','Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
                 'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971], 
                 'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
                 })
df2 = pd.DataFrame({'City': ['Santa Barbra, CA', 'Washington, D.C.'],
                 'Latitude': [34.024212, 38.9072], 
                 'Longitude': [-118.496475, -77.0369],
                 })
output_df = pd.DataFrame({'City': ['Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Santa Barbra', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.', 'Washington D.C.'], 
                 'Building': ['One World Trade Center', 'Central Park Tower', 'Willis Tower', '111 West 57th Street', 'One Vanderbilt', 'One World Trade Center', 'Central Park Tower', 'Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
                 'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971, 40.713005, 40.765957, 41.878872, 40.764760, 40.752971], 
                 'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541, -74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
                 'Distance': ['dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_SB', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC', 'dis_to_DC']})

output_df.set_index(['City', 'Building'])
092374
  • 15
  • 1
  • 6
  • Do you need a formula for calculating the distance, or do you already have one, and just need to make it faster? – dan04 Jun 14 '22 at 16:47
  • If you can accept the haversine formula approximation (looks reasonable here), the [haversine](https://pypi.org/project/haversine/) module from PyPI can directly provide the cross distances between 2 numpy vectors. – Serge Ballesta Jun 14 '22 at 16:50
  • I don't already have a formula, I was trying to use the distance module from Geopy. I'll look into that PyPI module! – 092374 Jun 14 '22 at 16:56

3 Answers3

2

Using distance() from GeoPy:

from geopy import distance

pd.merge(df2.rename(columns={'Latitude':'C-Lat','Longitude':'C-Lon'}), df1, how='cross') \
    .assign(Distance=lambda r: \
        r.apply(lambda x: distance.distance((x['C-Lat'],x['C-Lon']),(x['Latitude'],x['Longitude'])).miles, axis=1)
           ).drop(columns=['C-Lat','C-Lon'])

               City                Building   Latitude  Longitude     Distance
0  Santa Barbra, CA  One World Trade Center  40.713005 -74.013190  2464.602573
1  Santa Barbra, CA      Central Park Tower  40.765957 -73.980844  2466.054087
2  Santa Barbra, CA            Willis Tower  41.878872 -87.635908  1759.257288
3  Santa Barbra, CA    111 West 57th Street  40.764760 -73.977581  2466.230348
4  Santa Barbra, CA          One Vanderbilt  40.752971 -73.978541  2466.233832
5  Washington, D.C.  One World Trade Center  40.713005 -74.013190   203.461336
6  Washington, D.C.      Central Park Tower  40.765957 -73.980844   207.017141
7  Washington, D.C.            Willis Tower  41.878872 -87.635908   595.065660
8  Washington, D.C.    111 West 57th Street  40.764760 -73.977581   207.103384
9  Washington, D.C.          One Vanderbilt  40.752971 -73.978541   206.571970
jch
  • 3,600
  • 1
  • 15
  • 17
  • Do you not need an API to just use the distance module? I've been using Nominatim which limits 1 request per second. – 092374 Jun 14 '22 at 20:38
  • @092374 Not an API. It is part of GeoPy. Install `conda install -c conda-forge geopy` or `pip install geopy` . Then `from geopy import distance` to use it. – jch Jun 14 '22 at 20:46
1

Hope this is what you are looking for:

from numpy import radians, cos, sin, sqrt from numpy import arcsin as asin

df1 = pd.DataFrame({'Building': ['One World Trade Center', 'Central Park Tower','Willis Tower', '111 West 57th Street', 'One Vanderbilt'],
                 'Latitude': [40.713005, 40.765957, 41.878872, 40.764760, 40.752971], 
                 'Longitude': [-74.013190, -73.980844, -87.635908, -73.977581, -73.978541],
                 })

df2 = pd.DataFrame({'City': ['Santa Barbra, CA', 'Washington, D.C.'],
                 'Latitude': [34.024212, 38.9072], 
                 'Longitude': [-118.496475, -77.0369],
                 })

Full join

df1['key'] = 1
df2['key'] = 1
df = pd.merge(df1, df2, on = 'key')

Distance between 2 cities

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    # Radius of earth in kilometers is 6371
    km = 6371* c
    return km

df['distance'] = df.apply(lambda x:haversine(x['Latitude_x'], x['Longitude_x'], x['Latitude_y'], x['Longitude_y']) * 0.90, axis = 1)

print(df)

The distance now is in km. The haversine function I took from this post.

Beta
  • 1,638
  • 5
  • 33
  • 67
  • 2
    Caution: the Euclidean distance formula that you have provided doesn't take into account the curvature of the earth. Haversine is better. – jch Jun 14 '22 at 17:28
  • I'm getting some weird outputs in the distance column. What unit of measurement is that outputting? For example, I'm getting 44.983359 for the distance between World Trade Center and Santa Barbra – 092374 Jun 14 '22 at 17:35
  • Ah, maybe that's why I'm getting those outputs @jch – 092374 Jun 14 '22 at 17:36
  • @092374 Right - what you are getting is a distance in units of latitude and longitude which is meaningless. – jch Jun 14 '22 at 17:40
  • Fun merge - I think this is now available as `how="cross"`? And as usual be wary that it creates a very big result. – creanion Jun 14 '22 at 18:25
  • @creanion: The required output table looks like `cross` join. – Beta Jun 14 '22 at 18:28
  • Something like `from numpy import radians, cos, sin, sqrt from numpy import arcsin as asin` and it runs – creanion Jun 14 '22 at 18:35
  • Slightly off using this method vs Geopy. For example World Trade Center to Washington, D.C.. When using Geopy I get 203.461361 miles or 327.439321 km. Using this formula I get 190.087442 or 305.916179 km. Is one more accurate than the other? – 092374 Jun 14 '22 at 20:37
  • Yeah... Another answer here uses almost the same logic. The advantage of this methods is one is not installing new packages. Also, if you convert miles into kms, the answer is the same. – Beta Jun 14 '22 at 20:54
1

Assume for simplicity that Earth is a perfect sphere with radius r.

Let θ = longitude (east of Greenwich, London) and φ = latitude (north of the equator). You can then transform the coordinates into rectangular coordinates as follows:

  • x = r cos(θ) cos(φ)
  • y = r sin(θ) cos(φ)
  • z = r sin(φ)

If you have two of these longitude/latitude vectors, then the dot product of the two is:

x1x2 + y1y2 + z1z2

= r2cos(θ1)cos(φ1)cos(θ2)cos(φ2) + r2sin(θ1)cos(φ1)sin(θ2)cos(φ2) + r2sin(φ1)sin(φ2)

= r2(cos(θ1)cos(φ1)cos(θ2)cos(φ2) + sin(θ1)cos(φ1)sin(θ2)cos(φ2) + sin(φ1)sin(φ2))

= r2(cos(φ1)cos(φ2)(cos(θ1)cos(θ2) + sin(θ1)sin(θ2)) + sin(φ1)sin(φ2))

= r2(cos(φ1)cos(φ2)cos(θ12) + sin(φ1)sin(φ2))

But the dot product is also equal to r2cos(α), where α is the angle between two vectors. Thus,

α = acos(cos(φ1)cos(φ2)cos(θ12) + sin(φ1)sin(φ2))

This gives an angle in radians. To convert to distance (along a great circle), we simply need to multiply by the radius of the earth. But because Earth isn't a perfect sphere, its radius isn't a constant, but varies between 6356.7523 km at the poles to 6378.1370 km at the equator.

For the "average" radius, I'll use the volumetric radius of 6371.0008 km, converted to the appropriate unit of measure.

# Choose one of the following:
EARTH_RADIUS = 6371.0008 # kilometers
EARTH_RADIUS = 3958.7564 # miles
EARTH_RADIUS = 3440.0652 # nautical miles

Putting it all together, here's a function for estimating the distance between two points:

import math

def geo_distance(lat1, lon1, lat2, lon2):
    # Convert all angles to radians
    lat1_r = math.radians(lat1)
    lon1_r = math.radians(lon1)
    lat2_r = math.radians(lat2)
    lon2_r = math.radians(lon2)
    # Calculate the distance
    dp = math.cos(lat1_r) * math.cos(lat2_r) * math.cos(lon1_r - lon2_r) + math.sin(lat1_r) * math.sin(lat2_r)
    angle = math.acos(dp)
    return EARTH_RADIUS * angle

And a test case for the distance between Santa Barbara, CA and Washington, DC (in miles).

>>> geo_distance(34.024212, -118.496475, 38.9072, -77.0369)
2308.3300743996306

If you need the calculation to be faster, you could pre-compute the rectangular (x, y, z) coordinates for each point, and compute the dot products based on that, so you don't need to call the trig functions as often.

dan04
  • 87,747
  • 23
  • 163
  • 198