9

I got a dataframe that contains places with their latitude and longitude. Imagine for example cities.

df = pd.DataFrame([{'city':"Berlin", 'lat':52.5243700, 'lng':13.4105300},
                   {'city':"Potsdam", 'lat':52.3988600, 'lng':13.0656600},
                   {'city':"Hamburg", 'lat':53.5753200, 'lng':10.0153400}]);

Now I'm trying to get all cities in a radius around another. Let's say all cities in a distance of 500km from Berlin, 500km from Hamburg and so on. I would do this by duplicating the original dataframe and joining both with a distance-function.

The intermediate result would be somewhat like this:

Berlin --> Potsdam
Berlin --> Hamburg
Potsdam --> Berlin
Potsdam --> Hamburg
Hamburg --> Potsdam
Hamburg --> Berlin

This final result after grouping (reducing) should be like this. Remark: Would be cool if the list of values includes all columns of the city.

Berlin --> [Potsdam, Hamburg]
Potsdam --> [Berlin, Hamburg]
Hamburg --> [Berlin, Potsdam]

Or just the count of cities 500km around one city.

Berlin --> 2
Potsdam --> 2
Hamburg --> 2

Since I'm quite new to Python, I would appreciate any starting point. I'm familiar with haversine distance. But not sure if there are useful distance/spatial methods in Scipy or Pandas.

Glad if you can give me a starting point. Up to now I tried following this post.

Update: The original idea behind this question comes from the Two Sigma Connect Rental Listing Kaggle Competition. The idea is to get those listing 100m around another listing. Which a) indicates a density and therefore a popular area and b) if the addresses are compares, you can find out if there is a crossing and therefore a noisy area. Therefore you not need the full item to item relation since you need to compare not only the distance but also the address and other meta-data. PS: I'm not uploading a solution to Kaggle. I just want to learn.

Community
  • 1
  • 1
Matthias
  • 5,574
  • 8
  • 61
  • 121
  • 1
    Haversine doesn't really belong in `scipy` or `pandas`. Define a function in the code. This one seems like it should work: http://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points – Denziloe Mar 18 '17 at 18:06
  • "I'm familiar with haversine distance" -- no such animal; haversine(x) is a trig function like sine(x), cosine(x) etc. There are several formulas for calculation of a great-circle distance between 2 points; three of these formulas use the haversine(x) function explicitly or implicitly. – John Machin Mar 19 '17 at 22:02

2 Answers2

12

You can use:

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):

    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

First need cross join with merge, remove row with same values in city_x and city_y by boolean indexing:

df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566

Then apply haversine function:

df['dist'] = df.apply(lambda row: haversine(row['lng_x'], 
                                            row['lat_x'], 
                                            row['lng_y'], 
                                            row['lat_y']), axis=1)

Filter distance:

df = df[df.dist < 500]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.215704
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.223782
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.215704
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.464120
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.223782
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.464120

And last create list or get size with groupby:

df1 = df.groupby('city_x')['city_y'].apply(list)
print (df1)
city_x
Berlin     [Potsdam, Hamburg]
Hamburg     [Berlin, Potsdam]
Potsdam     [Berlin, Hamburg]
Name: city_y, dtype: object

df2 = df.groupby('city_x')['city_y'].size()
print (df2)
city_x
Berlin     2
Hamburg    2
Potsdam    2
dtype: int64

Also is possible use numpy haversine solution:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
#print (df)

df['dist'] = haversine_np(df['lng_x'],df['lat_x'],df['lng_y'],df['lat_y'])
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.198616
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.063541
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.198616
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.311890
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.063541
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.311890
Ryan R
  • 8,342
  • 15
  • 84
  • 111
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Nice starting point. But imagine the dataframe contains millions of locations. Is there a way to apply the 500km-haversine during the join? So that not each item is joined with every other item but only with those who are in the desired range. – Matthias Mar 18 '17 at 18:12
  • I think it is problem, because dont know distance befory apply function haversine. :( – jezrael Mar 18 '17 at 18:15
  • where did you get the radians variable from? inside your haversine function? ah, found it in math.radians – Matthias Mar 18 '17 at 18:58
  • it is `math` function, need `from math import radians, cos, sin, asin, sqrt ` – jezrael Mar 18 '17 at 19:02
  • I realized that there is a conditional join in PySpark but unfortunately not in Pandas. I will try that on Monday as soon as I have access to a Spark cluster. Let you know then! – Matthias Mar 19 '17 at 12:17
  • I ended up using the scipy.spatial.KDTree into which I added all longitude/latitude converted into 3D carthesian coordinates. This was then a lookup-index to determine those points of interest in a close distance to the current one. And it avoids a full join on a dataset of 49.000 rows. But your answer was correct according to my question! – Matthias Mar 26 '17 at 15:50
  • 1
    I ran both options `haversine` vs `haversine_np` on my df of around 6k rows and found that `haversine_np` was 5x faster, ~41 ms. – Ryan R Jan 17 '19 at 05:53
2

UPDATE: I would suggest first to buiuld a distance DataFrame:

from scipy.spatial.distance import squareform, pdist
from itertools import combinations

# see definition of "haversine_np()" below     
x = pd.DataFrame({'dist':pdist(df[['lat','lng']], haversine_np)},
                 index=pd.MultiIndex.from_tuples(tuple(combinations(df['city'], 2))))

efficiently produces pairwise distance DF (without duplicates):

In [106]: x
Out[106]:
                       dist
Berlin  Potsdam   27.198616
        Hamburg  255.063541
Potsdam Hamburg  242.311890

Old answer:

Here is a bit optimized version, which uses scipy.spatial.distance.pdist method:

from scipy.spatial.distance import squareform, pdist

# slightly modified version: of http://stackoverflow.com/a/29546836/2901002
def haversine_np(p1, p2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1],
                                         p2[0], p2[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

x = pd.DataFrame(squareform(pdist(df[['lat','lng']], haversine_np)),
                 columns=df.city.unique(),
                 index=df.city.unique())

this gives us:

In [78]: x
Out[78]:
             Berlin     Potsdam     Hamburg
Berlin     0.000000   27.198616  255.063541
Potsdam   27.198616    0.000000  242.311890
Hamburg  255.063541  242.311890    0.000000

let's count number of cities where the distance is greater than 30:

In [81]: x.groupby(level=0, as_index=False) \
    ...:  .apply(lambda c: c[c>30].notnull().sum(1)) \
    ...:  .reset_index(level=0, drop=True)
Out[81]:
Berlin     1
Hamburg    2
Potsdam    1
dtype: int64
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • But if the original data contains millions of lines then this wouldn't work since the number of columns would be exceeded, right? – Matthias Mar 19 '17 at 10:38
  • @Matthias, well if you have millions of __unique__ cities, then it might be slow... But `pdist` - is optimal in terms of calculating pairwise distance. I'd suggest you to try this approach... – MaxU - stand with Ukraine Mar 19 '17 at 10:40
  • Very nice solution, but does not seem to work on the original data. I updated the post above, so that you could try it on the data from the [Kaggle competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries). It contains rental listings based on longitude and latitude and the combinations would be on the df.index. – Matthias Mar 19 '17 at 12:16