10

I am a location in a dataframe, underneath lat lon column names. I want to show how far that is from the lat lon of the nearest train station in a separate dataframe.

So for example, I have a lat lon of (37.814563 144.970267), and i have a list as below of other geospatial points. I want to find the point that is closest and then find the distance between those points, as an extra column in the dataframe in suburbs.

This is the example of the train dataset

<bound method NDFrame.to_clipboard of   STOP_ID                                          STOP_NAME   LATITUDE  \
0   19970             Royal Park Railway Station (Parkville) -37.781193   
1   19971  Flemington Bridge Railway Station (North Melbo... -37.788140   
2   19972         Macaulay Railway Station (North Melbourne) -37.794267   
3   19973   North Melbourne Railway Station (West Melbourne) -37.807419   
4   19974        Clifton Hill Railway Station (Clifton Hill) -37.788657   

    LONGITUDE TICKETZONE                                          ROUTEUSSP  \
0  144.952301          1                                            Upfield   
1  144.939323          1                                            Upfield   
2  144.936166          1                                            Upfield   
3  144.942570          1  Flemington,Sunbury,Upfield,Werribee,Williamsto...   
4  144.995417          1                                 Mernda,Hurstbridge   

                      geometry  
0  POINT (144.95230 -37.78119)  
1  POINT (144.93932 -37.78814)  
2  POINT (144.93617 -37.79427)  
3  POINT (144.94257 -37.80742)  
4  POINT (144.99542 -37.78866)  >

and this is an example of the suburbs

<bound method NDFrame.to_clipboard of       postcode              suburb state        lat         lon
4901      3000           MELBOURNE   VIC -37.814563  144.970267
4902      3002      EAST MELBOURNE   VIC -37.816640  144.987811
4903      3003      WEST MELBOURNE   VIC -37.806255  144.941123
4904      3005  WORLD TRADE CENTRE   VIC -37.822262  144.954856
4905      3006           SOUTHBANK   VIC -37.823258  144.965926>

Which I am trying to show, is the distance from the lat lon to the closet train station in a new column for the suburb list.

Using a solution get a weird output, wondering if it's correct?

With both solutions shown,

from sklearn.neighbors import NearestNeighbors
from haversine import haversine

NN = NearestNeighbors(n_neighbors=1, metric='haversine')
NN.fit(trains_shape[['LATITUDE', 'LONGITUDE']])

indices = NN.kneighbors(df_complete[['lat', 'lon']])[1]
indices = [index[0] for index in indices]
distances = NN.kneighbors(df_complete[['lat', 'lon']])[0]
df_complete['closest_station'] = trains_shape.iloc[indices]['STOP_NAME'].reset_index(drop=True)
df_complete['closest_station_distances'] = distances
print(df_complete)

The output here,

<bound method NDFrame.to_clipboard of    postcode        suburb state        lat         lon  Venues Cluster  \
1      3040    aberfeldie   VIC -37.756690  144.896259             4.0   
2      3042  airport west   VIC -37.711698  144.887037             1.0   
4      3206   albert park   VIC -37.840705  144.955710             0.0   
5      3020        albion   VIC -37.775954  144.819395             2.0   
6      3078    alphington   VIC -37.780767  145.031160             4.0   

                     #1                    #2             #3  \
1                  Café     Electronics Store  Grocery Store   
2  Fast Food Restaurant                  Café    Supermarket   
4                  Café                   Pub    Coffee Shop   
5                  Café  Fast Food Restaurant  Grocery Store   
6                  Café                  Park            Bar   

                      #4  ...                             #6  \
1            Coffee Shop  ...                         Bakery   
2          Grocery Store  ...             Italian Restaurant   
4         Breakfast Spot  ...                   Burger Joint   
5  Vietnamese Restaurant  ...                            Pub   
6            Pizza Place  ...  Vegetarian / Vegan Restaurant   

                      #7                   #8                         #9  \
1          Shopping Mall  Japanese Restaurant          Indian Restaurant   
2  Portuguese Restaurant    Electronics Store  Middle Eastern Restaurant   
4                    Bar               Bakery                  Gastropub   
5     Chinese Restaurant                  Gym                     Bakery   
6     Italian Restaurant            Gastropub                     Bakery   

                 #10 Ancestry Cluster  ClosestStopId  \
1   Greek Restaurant              8.0          20037   
2  Convenience Store              5.0          20032   
4              Beach              6.0          22180   
5  Convenience Store              5.0          20004   
6        Coffee Shop              5.0          19931   

                                   ClosestStopName  \
1              Essendon Railway Station (Essendon)   
2                Glenroy Railway Station (Glenroy)   
4  Southern Cross Railway Station (Melbourne City)   
5          Albion Railway Station (Sunshine North)   
6          Alphington Railway Station (Alphington)   

                                   closest_station closest_station_distances  
1                Glenroy Railway Station (Glenroy)                  0.019918  
2  Southern Cross Railway Station (Melbourne City)                  0.031020  
4          Alphington Railway Station (Alphington)                  0.023165  
5                  Altona Railway Station (Altona)                  0.005559  
6                Newport Railway Station (Newport)                  0.002375  

And the second function.

def ClosestStop(r):
    # Cartesin Distance: square root of (x2-x2)^2 + (y2-y1)^2
    distances = ((r['lat']-StationDf['LATITUDE'])**2 + (r['lon']-StationDf['LONGITUDE'])**2)**0.5
    
    # Stop with minimum Distance from the Suburb
    closestStationId = distances[distances == distances.min()].index.to_list()[0]
    return StationDf.loc[closestStationId, ['STOP_ID', 'STOP_NAME']]

df_complete[['ClosestStopId', 'ClosestStopName']] = df_complete.apply(ClosestStop, axis=1)

This is giving different answers oddly enough, and leads me to think that there is an issue with this code. the KM's seem wrong as well.

Completely unsure how to approach this problem - would love some guidance here, thanks!

LeCoda
  • 538
  • 7
  • 36
  • 79
  • 1
    You need 1. a function `distance(lat1, lon1, lat2, lon2)`, 2. apply to each combination of suburb and station, 3. take station with shortest distance per suburb and add to data frame. (Or use a NearestNeighbor classifier from sklearn) – Niklas Mertsch Aug 09 '20 at 11:53
  • See the answer here https://stackoverflow.com/q/365826/6692898 – RichieV Aug 09 '20 at 12:06
  • 1
    With the first solution, you used 'haversine' as distance function in the NN which is the built in haversine distance in sklearn, which expressed in radius. You can see the link to the doc in my answer. To have an haversine distance expressed in km, use the imported haversine package as distance in the NN. This is also expressed in my answer. – SoufianeK Aug 16 '20 at 17:40
  • Can you share the number of cities & stations you want to calculate the distance too? I did not yet the scalable BallTree algorithm example here, which is what you need when the numbers scale up. – Willem Hendriks Aug 17 '20 at 14:47

4 Answers4

7

A few key concepts

  1. do a Cartesian product between two data frames to get all combinations (joining on identical value between two data frames is approach to this foo=1)
  2. once both sets of data is together, have both sets of lat/lon to calculate distance) geopy has been used for this
  3. cleanup the columns, use sort_values() to find smallest distance
  4. finally a groupby() and agg() to get first values for shortest distance

There are two data frames for use

  1. dfdist contains all the combinations and distances
  2. dfnearest which contains result
dfstat = pd.DataFrame({'STOP_ID': ['19970', '19971', '19972', '19973', '19974'],
 'STOP_NAME': ['Royal Park Railway Station (Parkville)',
  'Flemington Bridge Railway Station (North Melbo...',
  'Macaulay Railway Station (North Melbourne)',
  'North Melbourne Railway Station (West Melbourne)',
  'Clifton Hill Railway Station (Clifton Hill)'],
 'LATITUDE': ['-37.781193',
  '-37.788140',
  '-37.794267',
  '-37.807419',
  '-37.788657'],
 'LONGITUDE': ['144.952301',
  '144.939323',
  '144.936166',
  '144.942570',
  '144.995417'],
 'TICKETZONE': ['1', '1', '1', '1', '1'],
 'ROUTEUSSP': ['Upfield',
  'Upfield',
  'Upfield',
  'Flemington,Sunbury,Upfield,Werribee,Williamsto...',
  'Mernda,Hurstbridge'],
 'geometry': ['POINT (144.95230 -37.78119)',
  'POINT (144.93932 -37.78814)',
  'POINT (144.93617 -37.79427)',
  'POINT (144.94257 -37.80742)',
  'POINT (144.99542 -37.78866)']})
dfsub = pd.DataFrame({'id': ['4901', '4902', '4903', '4904', '4905'],
 'postcode': ['3000', '3002', '3003', '3005', '3006'],
 'suburb': ['MELBOURNE',
  'EAST MELBOURNE',
  'WEST MELBOURNE',
  'WORLD TRADE CENTRE',
  'SOUTHBANK'],
 'state': ['VIC', 'VIC', 'VIC', 'VIC', 'VIC'],
 'lat': ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258'],
 'lon': ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']})

import geopy.distance
# cartesian product so we get all combinations
dfdist = (dfsub.assign(foo=1).merge(dfstat.assign(foo=1), on="foo")
    # calc distance in km between each suburb and each train station
     .assign(km=lambda dfa: dfa.apply(lambda r: 
                                      geopy.distance.geodesic(
                                          (r["LATITUDE"],r["LONGITUDE"]), 
                                          (r["lat"],r["lon"])).km, axis=1))
    # reduce number of columns to make it more digestable
     .loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
    # sort so shortest distance station from a suburb is first
     .sort_values(["postcode","suburb","km"])
    # good practice
     .reset_index(drop=True)
)
# finally pick out stations nearest to suburb
# this can easily be joined back to source data frames as postcode and STOP_ID have been maintained
dfnearest = dfdist.groupby(["postcode","suburb"])\
    .agg({"STOP_ID":"first","STOP_NAME":"first","km":"first"}).reset_index()

print(dfnearest.to_string(index=False))
dfnearest

output

postcode              suburb STOP_ID                                         STOP_NAME        km
    3000           MELBOURNE   19973  North Melbourne Railway Station (West Melbourne)  2.564586
    3002      EAST MELBOURNE   19974       Clifton Hill Railway Station (Clifton Hill)  3.177320
    3003      WEST MELBOURNE   19973  North Melbourne Railway Station (West Melbourne)  0.181463
    3005  WORLD TRADE CENTRE   19973  North Melbourne Railway Station (West Melbourne)  1.970909
    3006           SOUTHBANK   19973  North Melbourne Railway Station (West Melbourne)  2.705553

an approach to reducing size of tested combinations

# pick nearer places,  based on lon/lat then all combinations
dfdist = (dfsub.assign(foo=1, latr=dfsub["lat"].round(1), lonr=dfsub["lon"].round(1))
          .merge(dfstat.assign(foo=1, latr=dfstat["LATITUDE"].round(1), lonr=dfstat["LONGITUDE"].round(1)), 
                 on=["foo","latr","lonr"])
    # calc distance in km between each suburb and each train station
     .assign(km=lambda dfa: dfa.apply(lambda r: 
                                      geopy.distance.geodesic(
                                          (r["LATITUDE"],r["LONGITUDE"]), 
                                          (r["lat"],r["lon"])).km, axis=1))
    # reduce number of columns to make it more digestable
     .loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
    # sort so shortest distance station from a suburb is first
     .sort_values(["postcode","suburb","km"])
    # good practice
     .reset_index(drop=True)
)
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • Hi man, this is great, except that when I use it to process properties instead it takes up all my memory :P Is there a more efficient way, or a way to batch process this? – LeCoda Aug 17 '20 at 14:27
  • It will be the pure Cartesian product that's causing the issue if you have large data set.... do you have addresses and stations across multiple cities? If so I'd suggest adding the city to join key when `dfdest` is generated. i.e. don't generate irrelevant combinations... – Rob Raymond Aug 17 '20 at 14:47
  • ^^ Good point. It's only for one city, and specifically for addresses distance to train stations (and buses etc). I'm thinking batching or something? – LeCoda Aug 17 '20 at 14:50
  • 1
    just added to answer as an idea. I would expect nearer locations to have same rounded Lon/lat – Rob Raymond Aug 17 '20 at 15:25
6

Try this

import pandas as pd
def ClosestStop(r):
    # Cartesin Distance: square root of (x2-x2)^2 + (y2-y1)^2
    distances = ((r['lat']-StationDf['LATITUDE'])**2 + (r['lon']-StationDf['LONGITUDE'])**2)**0.5
    
    # Stop with minimum Distance from the Suburb
    closestStationId = distances[distances == distances.min()].index.to_list()[0]
    return StationDf.loc[closestStationId, ['STOP_ID', 'STOP_NAME']]

StationDf = pd.read_excel("StationData.xlsx")
SuburbDf = pd.read_excel("SuburbData.xlsx")

SuburbDf[['ClosestStopId', 'ClosestStopName']] = SuburbDf.apply(ClosestStop, axis=1)
print(SuburbDf)
Kuldip Chaudhari
  • 1,112
  • 4
  • 8
  • 1
    Cartesian distance is not suitable for distances with GPS coordinates. See haversine distance https://en.m.wikipedia.org/wiki/Haversine_formula – SoufianeK Aug 13 '20 at 13:05
  • @SoufianeK Yes Cartesian distances are not suitable when you are working with lat-long variations in several degrees i.e. gobal distances. But here the objective was to get the closest railway station to suburbs which hardly covers a region of a degree(lat-long). Also, here the magnitude and unit of distance is not important, only how the distances compare.So, Cartesian distances are good enough for the purpose.Thanks for sharing link, I work on GIS mapping, it will be helpful. – Kuldip Chaudhari Aug 13 '20 at 13:54
5

You can use sklearn.neighbors.NearestNeighbors with a haversine distance.

import pandas as pd
dfstat = pd.DataFrame({'STOP_ID': ['19970', '19971', '19972', '19973', '19974'],
                       'STOP_NAME': ['Royal Park Railway Station (Parkville)',  'Flemington Bridge Railway Station (North Melbo...',  'Macaulay Railway Station (North Melbourne)',  'North Melbourne Railway Station (West Melbourne)',  'Clifton Hill Railway Station (Clifton Hill)'],
                       'LATITUDE': ['-37.781193', '-37.788140',  '-37.794267',  '-37.807419',  '-37.788657'],
                       'LONGITUDE': ['144.952301', '144.939323', '144.936166',  '144.942570',  '144.995417'],
                       'TICKETZONE': ['1', '1', '1', '1', '1'], 
                       'ROUTEUSSP': ['Upfield',  'Upfield',  'Upfield',  'Flemington,Sunbury,Upfield,Werribee,Williamsto...',  'Mernda,Hurstbridge'],
                       'geometry': ['POINT (144.95230 -37.78119)',  'POINT (144.93932 -37.78814)',  'POINT (144.93617 -37.79427)',  'POINT (144.94257 -37.80742)',  'POINT (144.99542 -37.78866)']})
dfsub = pd.DataFrame({'id': ['4901', '4902', '4903', '4904', '4905'],
                      'postcode': ['3000', '3002', '3003', '3005', '3006'],
                      'suburb': ['MELBOURNE',  'EAST MELBOURNE',  'WEST MELBOURNE',  'WORLD TRADE CENTRE',  'SOUTHBANK'],
                      'state': ['VIC', 'VIC', 'VIC', 'VIC', 'VIC'],
                      'lat': ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258'],
                      'lon': ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']})

Let's begin by finding the closest point in a dataframe to some random point, say -37.814563, 144.970267.

NN = NearestNeighbors(n_neighbors=1, metric='haversine')
NN.fit(dfstat[['LATITUDE', 'LONGITUDE']])
NN.kneighbors([[-37.814563, 144.970267]])

The output is (array([[2.55952637]]), array([[3]])), the distance and the index of the closest point in the dataframe. The haversine distance in sklearn is in radius. If you want to compute is in km, you can use haversine.

from haversine import haversine
NN = NearestNeighbors(n_neighbors=1, metric=haversine)
NN.fit(dfstat[['LATITUDE', 'LONGITUDE']])
NN.kneighbors([[-37.814563, 144.970267]])

The output (array([[2.55952637]]), array([[3]])) has the distance in km.

Now you can apply to all points in the dataframe and get closest stations with indices.

indices = NN.kneighbors(dfsub[['lat', 'lon']])[1]
indices = [index[0] for index in indices]
distances = NN.kneighbors(dfsub[['lat', 'lon']])[0]
dfsub['closest_station'] = dfstat.iloc[indices]['STOP_NAME'].reset_index(drop=True)
dfsub['closest_station_distances'] = distances
print(dfsub)
id  postcode    suburb  state   lat lon closest_station closest_station_distances
0   4901    3000    MELBOURNE   VIC -37.814563  144.970267  North Melbourne Railway Station (West Melbourne)    2.559526
1   4902    3002    EAST MELBOURNE  VIC -37.816640  144.987811  Clifton Hill Railway Station (Clifton Hill) 3.182521
2   4903    3003    WEST MELBOURNE  VIC -37.806255  144.941123  North Melbourne Railway Station (West Melbourne)    0.181419
3   4904    3005    WORLD TRADE CENTRE  VIC -37.822262  144.954856  North Melbourne Railway Station (West Melbourne)    1.972010
4   4905    3006    SOUTHBANK   VIC -37.823258  144.965926  North Melbourne Railway Station (West Melbourne)    2.703926
SoufianeK
  • 300
  • 1
  • 6
  • I just added a distance column, some imports, and correct a missing bracket. – SoufianeK Aug 12 '20 at 08:35
  • I copied code, is it working correctly in my example? It looks like the distances are off by a order of magnitude? – LeCoda Aug 17 '20 at 01:33
  • Usually- the distance of haversine is the great circle distance. You need to convert it back to earthy km's. I explained it here https://stackoverflow.com/questions/63121268/how-can-i-introduce-the-radio-in-query-radius-balltree-sklearn-radians-km The example is query_radius. Sklearn also supports Nearest Neighbors. Which is implemented very similar to NN used below (same speed). Don't use methods require calculating full distance matrix. – Willem Hendriks Aug 17 '20 at 08:55
  • try defining NN as `from haversine import haversine NN = NearestNeighbors(n_neighbors=1, metric=haversine)`. Note that there are no quotes in `metric=haversine`. In fact `metric='haversine'` uses the haversine distance in sklearn which is expressed in radius. – SoufianeK Aug 17 '20 at 09:18
  • @SoufianeK Unfortunately this is giving me the wrong distances. I like the idea of using this approach as it is more accurate but unfortunately looks like it will be a bit difficult to implement : / – LeCoda Aug 17 '20 at 09:58
  • @MichaelHolborn this is surprising since haversine lib is a reference in computing distances and sklearn lib is a reference in machine learning task as finding the nearest neighbor. I used them for similar tasks on production. Can you provide me an example code and its result to find what is wrong? – SoufianeK Aug 17 '20 at 10:02
  • Sure - perhaps it is data types? I will link above – LeCoda Aug 17 '20 at 10:03
  • Maybe it is data types. Can you first try with my fake data frame, that I copied from your question, to be sure about the code before we move forward with the data type? Also, make sure to input latitude then longitude, which is the norm for computing distances for haversine (see https://github.com/mapado/haversine or https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric) – SoufianeK Aug 17 '20 at 10:07
  • Your code works perfectly, it does not work with the datasets I have. THe lat and lon are obj(64) types – LeCoda Aug 17 '20 at 10:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219962/discussion-between-soufianek-and-michael-holborn). – SoufianeK Aug 17 '20 at 12:20
1

I would like to post an article that I found and tried myself and it worked while I was at the university. You can use Google Distance Matrix Api. Rather than showing particular code, I would like to refer you to the article itself:

https://medium.com/how-to-use-google-distance-matrix-api-in-python/how-to-use-google-distance-matrix-api-in-python-ef9cd895303c

For a given data set organized in rows of latitude and longitude coordinates, you can calculate the distance between consecutive rows. That will give you the actual distance between two different points.

Marios
  • 26,333
  • 8
  • 32
  • 52