Given a dataframe df as follows:

    id              location        lon       lat
0    1            Onyx Spire  116.35425  39.87760
1    2        Unison Lookout  116.44333  39.93237
2    3       History Lookout  116.14857  39.73727
3    4     Domination Pillar  116.46387  39.96286
4    5           Union Tower  116.36373  39.95064
5    6   Ruby Forest Obelisk  116.35786  39.89463
6    7      Rust Peak Pillar  116.34870  39.98170
7    8      Ash Forest Tower  116.38461  39.94938
8    9  Prestige Mound Tower  116.34052  39.98977
9   10  Sapphire Mound Tower  116.35063  39.92982
10  11       Kinship Lookout  116.43020  39.99997
11  12    Exhibition Obelisk  116.45108  39.94371

For each location, I need to find the names of the other locations whose distance from it is less than or equal to, say, 5 km.

The code below is based on answers from this link:

import pandas as pd
from scipy.spatial import distance
from math import sin, cos, sqrt, atan2, radians

def get_distance(point1, point2):
    """Haversine distance in km between two (lat, lon) points given in degrees."""
    R = 6370  # Earth radius in km (approximate)
    lat1 = radians(point1[0])
    lon1 = radians(point1[1])
    lat2 = radians(point2[0])
    lon2 = radians(point2[1])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

all_points = df[['lat', 'lon']].values
dm = distance.cdist(all_points, all_points, get_distance)
pd.DataFrame(dm, index=df.index, columns=df.index)

Out:

           0          1          2   ...         9          10         11
0    0.000000   9.736316  23.494395  ...   5.813891  15.066709  11.054762
1    9.736316   0.000000  33.222475  ...   7.908015   7.598415   1.423357
2   23.494395  33.222475   0.000000  ...  27.492814  37.822285  34.549129
3   13.312235   3.815179  36.787014  ...  10.327235   5.024900   2.391864
4    8.160542   7.082601  30.000842  ...   2.569988   7.883467   7.484839
5    1.918235   8.409888  25.009951  ...   3.960618  13.235325   9.641336
6   11.583243   9.752599  32.096627  ...   5.770232   7.233093   9.692770
7    8.389761   5.350670  31.017383  ...   3.622002   6.835323   5.700434
8   12.525586  10.838805  32.501864  ...   6.720541   7.722060  10.722467
9    5.813891   7.908015  27.492814  ...   0.000000  10.334273   8.701063
10  15.066709   7.598415  37.822285  ...  10.334273   0.000000   6.502921
11  11.054762   1.423357  34.549129  ...   8.701063   6.502921   0.000000
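
As a side note, the same matrix can probably be built without calling a Python function once per pair, by using NumPy broadcasting. A minimal sketch, assuming the same R = 6370 km; haversine_matrix is a name made up here, and arcsin(sqrt(a)) is used in place of the equivalent atan2 form:

```python
import numpy as np

def haversine_matrix(lat_deg, lon_deg, radius_km=6370.0):
    """Pairwise haversine distances in km (hypothetical helper, not from the question)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row gives every pairwise difference at once.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Three rows from the question: Onyx Spire, Unison Lookout, Exhibition Obelisk
lat = [39.87760, 39.93237, 39.94371]
lon = [116.35425, 116.44333, 116.45108]
dm = haversine_matrix(lat, lon)
```

With the full dataframe this would be called as haversine_matrix(df['lat'], df['lon']).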

But I would like to get an output similar to the following dataframe. Please note that location1, location2 and location3 are the names of locations within a distance of <= 5 km from location (the paired names below may not be accurate; they are only examples to illustrate the format). NaN means no such location exists:

    id              location  ...            location2          location3
0    1            Onyx Spire  ...                  NaN                NaN
1    2        Unison Lookout  ...                  NaN                NaN
2    3       History Lookout  ...                  NaN                NaN
3    4     Domination Pillar  ...                  NaN                NaN
4    5           Union Tower  ...                  NaN                NaN
5    6   Ruby Forest Obelisk  ...                  NaN                NaN
6    7      Rust Peak Pillar  ...                  NaN                NaN
7    8      Ash Forest Tower  ...      Kinship Lookout                NaN
8    9  Prestige Mound Tower  ...                  NaN                NaN
9   10  Sapphire Mound Tower  ...                  NaN                NaN
10  11       Kinship Lookout  ...  Ruby Forest Obelisk  Domination Pillar
11  12    Exhibition Obelisk  ...                  NaN                NaN

How could I do that in Python? Thanks.

ah bon

2 Answers

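The idea is to create a mask for values that are not 0 and less than 5 km, then use DataFrame.dot for matrix multiplication to concatenate the matching names, and finally use Series.str.split to create new columns that are joined to the original (use df1.le(5) instead of df1.lt(5) if exactly 5 km should be included):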

df1 = pd.DataFrame(dm, index=df.index, columns=df.index)

df = (df.join((df1.ne(0) & df1.lt(5)).dot(df['location']+ ',')
                                     .str[:-1]
                                     .str.split(',', expand=True)
                                     .add_prefix('loc')))

print (df)
    id              location        lon       lat                 loc0  \
0    1            Onyx Spire  116.35425  39.87760  Ruby Forest Obelisk   
1    2        Unison Lookout  116.44333  39.93237    Domination Pillar   
2    3       History Lookout  116.14857  39.73727                        
3    4     Domination Pillar  116.46387  39.96286       Unison Lookout   
4    5           Union Tower  116.36373  39.95064     Rust Peak Pillar   
5    6   Ruby Forest Obelisk  116.35786  39.89463           Onyx Spire   
6    7      Rust Peak Pillar  116.34870  39.98170          Union Tower   
7    8      Ash Forest Tower  116.38461  39.94938          Union Tower   
8    9  Prestige Mound Tower  116.34052  39.98977          Union Tower   
9   10  Sapphire Mound Tower  116.35063  39.92982          Union Tower   
10  11       Kinship Lookout  116.43020  39.99997                        
11  12    Exhibition Obelisk  116.45108  39.94371       Unison Lookout   

                    loc1                  loc2                  loc3  
0                   None                  None                  None  
1     Exhibition Obelisk                  None                  None  
2                   None                  None                  None  
3     Exhibition Obelisk                  None                  None  
4       Ash Forest Tower  Prestige Mound Tower  Sapphire Mound Tower  
5   Sapphire Mound Tower                  None                  None  
6       Ash Forest Tower  Prestige Mound Tower                  None  
7       Rust Peak Pillar  Sapphire Mound Tower                  None  
8       Rust Peak Pillar                  None                  None  
9    Ruby Forest Obelisk      Ash Forest Tower                  None  
10                  None                  None                  None  
11     Domination Pillar                  None                  None  
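
To see why the matrix multiplication concatenates names, here is a small sketch on a made-up 3x3 distance matrix (the values and the names A, B, C are assumptions for illustration only). In the dot product, True * 'A,' gives 'A,' and False * 'A,' gives an empty string, so summing a row joins the names where the mask is True:

```python
import pandas as pd

# dtype=object keeps plain Python strings, so bool * str behaves as described.
names = pd.Series(["A", "B", "C"], dtype=object)

# Hypothetical symmetric distance matrix in km with a zero diagonal.
dist = pd.DataFrame([[0.0, 2.0, 9.0],
                     [2.0, 0.0, 4.0],
                     [9.0, 4.0, 0.0]])

mask = dist.ne(0) & dist.lt(5)           # True where another point is closer than 5 km
joined = mask.dot(names + ",").str[:-1]  # "B", "A,C", "B"
wide = joined.str.split(",", expand=True).add_prefix("loc")
```

One caveat of this approach is that a location name containing a comma would be split into two columns, so a separator that cannot occur in the names may be safer.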

To sort the neighbours from shortest to longest distance, use:

df1 = pd.DataFrame(dm, index=df.index, columns=df['location'])

df1 = df.join(df1.apply(lambda x: pd.Series(x[(x!=0)&(x < 5)].sort_values().index), axis=1)
                .add_prefix('loc'))
print (df1)
    id              location        lon       lat                  loc0  \
0    1            Onyx Spire  116.35425  39.87760   Ruby Forest Obelisk   
1    2        Unison Lookout  116.44333  39.93237    Exhibition Obelisk   
2    3       History Lookout  116.14857  39.73727                   NaN   
3    4     Domination Pillar  116.46387  39.96286    Exhibition Obelisk   
4    5           Union Tower  116.36373  39.95064      Ash Forest Tower   
5    6   Ruby Forest Obelisk  116.35786  39.89463            Onyx Spire   
6    7      Rust Peak Pillar  116.34870  39.98170  Prestige Mound Tower   
7    8      Ash Forest Tower  116.38461  39.94938           Union Tower   
8    9  Prestige Mound Tower  116.34052  39.98977      Rust Peak Pillar   
9   10  Sapphire Mound Tower  116.35063  39.92982           Union Tower   
10  11       Kinship Lookout  116.43020  39.99997                   NaN   
11  12    Exhibition Obelisk  116.45108  39.94371        Unison Lookout   

                    loc1                 loc2                  loc3  
0                    NaN                  NaN                   NaN  
1      Domination Pillar                  NaN                   NaN  
2                    NaN                  NaN                   NaN  
3         Unison Lookout                  NaN                   NaN  
4   Sapphire Mound Tower     Rust Peak Pillar  Prestige Mound Tower  
5   Sapphire Mound Tower                  NaN                   NaN  
6            Union Tower     Ash Forest Tower                   NaN  
7   Sapphire Mound Tower     Rust Peak Pillar                   NaN  
8            Union Tower                  NaN                   NaN  
9       Ash Forest Tower  Ruby Forest Obelisk                   NaN  
10                   NaN                  NaN                   NaN  
11     Domination Pillar                  NaN                   NaN  
jezrael
  • Thanks, btw, could we arrange `loc0, loc1, ... loc3` from shortest to longest distance? – ah bon Mar 04 '21 at 08:53
  • @ahbon - It is complicated, but possible. I am working on an answer. – jezrael Mar 04 '21 at 09:11
  • BTW, I found that some of the paired locations are not within the `5 km` range; maybe this is caused by an inaccurate Earth radius setting, `R = 6370`? – ah bon Mar 04 '21 at 09:16
  • @ahbon - Unfortunately this area is unknown to me, so no idea. – jezrael Mar 04 '21 at 09:17
  • @ahbon - Maybe it needs a solution with geopandas; I have never worked with it, so no idea. I added a solution for sorting from lowest to highest values. – jezrael Mar 04 '21 at 09:23
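
On the radius question above: the haversine distance is R * c, so it scales linearly with R. A quick check, assuming 6370 km from the question and 6371 km as a commonly used mean radius, suggests the choice of R shifts a 5 km distance by less than a metre:

```python
r_question = 6370.0   # km, value used in get_distance
r_mean = 6371.0       # km, commonly used mean Earth radius

threshold_km = 5.0
# Distance is R * c, so switching radius multiplies every distance by this factor.
scale = r_mean / r_question
shift_m = (threshold_km * scale - threshold_km) * 1000

print(f"relative change: {scale - 1:.6%}, shift at 5 km: {shift_m:.2f} m")
```

So a different radius would only change the result for pairs lying within about a metre of the threshold.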

Here is a method using scikit-learn's BallTree, with the results sorted from shortest to longest distance:

from sklearn.neighbors import BallTree
import pandas as pd
import numpy as np


data = { 'lon' : [116.35425, 116.44333, 116.14857, 116.46387, 116.36373, 116.35786, 116.34870, 116.38461, 116.34052, 116.35063, 116.43020, 116.45108],
'lat' : [39.87760, 39.93237, 39.73727, 39.96286, 39.95064, 39.89463, 39.98170, 39.94938, 39.98977, 39.92982, 39.99997, 39.94371],
'location' : ["Onyx Spire", "Unison Lookout", "History Lookout", "Domination Pillar", "Union Tower", "Ruby Forest Obelisk", "Rust Peak Pillar", "Ash Forest Tower", "Prestige Mound Tower", "Sapphire Mound Tower", "Kinship Lookout", "Exhibition Obelisk"]}

locations = pd.DataFrame.from_dict(data)

Create the BallTree

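# The haversine metric expects (lat, lon) in radians
locations_radians = np.radians(locations[["lat", "lon"]].values)
tree = BallTree(locations_radians, leaf_size=12, metric='haversine')
distance_in_meters = 5000
earth_radius = 6371000  # meters

# Haversine distances are on the unit sphere, so express the radius in radians
radius = distance_in_meters / earth_radius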

Note that I sort is_within by distance to get is_within_sorted:

is_within, distances = tree.query_radius(locations_radians, r=radius, count_only=False, return_distance=True) 

is_within_sorted = [ iw[ np.argsort(di) ] for iw,di in zip(is_within, distances) ]
distances_sorted = [np.sort(d) for d in distances]

is_within contains arrays of different lengths holding the indices of the locations that are within the radius of each point. You could just store those, together with the actual distances.
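
A minimal sketch of storing the indices together with the distances, as one row per pair, on three of the locations only. The names pairs, neighbour and distance_km are assumptions made for illustration, and sort_results=True is used here in place of the manual argsort:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

locations = pd.DataFrame({
    "location": ["Onyx Spire", "Ruby Forest Obelisk", "History Lookout"],
    "lat": [39.87760, 39.89463, 39.73727],
    "lon": [116.35425, 116.35786, 116.14857],
})

earth_radius_km = 6371.0
coords = np.radians(locations[["lat", "lon"]].to_numpy())
tree = BallTree(coords, metric="haversine")

# sort_results=True returns each neighbour list ordered by distance.
ind, dist = tree.query_radius(coords, r=5.0 / earth_radius_km,
                              return_distance=True, sort_results=True)

names = locations["location"].to_numpy()
rows = []
for i, (neighbours, d) in enumerate(zip(ind, dist)):
    for j, dj in zip(neighbours, d):
        if i != j:  # skip the point itself (distance 0)
            rows.append((names[i], names[j], dj * earth_radius_km))

pairs = pd.DataFrame(rows, columns=["location", "neighbour", "distance_km"])
```

A long table like this avoids padding with NaN, and it can still be pivoted into the wide location_0, location_1, ... layout if that is needed.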

Now I pad each row with NaN and create a DataFrame to join later:

pad_with_nans = [ np.pad(locations.location[iw], (0,locations.lat.size), 'constant', constant_values=np.nan)[:locations.lat.size] for iw in is_within_sorted]
location_names = [ 'location_{}'.format(i) for i in range(locations.lat.size) ]
within_radius = pd.DataFrame(pad_with_nans, index=locations.index, columns=location_names)

and we have

locations.join(within_radius)

Giving

         lon       lat           location         location_0  \
0  116.35425  39.87760         Onyx Spire         Onyx Spire   
1  116.44333  39.93237     Unison Lookout     Unison Lookout   
2  116.14857  39.73727    History Lookout    History Lookout   
3  116.46387  39.96286  Domination Pillar  Domination Pillar   
4  116.36373  39.95064        Union Tower        Union Tower   

            location_1            location_2        location_3  \
0  Ruby Forest Obelisk                   NaN               NaN   
1   Exhibition Obelisk     Domination Pillar               NaN   
2                  NaN                   NaN               NaN   
3   Exhibition Obelisk        Unison Lookout               NaN   
4     Ash Forest Tower  Sapphire Mound Tower  Rust Peak Pillar   

             location_4  location_5  location_6  location_7  location_8  \
0                   NaN         NaN         NaN         NaN         NaN   
1                   NaN         NaN         NaN         NaN         NaN   
2                   NaN         NaN         NaN         NaN         NaN   
3                   NaN         NaN         NaN         NaN         NaN   
4  Prestige Mound Tower         NaN         NaN         NaN         NaN   

   location_9  location_10  location_11  
0         NaN          NaN          NaN  
1         NaN          NaN          NaN  
2         NaN          NaN          NaN  
3         NaN          NaN          NaN  
4         NaN          NaN          NaN  

Each point is always within the radius of itself (distance 0), so location_0 is the point itself and you could remove that first column.

Willem Hendriks