2

I have two datasets and the datasets having longitude and latitude values.

let us say :

  • point_x1 is (lang_1, latt_1)
  • point_x2 is (lang_2, latt_2)
  • the first dataset has "n" rows of data with point_x1, x1
  • the second dataset has "m" rows of data with point_x2, x2

where m > n

Edit: Note: m will be 20000 or more, and n will be 5000 or more.

I want to group or merge both datasets.

I want to find for each point_x2 nearest point_x1 and then would like to create a new data with point_x2, x2, x1 (where point_x1 is nearest to point_x2) for each line in dataset2.

dataset1-sample:

-91.850532 40.376043 x1_a1
-91.850519 40.376043 x1_a2
-91.850504 40.376043 x1_a3
-91.850487 40.376043 x1_a4
-91.850399 40.376044 x1_a5
-91.850353 40.376044 x1_a6

dataset2-sample:

-91.848442 40.380573 x2_a0
-91.850292 40.378533 x2_a1
-91.849919 40.377883 x2_a2
-91.849109 40.385833 x2_a3
-91.845884 40.381623 x2_a4
-91.847344 40.376693 x2_a5
-91.846937 40.382653 x2_a6
-91.849827 40.381343 x2_a7
-91.850149 40.383474 x2_a8
-91.848569 40.384904 x2_a9
-91.849063 40.377384 x2_a10
-91.845563 40.378604 x2_a11

I have no much idea of data science or geo analysis. Looking for help on approach.

Please suggest me how to do that.

Cœur
  • 37,241
  • 25
  • 195
  • 267
Hara
  • 1,467
  • 4
  • 18
  • 35
  • some similar question in `r` language is found... in this [link](https://stackoverflow.com/questions/27442506/find-nearest-points-of-latitude-and-longitude-from-different-data-sets-with-diff) – Hara Mar 08 '18 at 15:10

2 Answers2

2

I write some sample code.You could try like this:

from math import radians, cos, sin, asin, sqrt
import pandas as pd

def geo_distance(lng1,lat1,lng2,lat2):
    lng1, lat1, lng2, lat2 = map(radians, [lng1, lat1, lng2, lat2])
    dlon=lng2-lng1
    dlat=lat2-lat1
    a=sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2 
    dis=2*asin(sqrt(a))*6371*1000
    return dis



df1 = pd.DataFrame({'lang_1':[-91.850532,-91.850519,-91.850504,-91.850487,-91.850399,-91.850353],
                    'latt_1':[40.376043,40.376043,40.376043,40.376043,40.376044,40.376044],
                    'x1':['x1_a1','x1_a2','x1_a3','x1_a4','x1_a5','x1_a6']})
df2 = pd.DataFrame({'lang_2':[-91.848442,-91.850292,-91.849919,-91.849109,-91.845884,-91.847344,-91.846937,-91.849827,-91.850149,-91.848569,-91.849063,-91.845563],
                    'latt_2':[40.380573,40.378533,40.377883,40.385833,40.381623,40.376693,40.382653,40.381343,40.383474,40.384904,40.377384,40.378604],
                    'x2':['x2_a0','x2_a1','x2_a2','x2_a3','x2_a4','x2_a5','x2_a6','x2_a7','x2_a8','x2_a9','x2_a10','x2_a11']})

df1['key']=0
df2['key']=0

df_cartesian = df2.merge(df1, how='outer')
df_cartesian['geo_distance']=df_cartesian.apply(lambda row:geo_distance(row['lang_1'],row['latt_1'],row['lang_2'],row['latt_2']),axis=1)
df_cartesian_min_distance=df_cartesian.sort_values(by="geo_distance").groupby(["lang_2","latt_2","x2"],as_index=False).first()
print(df_cartesian_min_distance.ix[:,["lang_2","latt_2","x2","x1"]])
William Feirie
  • 624
  • 3
  • 7
  • Hi William Feire, I am able to execute your simple code. But my dataframes are m=20k lines and n=5k lines. when I give that data as input,it is hanging. can u suggest me? – Hara Mar 09 '18 at 06:06
1

I'm not sure if it's useful, but I came up with a somewhat more compact version than William:

import pandas

dataset1 = pandas.DataFrame(data={'x':(-91.850532, -91.850519, -91.850504, -91.850487, -91.850399, -91.850353),
                                  'y':(40.376043, 40.376043,  0.376043, 40.376043, 40.376044, 40.376044)},
                            index=('x1_a1', 'x1_a2', 'x1_a3', 'x1_a4', 'x1_a5', 'x1_a6'))


dataset2 = pandas.DataFrame(data={'x':(-91.848442, -91.850292, -91.849919, -91.849109, -91.845884, -91.847344, -91.846937, -91.849827, -91.850149, -91.848569, -91.849063, -91.845563),
                                  'y':(40.380573, 40.378533, 40.377883, 40.385833, 40.381623, 40.376693, 40.382653, 40.381343, 40.383474, 40.384904, 40.377384, 40.378604)},
                            index=('x2_a0', 'x2_a1', 'x2_a2', 'x2_a3', 'x2_a4', 'x2_a5', 'x2_a6', 'x2_a7', 'x2_a8', 'x2_a9', 'x2_a10', 'x2_a11'))

closest_points = {}
for name, point in dataset1.iterrows():
    distances = (((dataset2 - point) ** 2).sum(axis=1)**.5)
    closest_points[name] = distances.sort_values().index[0]

It takes the simple Euclidian between the two sets of points and, for each point in dataset1, gets the name of the point in dataset2 closest to it. I'm sure you can easily adapt it to suit your needs from here on.

  • Hi Martin, thanks for the answer. But I am completely new to pandas and its functions. I am very good in python, but not on numpy and pandas as of now. I am learning pandas. thanks for your contribution. – Hara Mar 08 '18 at 16:45
  • 1
    I'm not much more advanced than yourself. Pandas is a world unto itself, and it takes some getting used to, but if you're working with number crunching like this, it's worth it. You might want to look more closely at GeoPandas to see if it has functionality in which to express this more elegantly. – Martin Sand Christensen Mar 08 '18 at 21:13
  • yes true. I want if solution is in geo pandas. my datasets are bigger. each of them are atleast 20k lines. – Hara Mar 09 '18 at 04:19