2

I want to calculate the geo-distance between latitude-longitude.

I had checked this thread Vectorizing Haversine distance calculation in Python but when I am using it for two different set of coordinates, I m getting an error.

df1 size can be in millions and if there is any other way to calculate accurate geo distance in less time then it would be really helpful.

length1 = 1000
d1 = np.random.uniform(-90, 90, length1)
d2 = np.random.uniform(-180, 180, length1)
length2 = 100
d3 = np.random.uniform(-90, 90, length2)
d4 = np.random.uniform(-180, 180, length2)
coords = tuple(zip(d1, d2))
df1 = pd.DataFrame({'coordinates':coords})
coords = tuple(zip(d3, d4))
df2 = pd.DataFrame({'coordinates':coords})

def get_diff(df1, df2):
    data1 = np.array(df1['coordinates'].tolist())
    data2 = np.array(df2['coordinates'].tolist())
    lat1 = data1[:,0]                     
    lng1 = data1[:,1]
    lat2 = data2[:,0]                     
    lng2 = data2[:,1]
    #print(lat1.shape)
    #print(lng1.shape)
    #print(lat2.shape)
    #print(lng2.shape)
    diff_lat = lat1[:,None] - lat2

    diff_lng = lng1[:,None] - lng2
    #print(diff_lat.shape)
    #print(diff_lng.shape)
    d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat1) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

get_diff(df1, df2)
ValueError                                Traceback (most recent call last)
<ipython-input-58-df06c7cff72c> in <module>
----> 1 get_diff(df1, df2)

<ipython-input-57-9bd8f10189e6> in get_diff(df1, df2)
     26     print(diff_lat.shape)
     27     print(diff_lng.shape)
---> 28     d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat1) * np.sin(diff_lng/2)**2
     29     return 2 * 6371 * np.arcsin(np.sqrt(d))

ValueError: operands could not be broadcast together with shapes (1000,1000) (1000,100) 
Divakar
  • 218,885
  • 19
  • 262
  • 358
Rajat Gupta
  • 72
  • 1
  • 13

2 Answers2

2

Pairwise haversine distances

Here's a vectorized way with broadcasting based on this post -

def convert_to_arrays(df1, df2):
    d1 = np.array(df1['coordinates'].tolist())
    d2 = np.array(df2['coordinates'].tolist())
    return d1,d2

def broadcasting_based_lng_lat(data1, data2):
    # data1, data2 are the data arrays with 2 cols and they hold
    # lat., lng. values in those cols respectively
    data1 = np.deg2rad(data1)                     
    data2 = np.deg2rad(data2)                     

    lat1 = data1[:,0]                     
    lng1 = data1[:,1]         

    lat2 = data2[:,0]                     
    lng2 = data2[:,1]         

    diff_lat = lat1[:,None] - lat2
    diff_lng = lng1[:,None] - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1[:,None])*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

Hence, to solve your case to get all pairwise haversine distances, it would be -

broadcasting_based_lng_lat(*convert_to_arrays(df1,df2))

Elementwise haversine distances

For element-wise haversine distance computations between two data, such that each data holds latitude and longitude in two columns each or lists of two elements each, we would skip some of the extensions to 2D and end up with something like this -

def broadcasting_based_lng_lat_elementwise(data1, data2):
    # data1, data2 are the data arrays with 2 cols and they hold
    # lat., lng. values in those cols respectively
    data1 = np.deg2rad(data1)                     
    data2 = np.deg2rad(data2)                     

    lat1 = data1[:,0]                     
    lng1 = data1[:,1]         

    lat2 = data2[:,0]                     
    lng2 = data2[:,1]         

    diff_lat = lat1 - lat2
    diff_lng = lng1 - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1)*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

Sample run with a dataframe holding the two data in two columns -

In [42]: np.random.seed(0)
    ...: a = np.random.randint(10,100,(5,2)).tolist()
    ...: b = np.random.randint(10,100,(5,2)).tolist()
    ...: df = pd.DataFrame({'A':a,'B':b})

In [43]: df
Out[43]: 
          A         B
0  [54, 57]  [80, 98]
1  [74, 77]  [98, 22]
2  [77, 19]  [68, 75]
3  [93, 31]  [49, 97]
4  [46, 97]  [56, 98]

In [44]: from haversine import haversine

In [45]: [haversine(i,j) for (i,j) in zip(df.A,df.B)]
Out[45]: 
[3235.9659882513424,
 2399.6124657290075,
 2012.0851666001824,
 4702.8069773315865,
 1114.1193334220534]

In [46]: broadcasting_based_lng_lat_elementwise(np.vstack(df.A), np.vstack(df.B))
Out[46]: 
array([3235.96151855, 2399.60915125, 2012.08238739, 4702.80048155,
       1114.11779454])

Those slight differences are largely because haversine library assumes 6371.0088 as the earth radius, while we are taking it as 6371 here.

Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • I manually checked some coordinates but the distance is not correct for example df1 coordinates `(49.1709245, -122.9538407)` and df2 coordinates `(49.0456795, -122.7021047)` so calculated_distance = `array([[1020.49177408]])` But actual distance should be around 23 km – Rajat Gupta Aug 28 '19 at 17:13
  • @RajatGupta What are you getting from the haversine function - https://github.com/mapado/haversine? The proposed method is suppoed to simulate that behavior. – Divakar Aug 28 '19 at 17:17
  • The above code is supposed to give the same output as mentioned in the haversine function but it is not. Is there something wrong in this code? – Rajat Gupta Aug 28 '19 at 17:48
  • @RajatGupta What are you getting from the haversine function and from the proposed method? For example, for the pair of coorindinates you just listed earlier in the comments, what are the outputs? – Divakar Aug 28 '19 at 17:49
  • sorry my bad, the data I was testing already had coordinates in radians and again these were converting into radians so I was getting 0.54 km as distance but after removing the conversion these worked. Thanks you..!! – Rajat Gupta Aug 28 '19 at 17:55
0

Use simple print statements to display the arguments to your equation. Some of the operations in you sin expression are different lengths -- the underlying broadcast operation (sort of the vectorized equivalent of zip) requires equal lengths.

Prune
  • 76,765
  • 14
  • 60
  • 81