How to find the distance in meters between two points in a dataframe?

Question

I have a dataframe in which the columns with suffix 1 are starting points and those with suffix 2 are end points. I want to find the distance in kilometers between them. I tried the following code but got an error:

import mpu
import pandas as pd
import numpy as np

data = {'lat1': [39.92123,  39.93883,  39.93883,  39.91034,  39.91248],
        'lon1': [116.51172, 116.51135, 116.51135, 116.51627, 116.47186],
        'lat2': [np.nan,    39.92123,  39.93883,  39.93883,  39.91034],
        'lon2': [np.nan,   116.51172, 116.51135, 116.51135, 116.51627]}

# Create DataFrame
df = pd.DataFrame(data)

df['distance'] = mpu.haversine_distance((df.lat1, df.lon1), (df.lat2, df.lon2))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The error is as expected: result of passing Series instead of scalar values. — SeaBean, Apr 29 '21 at 19:56
you are using a method that only works on _one_ pair of coordinates. If you want to vectorize this then [sklearn.metrics.pairwise.haversine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html) might be a better choice. — DrBwts, Apr 29 '21 at 19:59

SeaBean · Accepted Answer · 2021-04-30T09:25:03.170

1

Try using .apply() with a lambda function so that each row's coordinates are passed as scalar values, rather than passing four pandas Series to the function:

df['distance'] = df.apply(lambda x: mpu.haversine_distance((x.lat1, x.lon1), (x.lat2, x.lon2)), axis=1)

You can also use list(map(...)) for faster execution, as follows:

df['distance'] = list(map(mpu.haversine_distance, zip(df.lat1, df.lon1), zip(df.lat2, df.lon2)))
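For a million rows, another option is to skip the per-row Python call entirely and write the haversine formula with NumPy array operations. The following is only a minimal sketch, not part of mpu: `haversine_km` and `EARTH_RADIUS_KM` are names made up for this example, and the radius of 6371 km is an assumed mean Earth radius. A side effect of doing the arithmetic in NumPy is that rows containing NaN simply produce NaN instead of raising an error.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in kilometers


def haversine_km(lat1, lon1, lat2, lon2):
    """Row-wise great-circle distance in km; NaN inputs give NaN outputs."""
    # Convert every input to a float array in radians
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against rounding pushing `a` slightly above 1; NaN stays NaN
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


data = {'lat1': [39.92123,  39.93883,  39.93883,  39.91034,  39.91248],
        'lon1': [116.51172, 116.51135, 116.51135, 116.51627, 116.47186],
        'lat2': [np.nan,    39.92123,  39.93883,  39.93883,  39.91034],
        'lon2': [np.nan,   116.51172, 116.51135, 116.51135, 116.51627]}
df = pd.DataFrame(data)

# One call over whole columns, no Python-level loop
df['distance'] = haversine_km(df.lat1, df.lon1, df.lat2, df.lon2)
```

Unlike mpu.haversine_distance, this sketch does no range checking on latitude, so swapped latitude/longitude columns would silently give wrong distances rather than an error.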

edited Apr 30 '21 at 09:25

answered Apr 29 '21 at 19:53

SeaBean


Thank you. But the apply function will be slow, as I have 1 million rows. Are there any faster alternatives? – John Mayer Apr 29 '21 at 19:55
@JohnMayer OK, you can use `list(map(...))`. Let me show you. – SeaBean Apr 29 '21 at 19:58
Thank you. I also found that vectorization can solve the problem, and the code would look like `np.vectorize(mpu.haversine_distance)((df.lat1, df.lon1), (df.lat2, df.lon2))`. Or can vectorization not be applied in this example? – John Mayer Apr 29 '21 at 20:24
@JohnMayer [`np.vectorize()`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) does not really vectorize anything. The doc says "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." Hence, it can't improve on looping. – SeaBean Apr 29 '21 at 20:28
I also tried the option with `list(map(...))`, but got the error `ValueError: too many values to unpack (expected 2)`, and `data.apply(lambda x: mpu.haversine_distance((x.lat1, x.lon1), (x.lat2, x.lon2), axis=1)) ` (I added one bracket) gives the error ` 'Series' object has no attribute 'lat1' ` – John Mayer Apr 30 '21 at 07:32
@JohnMayer Let me look at it and come back to you. I also just noticed a missing bracket in the apply() call. Please note the missing `)` should go after `x.lon2`. I will look further into the `list(map(...))` issue. – SeaBean Apr 30 '21 at 08:30
@JohnMayer I have now fixed the `list(map(...))` issue as well. Neither option has a syntax error any more. Both now show the logical error `lat1=116.51, but must be in [-90,+90]`. This should be caused by the data. – SeaBean Apr 30 '21 at 09:02
Indeed, latitude should be in this range; I updated the dataframe (just swapped the latitude and longitude values). However, I now get the same error that latitude should be in this range, because I have NaN, which lies outside the range (dropping this row is not an option). – John Mayer Apr 30 '21 at 09:29
@JohnMayer I think maybe you need to switch to another tool that supports your coordinate ranges. Anyway, supporting NaN is also a problem for those tools. Maybe you have to drop the entries with NaN before passing them to the tool. – SeaBean Apr 30 '21 at 09:49
@JohnMayer Just for your info: I benchmarked `.apply()` and `list(map(...))` with 5,000 coordinates, and `list(map(...))` is more than 12 times faster. – SeaBean Apr 30 '21 at 09:52
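
On the NaN rows discussed in the comments: one way to keep them without dropping anything is to call the scalar function only on rows where all four coordinates are present, and leave NaN in the `distance` column everywhere else. The sketch below is an illustration under assumptions, not tested against mpu itself: `scalar_haversine_km` is a hypothetical stand-in with the same `(origin, destination)` calling convention and a similar latitude range check, defined here only so the example runs on its own. In practice mpu.haversine_distance would presumably be passed in its place.

```python
import math

import numpy as np
import pandas as pd


def scalar_haversine_km(origin, destination):
    """Hypothetical stand-in for a scalar-only function such as mpu.haversine_distance."""
    lat1, lon1 = origin
    lat2, lon2 = destination
    # NaN fails this comparison, which is why NaN rows must be masked out first
    if not (-90.0 <= lat1 <= 90.0 and -90.0 <= lat2 <= 90.0):
        raise ValueError("latitude must be in [-90, +90]")
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))  # 6371 km: assumed Earth radius


data = {'lat1': [39.92123,  39.93883,  39.93883,  39.91034,  39.91248],
        'lon1': [116.51172, 116.51135, 116.51135, 116.51627, 116.47186],
        'lat2': [np.nan,    39.92123,  39.93883,  39.93883,  39.91034],
        'lon2': [np.nan,   116.51172, 116.51135, 116.51135, 116.51627]}
df = pd.DataFrame(data)

# True only for rows where every coordinate is present
valid = df[['lat1', 'lon1', 'lat2', 'lon2']].notna().all(axis=1)

# Start with NaN everywhere, then fill in the rows that can be computed
df['distance'] = np.nan
df.loc[valid, 'distance'] = list(map(
    scalar_haversine_km,
    zip(df.loc[valid, 'lat1'], df.loc[valid, 'lon1']),
    zip(df.loc[valid, 'lat2'], df.loc[valid, 'lon2']),
))
```

All five rows stay in the dataframe; the first one, which has no end point, just keeps NaN as its distance.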
