I have a DataFrame in which the columns with suffix 1 are starting points and the columns with suffix 2 are end points. I want to find the distance in kilometers between them. I tried the following code but got an error:

import mpu
import pandas as pd
import numpy as np

data = {'lat1': [39.92123,  39.93883,  39.93883,  39.91034,  39.91248],
        'lon1': [116.51172, 116.51135, 116.51135, 116.51627, 116.47186],
        'lat2': [np.nan,    39.92123,  39.93883,  39.93883,  39.91034],
        'lon2': [np.nan,   116.51172, 116.51135, 116.51135, 116.51627  ]}  
  
# Create DataFrame  
df = pd.DataFrame(data)  


df['distance'] = mpu.haversine_distance((df.lat1, df.lon1), (df.lat2, df.lon2))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

John Mayer
  • You're not going to tell us what the error is? – takendarkk Apr 29 '21 at 19:46
  • @takendarkk Sorry, I have added it now. – John Mayer Apr 29 '21 at 19:53
  • The error is as expected: it is the result of passing Series instead of scalar values. – SeaBean Apr 29 '21 at 19:56
  • You are using a method that only works on _one_ pair of coordinates. If you want to vectorize this, then [sklearn.metrics.pairwise.haversine_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html) might be a better choice. – DrBwts Apr 29 '21 at 19:59

1 Answer

Try using `.apply()` with a lambda function, so that the coordinates are passed to the function as scalar values rather than as four pandas Series:

df['distance'] = df.apply(lambda x: mpu.haversine_distance((x.lat1, x.lon1), (x.lat2, x.lon2)), axis=1)

You can also use list(map(...)) for faster execution, as follows:

df['distance'] = list(map(mpu.haversine_distance, zip(df.lat1, df.lon1), zip(df.lat2, df.lon2)))
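
Note that the first row has NaN end coordinates, and a range-check error for such rows is reported in the comments. A minimal sketch of one way around that without dropping the row is to wrap the scalar distance function so that it returns NaN whenever a coordinate is missing. Here `haversine_km` and `nan_safe` are hypothetical helpers written for illustration, not part of mpu; `haversine_km` is a plain-math stand-in so the snippet runs without extra packages, and the 6371 km radius is an assumed mean Earth radius.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(origin, destination):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = origin
    lat2, lon2 = destination
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Guard against rounding pushing `a` slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def nan_safe(distance_func):
    """Wrap a scalar distance function so missing coordinates yield NaN."""
    def wrapper(origin, destination):
        if any(math.isnan(v) for v in (*origin, *destination)):
            return math.nan
        return distance_func(origin, destination)
    return wrapper


# Same values as the question's DataFrame, as plain lists.
lat1 = [39.92123, 39.93883, 39.93883, 39.91034, 39.91248]
lon1 = [116.51172, 116.51135, 116.51135, 116.51627, 116.47186]
lat2 = [math.nan, 39.92123, 39.93883, 39.93883, 39.91034]
lon2 = [math.nan, 116.51172, 116.51135, 116.51135, 116.51627]

# Same list(map(...)) pattern as above, with the NaN guard applied.
distances = list(map(nan_safe(haversine_km), zip(lat1, lon1), zip(lat2, lon2)))
print(distances)
```

With mpu installed, `nan_safe(mpu.haversine_distance)` should be usable in the same way, since that function takes the same two `(lat, lon)` tuples.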
SeaBean
  • Thank you. But `apply` will be slow, as I have 1 million rows. Are there any faster alternatives? – John Mayer Apr 29 '21 at 19:55
  • @JohnMayer OK, you can use `list(map(...))`. Let me show you. – SeaBean Apr 29 '21 at 19:58
  • Thank you. I also found that vectorization might solve the problem, with code like `np.vectorize(mpu.haversine_distance)((df.lat1, df.lon1), (df.lat2, df.lon2))`. Or can vectorization not be applied in this example? – John Mayer Apr 29 '21 at 20:24
  • @JohnMayer [`np.vectorize()`](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) does not really vectorize the computation. The docs say: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop." Hence, it can't improve on plain looping. – SeaBean Apr 29 '21 at 20:28
  • I also tried the `list(map(...))` option, but got the error `ValueError: too many values to unpack (expected 2)`, and `data.apply(lambda x: mpu.haversine_distance((x.lat1, x.lon1), (x.lat2, x.lon2), axis=1)) ` (I added one bracket) gives the error ` 'Series' object has no attribute 'lat1' ` – John Mayer Apr 30 '21 at 07:32
  • @JohnMayer Let me look at it and come back to you. I also just noticed a missing bracket in the `apply()` call. Please note that the missing `)` should go after `x.lon2`. I will look further into the `list(map(...))` issue. – SeaBean Apr 30 '21 at 08:30
  • @JohnMayer I have now fixed the `list(map(...))` issue as well. Neither option has a syntax error any more. Both now raise a logical error, `lat1=116.51, but must be in [-90,+90]`, which should be caused by the data. – SeaBean Apr 30 '21 at 09:02
  • Indeed, latitude should be in this range, so I updated the dataframe (I just switched the numbers between latitude and longitude). However, I now get the same error that latitude should be in this range, because I have NaN, which lies outside the range (dropping this row is not an option). – John Mayer Apr 30 '21 at 09:29
  • @JohnMayer I think you may need to switch to another tool that supports your coordinate ranges. In any case, supporting NaN is also a problem for those tools. You may have to drop the entries with NaN before passing them to the tool. – SeaBean Apr 30 '21 at 09:49
  • @JohnMayer Just for your info: I benchmarked `.apply()` and `list(map(...))` with 5,000 coordinates, and `list(map(...))` is more than 12 times faster. – SeaBean Apr 30 '21 at 09:52
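
For the million-row case raised in the comments, a truly vectorized alternative is to write the haversine formula directly with NumPy array operations, so that the loop runs inside NumPy rather than in Python. The following is a rough sketch, not a drop-in replacement for mpu: `haversine_km_vectorized` is a hypothetical helper, the 6371 km default radius is an assumed mean Earth radius, and no latitude range check is performed. NaN inputs simply propagate to NaN outputs, so the first row does not need to be dropped.

```python
import numpy as np


def haversine_km_vectorized(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Row-wise great-circle distance in km; inputs are degrees (array-like)."""
    # Convert every input to a float array in radians.
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against rounding slightly outside [0, 1]; NaN stays NaN.
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Same values as the question's DataFrame, as plain arrays.
lat1 = np.array([39.92123, 39.93883, 39.93883, 39.91034, 39.91248])
lon1 = np.array([116.51172, 116.51135, 116.51135, 116.51627, 116.47186])
lat2 = np.array([np.nan, 39.92123, 39.93883, 39.93883, 39.91034])
lon2 = np.array([np.nan, 116.51172, 116.51135, 116.51135, 116.51627])

distance_km = haversine_km_vectorized(lat1, lon1, lat2, lon2)
print(distance_km)

# With the DataFrame from the question, the call would look like:
# df['distance'] = haversine_km_vectorized(df.lat1, df.lon1, df.lat2, df.lon2)
```

Because the whole computation is a handful of array operations, this should scale to millions of rows far better than `.apply()` or `list(map(...))`, though the exact speedup has not been benchmarked here.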