0

I have a df that looks like this, where I've grouped by id

 id     lat          lon
 1       NaN         NaN
 1       40.121      23.749
 1      -56.154     -39.572
 1       21.908      17.537
 1       31.221     -36.186
 1      -56.655      0.016
 2       NaN         NaN
 2      -36.438      14.874
 2      -21.422      81.271
 2       43.961     -95.551
 3       NaN         NaN
 3       79.821     -56.781

Using the haversine function, I'd like to calculate the distance of the current row to the previous row. So the first entry of the new column would be calculated by using

lat 1 = 40.121

lon 1 = 23.749

lat 2 = -56.154

lon 2 = -39.572

Del-m10
  • 3
  • 5

1 Answers1

0

Adapted from this answer. The linked answer shows how to compute distance between each row and some fixed value for longitude/latitude - my adaptation allows it to work for your case.

First, get all the values you need on the same row using shift:

df['lon2'] = df['lon'].shift(-1)
df['lat2'] = df['lat'].shift(-1)

Giving:

    id     lat     lon    lat2    lon2
0    1     NaN     NaN  40.121  23.749
1    1  40.121  23.749 -56.154 -39.572
2    1 -56.154 -39.572  21.908  17.537
3    1  21.908  17.537  31.221 -36.186
4    1  31.221 -36.186 -56.655   0.016
5    1 -56.655   0.016     NaN     NaN
6    2     NaN     NaN -36.438  14.874
7    2 -36.438  14.874 -21.422  81.271
8    2 -21.422  81.271  43.961 -95.551
9    2  43.961 -95.551     NaN     NaN
10   3     NaN     NaN  79.821 -56.781
11   3  79.821 -56.781     NaN     NaN

Then define function for distance calculation:

from numpy import cos, sin, arcsin, sqrt
from math import radians

def haversine(row):
    lon1 = row['lon']
    lat1 = row['lat']
    lon2 = row['lon2']
    lat2 = row['lat2']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a)) 
    km = 6367 * c
    return km

And apply it to your data using apply:

df['distance'] = df.apply(haversine, axis=1)

Giving:

    id     lat     lon    lat2    lon2      distance
0    1     NaN     NaN  40.121  23.749           NaN
1    1  40.121  23.749 -56.154 -39.572  12237.017692
2    1 -56.154 -39.572  21.908  17.537  10187.684397
3    1  21.908  17.537  31.221 -36.186   5387.540299
4    1  31.221 -36.186 -56.655   0.016  10343.267833
5    1 -56.655   0.016     NaN     NaN           NaN
6    2     NaN     NaN -36.438  14.874           NaN
7    2 -36.438  14.874 -21.422  81.271   6543.302199
8    2 -21.422  81.271  43.961 -95.551  17480.809345
9    2  43.961 -95.551     NaN     NaN           NaN
10   3     NaN     NaN  79.821 -56.781           NaN
11   3  79.821 -56.781     NaN     NaN           NaN

Which I believe shows the results you are looking for (I tested the first one and it seems to be correct).

If you like, you can get rid of the two secondary lat/lon columns once the calculations are complete:

df.drop(['lat2', 'lon2'], axis=1, inplace=True)

I should note that this solution will not give you the fastest possible calculations. See the second half of the answer I linked for an exploration of how this can be improved if performance is top priority here, although it will need to be adapted.

sjw
  • 6,213
  • 2
  • 24
  • 39
  • Thanks for the help! One quick question, when u use the haversine function in apply(), why don't u have to also insert the argument? – Del-m10 Feb 21 '19 at 10:00
  • apply() effectively loops over each row, passing the row automatically (or column, if you set axis=0 rather than 1) as the parameter to the function. – sjw Feb 21 '19 at 10:19