Python function to calculate distance using haversine formula in pandas

Question

(IPython notebook) (Bus statistics)

summary.head()

I need to calculate distance_travelled between each pair of consecutive rows, where 1) row['sequence'] != 0, since there is no distance when the bus is at its initial stop, and 2) row['track_id'] == previous_row['track_id'].

I have haversine formula defined:

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
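
As a quick sanity check, here is a minimal sketch that calls the function on a single pair of points. Note that it takes longitude before latitude, in decimal degrees, and returns kilometers. The coordinates are the first two stops of track 1-1 from the sample rows shown in the answer; the variable name d is only for illustration.

```python
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # 6371 km: mean Earth radius

# Stop with sequence 0, then stop with sequence 1 (lng first, lat second).
d = haversine(29.060010, 41.041870, 29.059980, 41.040859)
print(round(d, 3))  # roughly 0.112 km
```

The formula is symmetric, so swapping the two points gives the same distance.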

I am not exactly sure how to go about this. One idea is to use iterrows() and apply the haversine() function when a row's 'sequence' is not 0 and its 'track_id' is equal to the previous row's 'track_id'.

[EDIT] I figured there is no need to check whether the 'track_id' of a row and of the previous row is the same, since haversine() is applied to two rows only, and a row with sequence = 0 (distance == 0) already marks the point where the track_id changes. So, basically, apply haversine() to all rows whose 'sequence' != 0, i.e. haversine(previous_row.lng, previous_row.lat, current_row.lng, current_row.lat). I still need help with that, though.

[EDIT 2] I managed to achieve something similar with:

summary['distance_travelled'] = summary.apply(lambda row: haversine(row['lng'], row['lat'], previous_row['lng'], previous_row['lat']), axis=1)

where previous_row should refer to the actual previous row; right now it is only a placeholder, which does nothing.

Isn't this a dupe of this: http://stackoverflow.com/questions/25767596/using-haversine-formula-with-data-stored-in-a-pandas-dataframe/25767765#25767765? — EdChum, Dec 29 '15 at 19:07

score 2 · Accepted Answer · answered Dec 29 '15 at 13:14

IIUC you can try:

print(summary)

  track_id  sequence        lat        lng  distance_travelled
0      1-1         0  41.041870  29.060010                   0
4      1-1         1  41.040859  29.059980                   0
6      1-1         2  41.039242  29.059731                   0

# create new shifted columns holding the previous row's coordinates
summary['latp'] = summary['lat'].shift(1)
summary['lngp'] = summary['lng'].shift(1)
print(summary)

  track_id  sequence        lat        lng  distance_travelled       latp  \
0      1-1         0  41.041870  29.060010                   0        NaN   
4      1-1         1  41.040859  29.059980                   0  41.041870   
6      1-1         2  41.039242  29.059731                   0  41.040859   

       lngp  
0       NaN  
4  29.06001  
6  29.05998  

# distance between each row and the previous row (NaN for the first row)
summary['distance_travelled'] = summary.apply(lambda row: haversine(row['lng'], row['lat'], row['lngp'], row['latp']), axis=1)
# remove the helper columns lngp, latp
summary = summary.drop(['lngp','latp'], axis=1)
print(summary)

  track_id  sequence        lat        lng  distance_travelled
0      1-1         0  41.041870  29.060010                 NaN
4      1-1         1  41.040859  29.059980            0.112446
6      1-1         2  41.039242  29.059731            0.181011
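
Note that shift(1) runs over the whole frame, so the first row of any later track would be paired with the last stop of the track before it. Below is a minimal sketch of one way to respect both conditions from the question (same track_id, sequence != 0), assuming the rows are already ordered by track_id and sequence. It shifts within each track via groupby and sets the first stop of each track to 0. The second track ('2-1') and the helper row_distance are made up here for illustration.

```python
from math import radians, cos, sin, asin, sqrt
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371  # kilometers

# Track 1-1 comes from the sample rows; track 2-1 is invented test data.
summary = pd.DataFrame({
    'track_id': ['1-1', '1-1', '1-1', '2-1', '2-1'],
    'sequence': [0, 1, 2, 0, 1],
    'lat': [41.041870, 41.040859, 41.039242, 41.050000, 41.051000],
    'lng': [29.060010, 29.059980, 29.059731, 29.070000, 29.070000],
})

# Shift inside each track so a row never sees a stop from another track.
prev = summary.groupby('track_id')[['lat', 'lng']].shift(1)
summary['latp'] = prev['lat']
summary['lngp'] = prev['lng']

def row_distance(row):
    # The first stop of a track has no previous stop: distance is 0.
    if pd.isnull(row['latp']):
        return 0.0
    return haversine(row['lngp'], row['latp'], row['lng'], row['lat'])

summary['distance_travelled'] = summary.apply(row_distance, axis=1)
summary = summary.drop(['latp', 'lngp'], axis=1)
print(summary)
```

With this variant the rows with sequence 0 get 0 instead of NaN, and the first row of track 2-1 is not compared against the last stop of track 1-1.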

If performance matters, calling `.apply(haversine, axis=1)` will be much slower than writing `haversine` to take numpy arrays and doing `summary['distance_travelled'] = haversine(summary['lng'], summary['lat'], summary['lngp'], summary['latp'])` — TomAugspurger, Dec 29 '15 at 14:46
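
A minimal sketch of the vectorized approach suggested in the comment above, assuming numpy is available. The name haversine_np is chosen here for illustration and does not come from the thread; the function mirrors the scalar haversine but uses numpy functions, so it works on whole columns at once.

```python
import numpy as np
import pandas as pd

def haversine_np(lon1, lat1, lon2, lat2):
    """Vectorized haversine: degrees in, kilometers out.

    Accepts scalars, numpy arrays, or pandas Series.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Sample rows from the answer, with the same index labels.
summary = pd.DataFrame({
    'track_id': ['1-1', '1-1', '1-1'],
    'sequence': [0, 1, 2],
    'lat': [41.041870, 41.040859, 41.039242],
    'lng': [29.060010, 29.059980, 29.059731],
}, index=[0, 4, 6])

# One call over entire columns; no per-row apply and no helper columns.
summary['distance_travelled'] = haversine_np(
    summary['lng'].shift(1), summary['lat'].shift(1),
    summary['lng'], summary['lat'],
)
print(summary)
```

As in the answer, the first row comes out as NaN because it has no previous row.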
