
I want to calculate the distance between two GPS coordinates (the first and the last one for each tripId) to get the total distance for each trip. My dataframe looks like this:

   tripId   latitude    longitude   timestamp
0   1817603 53.155273   8.207176    2021-05-24 00:29:22
1   1817603 53.155271   8.206898    2021-05-24 00:29:38
2   1817603 53.155213   8.206314    2021-05-24 00:29:44
3   1817603 53.155135   8.206429    2021-05-24 00:29:50
4   1817603 53.154950   8.206565    2021-05-24 00:29:56
... ... ... ... ...
195 1817888 53.092805   8.212095    2021-05-24 08:27:54
196 1817888 53.093024   8.211756    2021-05-24 08:27:59
197 1817888 53.093305   8.211383    2021-05-24 08:28:05
198 1817888 53.093594   8.211026    2021-05-24 08:28:10
199 1817888 53.093853   8.210708    2021-05-24 08:28:15

I computed the per-step distance with `s = pd.Series(haversine_vector(df, df.shift(), Unit.KILOMETERS), index=df.index, name='distance_K')`, but I need the distance for the whole trip, per ID. I used the following as a test and it works, but I still need the exact final duration for each trip:

for i in range(1, df.shape[0] - 1):
    if df['tripId'][i] == df['tripId'][i + 1]:
        df.loc[i, 'distance'] = df.loc[i - 1, 'distance'] + df.loc[i, 'distance_K']
    else:
        df.loc[i, 'distance'] = df.loc[i, 'distance_K']
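
For reference, the running total in the loop above can also be written without explicit iteration, using a grouped cumulative sum. This is a minimal sketch on hypothetical data, assuming `distance_K` already holds the per-step distances:

```python
import pandas as pd

# Hypothetical per-step distances for two trips.
df = pd.DataFrame({
    'tripId': [1, 1, 1, 2, 2],
    'distance_K': [0.0, 0.1, 0.2, 0.0, 0.3],
})

# Running distance within each trip; the sum restarts at every new tripId.
df['distance'] = df.groupby('tripId')['distance_K'].cumsum()
```

The last row of each group then already contains the trip total.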
Hermoine
  • Have you tried `df.groupby("tripId").sum()` after you computed each distance? – OneCricketeer Nov 04 '21 at 21:23
  • @OneCricketeer, it looks like the groups need to be identified before the distance is calculated; otherwise the distance between the last point of trip N and the first point of trip N+1 will be taken into account. – hpchavaz Nov 04 '21 at 21:28
  • @OneCricketeer no, it doesn't work. I added a small test to the description, which works but is still not complete :/ I only want to know the final exact duration – Hermoine Nov 04 '21 at 21:55
  • I'm not sure you can get `pandas` to do this. All you need is the first and last point for each trip, and from that you can compute both distance and elapsed time. I think I'd be iterating the rows, or processing it before going in to `pandas`. MAYBE you could extract `min(timestamp)` for each tripId into one DF, and `max(timestamp)` for each tripId into another df, then process the rows in parallel. – Tim Roberts Nov 04 '21 at 21:59
  • Can you check my answer if it's what you expect, please? I used the suggestions of @TimRoberts and @OneCricketeer. – Corralien Nov 04 '21 at 22:34

2 Answers


Use `groupby` + `apply` to compute the haversine distance within each trip:

import numpy as np

# Inspired by https://stackoverflow.com/a/4913653/15239951
def haversine_series(grp):
    # grp is the sub-dataframe of one trip; shift pairs each point with its
    # predecessor (the first row is paired with itself, giving 0 km).
    lon1 = grp['longitude']
    lat1 = grp['latitude']
    lon2 = grp['longitude'].shift(fill_value=grp['longitude'].iloc[0])
    lat2 = grp['latitude'].shift(fill_value=grp['latitude'].iloc[0])
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6371 * c  # mean Earth radius in km
    return km

df['distance_K'] = df.groupby('tripId').apply(haversine_series).droplevel(0)

Note: I assume your dataframe is already sorted by the timestamp column.
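
If that is not guaranteed, sorting first is a one-liner. A minimal sketch on a toy frame (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    'tripId': [2, 1, 1],
    'timestamp': pd.to_datetime(['2021-05-24 00:31:00',
                                 '2021-05-24 00:29:38',
                                 '2021-05-24 00:29:22']),
})

# Order rows by trip, then chronologically within each trip.
df = df.sort_values(['tripId', 'timestamp']).reset_index(drop=True)
```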

At this point, your dataframe looks like:

>>> df
      tripId   latitude  longitude            timestamp  distance_K
0    1817603  53.155273   8.207176  2021-05-24 00:29:22    0.000000
1    1817603  53.155271   8.206898  2021-05-24 00:29:38    0.018538
2    1817603  53.155213   8.206314  2021-05-24 00:29:44    0.039470
3    1817603  53.155135   8.206429  2021-05-24 00:29:50    0.011577
4    1817603  53.154950   8.206565  2021-05-24 00:29:56    0.022481
195  1817888  53.092805   8.212095  2021-05-24 08:27:54    0.000000
196  1817888  53.093024   8.211756  2021-05-24 08:27:59    0.033248
197  1817888  53.093305   8.211383  2021-05-24 08:28:05    0.039958
198  1817888  53.093594   8.211026  2021-05-24 08:28:10    0.040012
199  1817888  53.093853   8.210708  2021-05-24 08:28:15    0.035781

Now the total distance and time for each trip are easy to get with `groupby` + `agg`:

>>> df.groupby('tripId') \
      .agg(total_distance=('distance_K', 'sum'), 
           total_time=('timestamp', lambda x: x.max()-x.min())) \
      .reset_index()

    tripId  total_distance      total_time
0  1817603        0.092066 0 days 00:00:34
1  1817888        0.148999 0 days 00:00:21
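
If you need the duration as a plain number rather than a Timedelta, `.dt.total_seconds()` converts it. A sketch on hypothetical aggregated output:

```python
import pandas as pd

agg = pd.DataFrame({
    'tripId': [1817603, 1817888],
    'total_time': pd.to_timedelta(['0 days 00:00:34', '0 days 00:00:21']),
})

# Timedelta -> float seconds, easier to plot or export.
agg['total_seconds'] = agg['total_time'].dt.total_seconds()
```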
Corralien

You can use:

from haversine import haversine_vector

df = df.groupby('tripId').apply(
    lambda g: g.assign(distance=[0, *haversine_vector(
        g.iloc[:-1][['latitude', 'longitude']].values,
        g.iloc[1:][['latitude', 'longitude']].values,
    )])
).droplevel(0)
df
#     tripId   latitude  longitude                timestamp  distance
# 0  1817603  53.155273   8.207176      2021-05-24 00:29:22  0.000000
# 1  1817603  53.155271   8.206898      2021-05-24 00:29:38  0.018538
# 2  1817603  53.155213   8.206314      2021-05-24 00:29:44  0.039470
# 3  1817603  53.155135   8.206429      2021-05-24 00:29:50  0.011577
# 4  1817603  53.154950   8.206565      2021-05-24 00:29:56  0.022481
# 5  1817888  53.092805   8.212095      2021-05-24 08:27:54  0.000000
# 6  1817888  53.093024   8.211756      2021-05-24 08:27:59  0.033248
# 7  1817888  53.093305   8.211383      2021-05-24 08:28:05  0.039958
# 8  1817888  53.093594   8.211026      2021-05-24 08:28:10  0.040012
# 9  1817888  53.093853   8.210708      2021-05-24 08:28:15  0.035781

and get the total time and distance:

df.groupby('tripId').agg(
    {
        'timestamp': lambda g: g.iloc[-1] - g.iloc[0],
        'distance':'sum'
    }
)
#               timestamp  distance
# tripId                           
# 1817603 0 days 00:00:34  0.092066
# 1817888 0 days 00:00:21  0.148999
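
As a follow-up, the average speed per trip falls out of those two aggregates. A sketch using hypothetical numbers shaped like the output above:

```python
import pandas as pd

out = pd.DataFrame({
    'timestamp': pd.to_timedelta(['0 days 00:00:34', '0 days 00:00:21']),
    'distance': [0.092066, 0.148999],
}, index=pd.Index([1817603, 1817888], name='tripId'))

# km divided by hours gives km/h for each trip.
out['speed_kmh'] = out['distance'] / (out['timestamp'].dt.total_seconds() / 3600)
```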
Nils Werner
  • No, this misses the point. He doesn't want step-by-step distance. He wants the total distance and the total elapsed time for each trip as a unit. – Tim Roberts Nov 04 '21 at 22:43
  • I have added a final aggregate that calculates the total distance and time difference – Nils Werner Nov 04 '21 at 22:49