I'm working on a project where I import a .gpx file and convert it to a pandas DataFrame for further analysis. The file contains location and time data from workouts recorded with, for example, Strava, Endomondo, or Runkeeper. I have already calculated statistics such as total distance, time, and speed, but I also want to find the fastest time for specific distances within the workout. For a 16-kilometer workout, for instance, I want to calculate my fastest 5K, 10K, and so on within those 16 km.
I wrote something that works, but it involves looping over the DataFrame. Since looping over a DataFrame is something I'm supposed to avoid, I feel there should be a more efficient solution.
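The DataFrame looks something like this (distance_dis_3d is the distance in meters since the previous point, time_delta is the elapsed time in seconds since the previous point):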
   distance_dis_3d  time_delta
0         0.000000         0.0
1         0.000000        18.0
2        28.229476         1.0
3         5.452599         3.0
4         3.078864         1.0
...
This code works for finding the fastest 5000 meters:
from math import floor
import pandas as pd

# Running totals of distance (m) and time (s) at every track point
df_selected['distance_cumsum'] = df_selected['distance_dis_3d'].cumsum()
df_selected['time_cumsum'] = df_selected['time_delta'].cumsum()
rows = []  # one entry per start point that still has at least 5000 m ahead of it
for i in range(len(df_selected.index)):
    # All rows that lie at least 5000 m beyond start point i
    df_xK = df_selected[(df_selected['distance_cumsum'] - df_selected['distance_cumsum'].iat[i]) >= 5000]
    if len(df_xK.index) != 0:
        time = df_xK['time_cumsum'].iat[0] - df_selected['time_cumsum'].iat[i]
        distance = df_xK['distance_cumsum'].iat[0] - df_selected['distance_cumsum'].iat[i]
        minutes_per_kilometer = (time / 60) / (distance / 1000)
        # Collect in a list: DataFrame.append was removed in pandas 2.0
        rows.append({'time': time, 'distance': distance, 'minutes_per_kilometer': minutes_per_kilometer})
df_output = pd.DataFrame(rows, columns=['time', 'distance', 'minutes_per_kilometer'])
best_5k = df_output.loc[df_output['minutes_per_kilometer'].idxmin()]
print('Time 5K:', floor(best_5k['time'] / 60), 'min', floor(best_5k['time'] % 60), 'sec.')
I know I should use vectorization or .apply(), but I can't figure out how to do this here. So any help is much appreciated! Thanks!
A test file can be downloaded here: http://gofile.me/2RsVN/dos1tPTVD
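To make the goal concrete, here is a rough sketch of the kind of loop-free version I have in mind. It is not a tested solution on the real file, just one possible direction: it uses `np.searchsorted` on the cumulative distance to find, for every start point, the first point that is at least the target distance further on. It assumes the distance deltas are never negative (so the cumulative distance is sorted), and the names `fastest_segment` and `target_m` are made up for this example.

```python
import numpy as np
import pandas as pd


def fastest_segment(df, target_m=5000):
    """Hypothetical helper: best pace over at least `target_m` meters.

    Assumes 'distance_dis_3d' holds non-negative distance deltas in meters
    and 'time_delta' holds time deltas in seconds.
    """
    dist = df['distance_dis_3d'].cumsum().to_numpy()
    time = df['time_delta'].cumsum().to_numpy()

    # For each start i, index of the first point j with dist[j] >= dist[i] + target_m.
    # This relies on dist being sorted (non-decreasing).
    end = np.searchsorted(dist, dist + target_m, side='left')

    # Starts too close to the finish get end == len(dist); drop them.
    valid = end < len(dist)
    start = np.flatnonzero(valid)
    end = end[valid]
    if start.size == 0:
        return None  # the workout is shorter than target_m

    seg_time = time[end] - time[start]
    seg_dist = dist[end] - dist[start]
    pace = (seg_time / 60) / (seg_dist / 1000)  # minutes per kilometer

    best = pace.argmin()
    return {
        'time': float(seg_time[best]),
        'distance': float(seg_dist[best]),
        'minutes_per_kilometer': float(pace[best]),
    }


# Small made-up example (not the real workout data)
demo = pd.DataFrame({
    'distance_dis_3d': [0.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0],
    'time_delta': [0.0, 300.0, 300.0, 250.0, 250.0, 250.0, 250.0],
})
print(fastest_segment(demo, target_m=5000))
```

Like my loop, this picks the first point at or beyond the target distance, so the segment can be slightly longer than 5000 m. Because it compares `dist[j]` with `dist[i] + target_m` rather than the difference with `target_m`, floating-point rounding could in rare cases pick a neighboring point compared to the loop version.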