
I'm working on a project where I import a .gpx file and transform it into a Pandas dataframe for further analysis. The file contains location and time data from workouts recorded with apps such as Strava, Endomondo, or Runkeeper. I have already calculated statistics such as total distance, time, and speed, but I also want to find the fastest (best) time for specific distances within the workout. So for a 16-kilometer workout, I want to calculate my fastest 5k, 10k, and so on within those 16 km.

I wrote something that works, but it involves looping over the dataframe. Since looping over a dataframe is something I'm supposed to avoid, I feel like there should be a more efficient solution.

The dataframe looks something like this:

   distance_dis_3d  time_delta
0         0.000000         0.0
1         0.000000        18.0
2        28.229476         1.0
3         5.452599         3.0
4         3.078864         1.0
...

This code works for finding the fastest 5000 meters:

import pandas as pd
from math import floor

# Cumulative distance (m) and time (s) from the start of the workout
df_selected['distance_cumsum'] = df_selected['distance_dis_3d'].cumsum()
df_selected['time_cumsum'] = df_selected['time_delta'].cumsum()

df_output = pd.DataFrame(columns=['time', 'distance', 'minutes_per_kilometer'])

for i in range(len(df_selected.index)):

    # All rows that are at least 5000 m further along the route than row i
    df_xK = df_selected[(df_selected['distance_cumsum'] - df_selected['distance_cumsum'].iat[i]) >= 5000]
    if len(df_xK.index) != 0:
        # The first of those rows closes the (just over) 5 km window starting at row i
        time = df_xK['time_cumsum'].iat[0] - df_selected['time_cumsum'].iat[i]
        distance = df_xK['distance_cumsum'].iat[0] - df_selected['distance_cumsum'].iat[i]
        minutes_per_kilometer = (time / 60) / (distance / 1000)
        df_output = df_output.append({'time': time, 'distance': distance,
                                      'minutes_per_kilometer': minutes_per_kilometer}, ignore_index=True)

best_5k = df_output.loc[df_output['minutes_per_kilometer'].idxmin()]

print('Time 5K:', floor(best_5k['time'] / 60), 'min', floor(best_5k['time'] % 60), 'sec.')

I know I should use vectorization or .apply(), but I can't figure out how to do this here. So any help is much appreciated! Thanks!
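For what it's worth, the sketch below is roughly the direction I was trying, using np.searchsorted on the cumulative distance (the cumulative sums are non-decreasing, so searchsorted should find, for every starting row, the first row at least 5000 m further along, which is what the loop does one row at a time). The variable names are just placeholders and I haven't been able to verify that this is correct:

import numpy as np

dist = df_selected['distance_dis_3d'].cumsum().to_numpy()
time = df_selected['time_delta'].cumsum().to_numpy()

# For every starting row, the index of the first row that is >= 5000 m further along
end_idx = np.searchsorted(dist, dist + 5000)

valid = end_idx < len(dist)                 # starting rows that still have 5 km left
times = time[end_idx[valid]] - time[valid]  # seconds for each ~5 km window
dists = dist[end_idx[valid]] - dist[valid]  # meters for each ~5 km window

pace = (times / 60) / (dists / 1000)        # minutes per kilometer
print('Time 5K:', times[pace.argmin()], 'sec')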

A testfile can be downloaded here: http://gofile.me/2RsVN/dos1tPTVD

  • Could you provide a sample dataframe with the fastest 5000 meters included, so one can provide a possible solution? Also include an expected output. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – sammywemmy Jan 29 '20 at 21:32
  • Do you bin the data first? What are the rules for 4k: does that fall into 5k? – Umar.H Jan 29 '20 at 21:35
  • Are you looking for a rolling 5k (1-5, 2-6, 3-7, ...) or fixed (1-5, 6-10, 11-15)? – Ethan Jan 29 '20 at 21:37
  • @sammywemmy, I've added a link to a testfile. The data is transformed into a dataframe using the same code as described here: https://towardsdatascience.com/how-tracking-apps-analyse-your-gps-data-a-hands-on-tutorial-in-python-756d4db6715d – WouterNieuwerth Jan 29 '20 at 21:44
  • @Datanovice, I'm basically trying to find the slice of the dataframe that has a cumulative distance over 5k with the smallest cumulative time difference. – WouterNieuwerth Jan 29 '20 at 21:47
  • @Ethan It should be a rolling 5k. The fastest 5k within the entire workout could start at any point (or row in the dataframe). – WouterNieuwerth Jan 29 '20 at 21:49

1 Answer


Sorry I didn't get this to you sooner, but I think this is what you are looking for.

import pandas as pd

def rollKilos(kilometers):
    # Per-kilometer split times (minutes) for a 16 km workout
    df = pd.DataFrame.from_dict({'KM': (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16),
                                 'Time': (5.3, 5.25, 5.35, 5.36, 5.42, 5.2, 5.25, 5.5,
                                          5.4, 5.15, 5.25, 5.35, 5.3, 5.2, 5.3, 5.1)})
    df = df.set_index('KM')
    # Rolling mean of `kilometers` consecutive splits; .min() ignores the leading NaNs
    return df['Time'].rolling(kilometers).mean().min()

df.rolling(5).mean()
Out[23]: 
     Time
KM       
1     NaN
2     NaN
3     NaN
4     NaN
5   5.336
6   5.316
7   5.316
8   5.346
9   5.354
10  5.300
11  5.310
12  5.330
13  5.290
14  5.250
15  5.280
16  5.250

rollKilos(5)
Out[30]: 5.249999999999998

rollKilos(3)
Out[31]: 5.200000000000002

rollKilos(10)
Out[32]: 5.279999999999999
Ethan
  • Thank you for your answer! It's really close to what I'm looking for, but the problem is that the distances and times are not evenly spaced. I don't know exactly what number to pick for .rolling(#), because it can be different for every row. Based on your example dataframe, the data unfortunately looks more like this: `df = pd.DataFrame.from_dict({'KM':(1,4,5,6,9,10,11,12,14,15,16), 'Time':(5.3, 5.25, 5.5, 5.4, 5.15, 5.25, 5.35, 5.3, 5.2, 5.3, 5.1)})` So from 1 km to 5 km is 3 rows, but from 11 km to 15 km (the same distance) is 4 rows. – WouterNieuwerth Jan 30 '20 at 18:33
  • @WouterNieuwerth Sorry, I didn't pick up that the data could be sparse. In that case, the only suggestion I can make is to create your dataframe with rows in 1-kilometer increments, calculate the average time for each record in your dataset, and then fill forward for the kilometers with no data (a rough sketch of this follows below). In reality, your dataset is incomplete and can never accurately give the fastest 5K: if the first KM was slow but the next was very fast, and the 3rd was slow, the fast KM would be masked. – Ethan Jan 30 '20 at 21:40
  • Thanks @Ethan. I guess I'll stick to the code I have and accept that it needs a bit of time to munch through all the data. :) – WouterNieuwerth Jan 31 '20 at 19:07
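
A rough sketch of the fill-forward suggestion from Ethan's comment, using the sparse example dataframe from the earlier comment; the 1-kilometer reindex grid and the ffill step are an interpretation of that comment, not code from either poster:

import pandas as pd

# Sparse example from the comment above: one row per recorded kilometer mark
df = pd.DataFrame.from_dict({'KM': (1, 4, 5, 6, 9, 10, 11, 12, 14, 15, 16),
                             'Time': (5.3, 5.25, 5.5, 5.4, 5.15, 5.25, 5.35, 5.3, 5.2, 5.3, 5.1)})
df = df.set_index('KM')

# Reindex onto a full 1 km grid and fill forward, so every kilometer has an
# (approximate) split time, then take the fastest rolling 5 km average
full = df.reindex(range(1, 17)).ffill()
print(full['Time'].rolling(5).mean().min())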