3

I have a dataset storing marathon segment splits (5K, 10K, ...) in seconds and identifiers (age, gender, country) as columns and individuals as rows. Each cell for a marathon segment split column may contain either a float (specifying the number of seconds required to reach the segment) or "NaN". A row may contain up to 4 NaN values. Here is some sample data:

      Age M/F Country      5K     10K     15K     20K    Half  Official Time
2323   38   M     CHI  1300.0  2568.0  3834.0  5107.0  5383.0        10727.0
2324   23   M     USA  1286.0  2503.0  3729.0  4937.0  5194.0        10727.0
2325   36   M     USA  1268.0  2519.0  3775.0  5036.0  5310.0        10727.0
2326   37   M     POL  1234.0  2484.0  3723.0  4972.0  5244.0        10727.0
2327   32   M     DEN     NaN  2520.0  3782.0  5046.0  5319.0        10728.0

I intend to calculate a best fit line for marathon split times (using only the columns between "5K" to "Half") for each row with at least one NaN; from the best fit line for the row, I want to impute a data point to replace the NaN with.

From the sample data, I intend to calculate a best fit line for row 2327 only (using values 2520.0, 3782.0, 5046.0, and 5319.0). Using this best fit line for row 2327, I intend to replace the NaN 5K time with the predicted 5K time.

How can I calculate this best fit line for each row with NaN?

Thanks in advance.

1 Answers1

1

I "extrapolated" a solution from here from 2015 https://stackoverflow.com/a/31340344/6366770 (pun intended). Extrapolation definition I am not sure if in 2021 pandas has reliable extrapolation methods, so you might have to use scipy or other libraries.

When doing the Extrapolation , I excluded the "Half" column. That's because the running distances of 5K, 10K, 15K and 20K are 100% linear. It is literally a straight line if you exclude the half marathon column. But, that doesn't mean that expected running times are linear. Obviously, as you run a longer distance your average time per kilometer is lower. But, this "gets the job done" without getting too involved in an incredibly complex calculation.

Also, this is worth noting. Let's say that the first column was 1K instead of 5K. Then, this method would fail. It only works because the distances are linear. If it was 1K, you would also have to use the data from the rows of the other runners, unless you were making calculations based off the kilometers in the column names themselves. Either way, this is an imperfect solution, but much better than pd.interpolation. I linked another potential solution in the comments of tdy's answer.

import scipy as sp
import pandas as pd

# we focus on the four numeric columns from 5K-20K and and Transpose the dataframe, since we are going horizontally across columns. T
#T he index must also be numeric, so we drop it, but don't worry, we add back just the numbers and maintain the index later on.
df_extrap = df.iloc[:,4:8].T.reset_index(drop=True)

# create a scipy interpolation function to be called by a custom extrapolation function  later on
def scipy_interpolate_func(s):
    s_no_nan = s.dropna()
    return sp.interpolate.interp1d(s_no_nan.index.values, s_no_nan.values, kind='linear', bounds_error=False)


def my_extrapolate_func(scipy_interpolate_func, new_x):
    x1, x2 = scipy_interpolate_func.x[0], scipy_interpolate_func.x[-1]
    y1, y2 = scipy_interpolate_func.y[0], scipy_interpolate_func.y[-1]
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (new_x - x1)

#Concat each extrapolated column altogether and transpose back to initial shape to be added to the original dataframe
s_extrapolated = pd.concat([pd.Series(my_extrapolate_func(scipy_interpolate_func(df_extrap[s]), 
                                                              df_extrap[s].index.values), 
                                          index=df_extrap[s].index) for s in df_extrap.columns], axis=1).T
cols = ['5K', '10K', '15K', '20K']
df[cols] = s_extrapolated
df

Out[1]: 
   index  Age M/F Country      5K     10K     15K     20K    Half  \
0   2323   38   M     CHI  1300.0  2569.0  3838.0  5107.0  5383.0   
1   2324   23   M     USA  1286.0  2503.0  3720.0  4937.0  5194.0   
2   2325   36   M     USA  1268.0  2524.0  3780.0  5036.0  5310.0   
3   2326   37   M     POL  1234.0  2480.0  3726.0  4972.0  5244.0   
4   2327   32   M     DEN  1257.0  2520.0  3783.0  5046.0  5319.0   

   Official Time  
0        10727.0  
1        10727.0  
2        10727.0  
3        10727.0  
4        10728.0  
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • nice! would be great if pandas/scipy could handle this more seamlessly, but it looks like this is the best we've got for now. +10 – tdy Aug 25 '21 at 16:50