I'm trying to calculate the 10 day momentum of a statistic for a sports player that takes into account the date of the match each statistic was recorded for.
I'm not from a stats background but so far I'm cobbled together a theory that I could perform a linear regression over a rolling 10 day window where X
= days elapsed from the most recent date
in the window and y
= the statistic
. As each statistic is recorded for a given match then the window needs to exclude the statistic for the row that the output of the regression will be recorded in. The rolling regression must also be performed by player
.
Here's a sample dataset - I'm looking to calculate the momentum
column:
+----+--------+------------+------+----------+
| id | player | match_date | stat | momentum |
+----+--------+------------+------+----------+
| 1 | 1 | 2022-01-01 | 1500 | NaN |
| 2 | 1 | 2022-01-03 | 1600 | NaN |
| 3 | 1 | 2022-01-10 | 1500 | -50 |
| 4 | 1 | 2022-01-11 | 1800 | 3.73 |
| 5 | 1 | 2022-01-18 | 2100 | -300 |
| 6 | 2 | 2022-01-01 | 2100 | NaN |
| 7 | 2 | 2022-01-03 | 1800 | NaN |
| 8 | 2 | 2022-01-10 | 1600 | 150 |
| 9 | 2 | 2022-01-11 | 1500 | 48.51 |
| 10 | 2 | 2022-01-18 | 1500 | 100 |
+----+--------+------------+------+----------+
To illustrate how the momentum
column is calculated I've added in the manual calculation of the days elapsed in the tables below:
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
| id | player | match_date | stat | id1_days_elapsed | id2_days_elapsed | id3_days_elapsed | id4_days_elapsed | id5_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
| 1 | 1 | 2022-01-01 | 1500 | NA | 2 | 9 | 10 | | NaN |
| 2 | 1 | 2022-01-03 | 1600 | | | 7 | 8 | | NaN |
| 3 | 1 | 2022-01-10 | 1500 | | | | 1 | 8 | -50 |
| 4 | 1 | 2022-01-11 | 1800 | | | | | 7 | 3.73 |
| 5 | 1 | 2022-01-18 | 2100 | | | | | | -300 |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
| id | player | match_date | stat | id6_days_elapsed | id7_days_elapsed | id8_days_elapsed | id9_days_elapsed | id10_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
| 6 | 2 | 2022-01-01 | 2100 | NA | 2 | 9 | 10 | | NaN |
| 7 | 2 | 2022-01-03 | 1800 | | | 7 | 8 | | NaN |
| 8 | 2 | 2022-01-10 | 1600 | | | | 1 | 8 | 150 |
| 9 | 2 | 2022-01-11 | 1500 | | | | | 7 | 48.51 |
| 10 | 2 | 2022-01-18 | 1500 | | | | | | 100 |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
Most answers I've read here point to statsmodels.regression.rolling.RollingOLS
but reading through the documentation this can't cope with a ragged time series which I have. I've read one answer that uses resampling when calculating a rolling mean but I'm not sure of what implications there would be for a regression.
Below is my pseudo attempt using groupby.rolling.apply
:
import numpy as np
import datetime as dt
import pandas as pd
from sklearn.linear_model import LinearRegression
dates = [
dt.datetime(2022, 1, 1),
dt.datetime(2022, 1, 3),
dt.datetime(2022, 1, 10),
dt.datetime(2022, 1, 11),
dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
{
"id": range(1, 11),
"player": [1 for i in range(5)] + [2 for i in range(5)],
"match_date": dates + dates,
"stat": (1500, 1600, 1500, 1800, 2100, 2100, 1800, 1600, 1500, 1500)
}
)
def calc_regression(df: pd.DataFrame, col: str):
df = df[~df[col].isnull()]
df["days_diff"] = df["match_date"].diff() / pd.to_timedelta(1, 'd')
days_elapsed = df.loc[::-1, "days_diff"].cumsum()[::-1].values.reshape((-1, 1))
model = LinearRegression()
return model.fit(days_elapsed, df[col].values).coef_[0]
df["momentum"] = (
df
.groupby("player")
.rolling("10d", on="match_date", closed="left")
.apply(calc_regression, args=("stat", ))
)
I'm aware that this doesn't work because rolling.apply
applies the calc_regression
on a column by column basis. However, I thought it might be useful to explain what I'm trying to do.
In the real DataFrame there are 3m rows and 80k players so a for loop will take an age - is there an alternative?