
I'm trying to calculate the 10-day momentum of a statistic for a sports player, taking into account the date of the match each statistic was recorded for.

I'm not from a stats background, but so far I've cobbled together a theory that I could perform a linear regression over a rolling 10-day window, where X = days elapsed from the most recent date in the window and y = the statistic. As each statistic is recorded for a given match, the window needs to exclude the statistic for the row that the output of the regression will be recorded against. The rolling regression must also be performed per player.

Here's a sample dataset - I'm looking to calculate the momentum column:

+----+--------+------------+------+----------+
| id | player | match_date | stat | momentum |
+----+--------+------------+------+----------+
|  1 |      1 | 2022-01-01 | 1500 | NaN      |
|  2 |      1 | 2022-01-03 | 1600 | NaN      |
|  3 |      1 | 2022-01-10 | 1500 | -50      |
|  4 |      1 | 2022-01-11 | 1800 | 3.73     |
|  5 |      1 | 2022-01-18 | 2100 | -300     |
|  6 |      2 | 2022-01-01 | 2100 | NaN      |
|  7 |      2 | 2022-01-03 | 1800 | NaN      |
|  8 |      2 | 2022-01-10 | 1600 | 150      |
|  9 |      2 | 2022-01-11 | 1500 | 48.51    |
| 10 |      2 | 2022-01-18 | 1500 | 100      |
+----+--------+------------+------+----------+

To illustrate how the momentum column is calculated I've added in the manual calculation of the days elapsed in the tables below:

+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
| id | player | match_date | stat | id1_days_elapsed | id2_days_elapsed | id3_days_elapsed | id4_days_elapsed | id5_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
|  1 |      1 | 2022-01-01 | 1500 | NA               |                2 |                9 |               10 |                  | NaN      |
|  2 |      1 | 2022-01-03 | 1600 |                  |                  |                7 |                8 |                  | NaN      |
|  3 |      1 | 2022-01-10 | 1500 |                  |                  |                  |                1 |                8 | -50      |
|  4 |      1 | 2022-01-11 | 1800 |                  |                  |                  |                  |                7 | 3.73     |
|  5 |      1 | 2022-01-18 | 2100 |                  |                  |                  |                  |                  | -300     |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+

+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
| id | player | match_date | stat | id6_days_elapsed | id7_days_elapsed | id8_days_elapsed | id9_days_elapsed | id10_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
|  6 |      2 | 2022-01-01 | 2100 | NA               |                2 |                9 |               10 |                   | NaN      |
|  7 |      2 | 2022-01-03 | 1800 |                  |                  |                7 |                8 |                   | NaN      |
|  8 |      2 | 2022-01-10 | 1600 |                  |                  |                  |                1 |                 8 | 150      |
|  9 |      2 | 2022-01-11 | 1500 |                  |                  |                  |                  |                 7 | 48.51    |
| 10 |      2 | 2022-01-18 | 1500 |                  |                  |                  |                  |                   | 100      |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
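
As a sanity check on the tables above, the non-trivial value for id 4 can be reproduced directly: its window holds ids 1-3 with days elapsed [10, 8, 1] and stats [1500, 1600, 1500], and the slope of a least-squares line through those points is the 3.73 momentum shown. A quick standalone sketch using `np.polyfit`:

```python
import numpy as np

# Window used for id 4 (match_date 2022-01-11): ids 1-3, excluding id 4 itself.
days_elapsed = np.array([10, 8, 1])   # 2022-01-11 minus each prior match date
stats = np.array([1500, 1600, 1500])

# Momentum = slope of the least-squares line stat ~ days_elapsed
slope = np.polyfit(days_elapsed, stats, deg=1)[0]
print(round(slope, 6))  # 3.731343
```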

Most answers I've read here point to statsmodels.regression.rolling.RollingOLS, but reading through the documentation it can't cope with a ragged time series, which is what I have. I've read one answer that uses resampling when calculating a rolling mean, but I'm not sure what the implications of that would be for a regression.

Below is my pseudo attempt using groupby.rolling.apply:

import numpy as np
import datetime as dt
import pandas as pd
from sklearn.linear_model import LinearRegression

dates = [
    dt.datetime(2022, 1, 1),
    dt.datetime(2022, 1, 3),
    dt.datetime(2022, 1, 10),
    dt.datetime(2022, 1, 11),
    dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1 for i in range(5)] + [2 for i in range(5)],
        "match_date": dates + dates,
        "stat": (1500, 1600, 1500, 1800, 2100, 2100, 1800, 1600, 1500, 1500)
    }
)


def calc_regression(df: pd.DataFrame, col: str):
    df = df[~df[col].isnull()]
    df["days_diff"] = df["match_date"].diff() / pd.to_timedelta(1, 'd')
    days_elapsed = df.loc[::-1, "days_diff"].cumsum()[::-1].values.reshape((-1, 1))
    model = LinearRegression()
    return model.fit(days_elapsed, df[col].values).coef_[0]


df["momentum"] = (
    df
    .groupby("player")
    .rolling("10d", on="match_date", closed="left")
    .apply(calc_regression, args=("stat", ))
)

I'm aware that this doesn't work because rolling.apply applies calc_regression on a column-by-column basis. However, I thought it might be useful to show what I'm trying to do.
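
To illustrate the column-by-column behaviour, here is a tiny standalone demonstration on a toy frame (not the data above): `rolling.apply` invokes the function once per column, so it can never see `match_date` and `stat` together.

```python
import pandas as pd

# Toy frame with two columns; rolling.apply calls the function
# separately on each column's window.
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
windows = []
demo.rolling(2).apply(lambda s: windows.append(list(s)) or 0.0, raw=False)

# Four single-column windows were seen: two from "a", two from "b" -
# never a window containing both columns at once.
print(windows)
```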

In the real DataFrame there are 3m rows and 80k players so a for loop will take an age - is there an alternative?

Jossy

2 Answers


Since I don't have your data, what I will give you is a general function that will:

  • group by a given column
  • find the parameters that approximate the column given by the y parameter as a linear function of the column given by the x parameter, based on the formulas from Wikipedia's Simple linear regression page

The formulas in question are the standard least-squares estimates:

  a = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
  b = mean(y) - a * mean(x)

def grouped_rolling_linear_regression(df, window=10, groups='l', x='x', y='y'):
    '''
    Computes linear regression parameters for a rolling window
    within grouped data.
     - df: the input data frame
     - window: the rolling window size
     - groups: the column to group by
     - x, y: the names of the predictor and response columns

    Returns a dataframe with columns a, b, so that
      a*x + b ~ y
    '''
    y = df[y]
    x = df[x]
    terms = pd.DataFrame({groups: df[groups], 'y': y, 'x': x, 'xx': x*x, 'xy': x*y})
    terms[1] = 1  # column of ones; its rolling sum is the window count n
    s = terms.groupby(groups).rolling(window).sum()
    # compute sum((x - mean(x)) * (y - mean(y)))
    Rxy = s['xy'] - s['x'] * s['y'] / s[1]
    # compute sum((x - mean(x))**2)
    Rxx = s['xx'] - s['x']**2 / s[1]
    a = Rxy / Rxx
    b = (s['y'] - a * s['x']) / s[1]
    return pd.DataFrame({'a': a, 'b': b}).dropna()

Validation

You can check that it works with a small example like this

s = pd.DataFrame([{'l':1, 'x': i, 'y': i} for i in range(100)] + 
                 [{'l':2, 'x': i, 'y': -3.14*i + 482} for i in range(50)])

grouped_rolling_linear_regression(s)

This data has two groups, each with one equation, one has 100 samples, the other has 50 samples.
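
Concretely, every complete window in group 1 should recover a ≈ 1, b ≈ 0, and every window in group 2 should recover a ≈ -3.14, b ≈ 482. A self-contained check, using a condensed inline version of the same closed-form rolling sums with the test data above:

```python
import numpy as np
import pandas as pd

# Same synthetic data as the validation above: two groups, each exactly linear.
s = pd.DataFrame([{'l': 1, 'x': i, 'y': i} for i in range(100)] +
                 [{'l': 2, 'x': i, 'y': -3.14 * i + 482} for i in range(50)])

# Closed-form rolling regression via rolling sums (the same identities
# used by grouped_rolling_linear_regression).
terms = pd.DataFrame({'l': s['l'], 'y': s['y'], 'x': s['x'],
                      'xx': s['x'] ** 2, 'xy': s['x'] * s['y'], 'n': 1})
t = terms.groupby('l').rolling(10).sum()
a = (t['xy'] - t['x'] * t['y'] / t['n']) / (t['xx'] - t['x'] ** 2 / t['n'])
b = (t['y'] - a * t['x']) / t['n']
out = pd.DataFrame({'a': a, 'b': b}).dropna()

# Each group's windows all recover that group's generating line.
assert np.allclose(out.loc[1, 'a'], 1.0) and np.allclose(out.loc[1, 'b'], 0.0)
assert np.allclose(out.loc[2, 'a'], -3.14) and np.allclose(out.loc[2, 'b'], 482.0)
```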

Performance

Since you mentioned that your concern was about taking ages to compute everything having 3M samples in 80k groups, I ran one test with random data. The generated data has 3.2M samples in 80k groups.

nplayers = 80000
a, b = np.random.randn(2, 1, 40)
x = np.random.randn(nplayers, 40)
l = np.arange(nplayers).reshape(-1, 1).repeat(40, axis=1)
y = a*x + b
s = pd.DataFrame()
for k, v in [('x', x), ('y', y), ('l', l)]:
    s[k] = v.reshape(-1)

Using the default window size of 10, this will compute 2.4M valid linear regressions; here it finished in 19 seconds.

u = grouped_rolling_linear_regression(s)

Applied to your specific problem

The only thing you have to do in order to run the regression with your data is to create an additional numeric column holding the difference from each date to some fixed initial date, for instance the earliest date in your data frame.

import pandas as pd
import datetime as dt
dates = [
    dt.datetime(2022, 1, 1),
    dt.datetime(2022, 1, 3),
    dt.datetime(2022, 1, 10),
    dt.datetime(2022, 1, 11),
    dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1 for i in range(5)] + [2 for i in range(5)],
        "match_date": dates + dates,
        "stat": (1500, 1600, 1500, 1800, 2100, 2100, 1800, 1600, 1500, 1500)
    }
)

df['numeric-days'] = (df["match_date"] - min(df['match_date'])) / pd.to_timedelta(1, 'd')
grouped_rolling_linear_regression(df, x='numeric-days', y='stat', groups='player', window=4)

Given that the data has two groups with 5 entries each, and I chose a window of 4, it will fit two different lines within each group.

Bob
  • Hi, thanks. Few comments... The window doesn't take account of the time series? This looks like the function creates a DataFrame `s` based upon the rolling sum of the input DataFrame `df` - the regression is then worked out on the sums which isn't what I was looking for. The regression needs to be applied to the raw data within each window with the Y being the `stat` column and X being a calculation of the days elapsed since the most recent match in the window. There's data in my example if that would help? I'll build a loop version with some real data so that there's something to aim at... – Jossy Jan 21 '22 at 17:11
  • Notice that I am constructing `terms` based on `rolling(window).sum()`; it will have one pair of coefficients (a, b) for each window. The thing that I am not doing is transforming the datetime to a numeric value that you can use in calculations. – Bob Jan 21 '22 at 17:24
  • Another thing, the example data I gave have the same coefficients because all the data in a group satisfy the same linear expression. – Bob Jan 21 '22 at 18:10
  • Hi. The datetime part is key as the days elapsed needs to be calculated within each window relative to the most recent match. I've updated my question with some actual data and more of an explanation. Hope this helps. – Jossy Jan 21 '22 at 18:36
  • Sorry, I assumed you would be able to do that part by yourself. I've added an example using your data. – Bob Jan 25 '22 at 11:12
  • Hi. Still not sure this does what I need, as it works on the basis of taking a single initial date from which to calculate `numeric-days`. This calculation has to run inside every rolling window, not once on the overall dataset. I refer to the output I'm looking for in my post, which is different to the output from your suggested function? – Jossy Jan 25 '22 at 14:50

I managed to adapt a useful Q&A here on how to work with groupby, rolling and apply all together, across combinations of single and multiple input and output columns. Along the way I also discovered a post on np.polyfit for the regression, which was useful. My final code:

def resample_and_roll(df: pd.DataFrame):
    # Keep the original index around, then index by date so that a
    # time-based ("11d") rolling window can be used.
    df.reset_index(inplace=True)
    df.set_index("match_date", drop=False, inplace=True)
    df["roll_regression"] = (
        df["stat"]
        .rolling("11d", min_periods=3)  # current day plus the 10 days before it
        .apply(calc_regression, args=(df, ), raw=False)
    )
    df.set_index("index", inplace=True)
    return df["roll_regression"]


def calc_regression(s: pd.Series, df: pd.DataFrame):
    # Drop the current (last) row: the window must exclude the
    # statistic of the row the result is recorded against.
    y_values = s[~np.isnan(s)][:-1].values
    match_date = df.loc[s.index, "match_date"]
    # Days elapsed from the current match date back to each prior match.
    days_diff = match_date.diff().dt.days.shift(-1)
    days_elapsed = days_diff.loc[::-1].cumsum()[::-1].values[:-1]
    # Momentum = slope of the least-squares fit.
    return np.polyfit(days_elapsed, y_values, deg=1)[0]


df["momentum"] = df.groupby("player").apply(resample_and_roll).droplevel(0)

Which outputs:

   id  player match_date  stat    momentum
0   1       1 2022-01-01  1500         NaN
1   2       1 2022-01-03  1600         NaN
2   3       1 2022-01-10  1500  -50.000000
3   4       1 2022-01-11  1800    3.731343
4   5       1 2022-01-18  2100 -300.000000
5   6       2 2022-01-01  2100         NaN
6   7       2 2022-01-03  1800         NaN
7   8       2 2022-01-10  1600  150.000000
8   9       2 2022-01-11  1500   48.507463
9  10       2 2022-01-18  1500  100.000000
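
The non-trivial value for player 2 (id 9) can likewise be checked by hand: its window holds ids 6-8 with days elapsed [10, 8, 1] and stats [2100, 1800, 1600]. A quick standalone sketch with `np.polyfit`:

```python
import numpy as np

# Window used for id 9 (match_date 2022-01-11, player 2): ids 6-8 only.
days_elapsed = np.array([10, 8, 1])   # 2022-01-11 minus each prior match date
stats = np.array([2100, 1800, 1600])

slope = np.polyfit(days_elapsed, stats, deg=1)[0]
print(round(slope, 6))  # 48.507463
```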
Jossy