6

I have some time series data and I want to calculate a groupwise rolling regression of the last n days in Pandas and store the slope of that regression in a new column.

I searched the older questions and they either haven't been answered, or used Pandas OLS which I heard is deprecated.

I figured that I probably could use df.rolling.apply() in combination with the scipy.stats.linregress function, but I can't figure out a lambda function that does what I want to do.

Here is some sample code

import numpy as np
import pandas as pd
from scipy.stats import linregress

# make sample data
days = 21
groups = ['A', 'B', 'C']
data_days = list(range(days)) * len(groups)
values = np.random.rand(days*len(groups))

df = pd.DataFrame(data=zip(sorted(groups*days), data_days, values), 
                  columns=['group', 'day', 'value'])

# calculate slope of regression of last 7 days
days_back = 7

grouped_data = df.groupby('group')
for g, data in grouped_data:
    window = data.rolling(window=days_back,
                          min_periods=days_back)

I need a new column called 'slope' in which, from day 7 onward, the slope of a linear regression through the last 7 days is stored.

MattR0se
  • 344
  • 2
  • 8

1 Answers1

6

I had some wrong assumptions, first I don't need to loop through the groups, and second I didn't really understand how rolling.apply worked...

So here is the (seemingly) working code. I used the linregress function from scipy.stats:

import numpy as np
import pandas as pd
from scipy.stats import linregress

# create random sample data
days = 14
groups = ['A', 'B', 'C']
data_days = list(range(days)) * len(groups)
values = np.random.rand(days*len(groups))

df = pd.DataFrame(data=zip(sorted(groups*days), data_days, values), 
                  columns=['group', 'day', 'value'])

def get_slope(array):
    y = np.array(array)
    x = np.arange(len(y))
    slope, intercept, r_value, p_value, std_err = linregress(x,y)
    return slope


# calculate slope of regression of last 7 days
days_back = 3

df['rolling_slope'] = df.groupby('group')['value'].rolling(window=days_back,
                               min_periods=days_back).apply(get_slope, raw=False).reset_index(0, drop=True)

print(df)
MattR0se
  • 344
  • 2
  • 8
  • ".reset_index(0, drop=True)" seems don't work for me and df.groupby('group')['value'].rolling(window=days_back, min_periods=days_back).apply(get_slope, raw=False).reset_index().iloc[:,-1] works. But thank you all the same for all the previous part! Great stuff! – uniquegino Nov 18 '19 at 17:51
  • @MattR0se I'm facing the same issue than you and I would like to implement your solution. However, it seems that your solution implicitly assumes that the delay between two days is always of 1 day. How would you adapt your code if the duration between days is not constant? – RomB Mar 04 '21 at 14:33
  • Do you mean that you have days without measurements? I haven't tried it, but I think that it should work if you insert the missing days in your data and set those measurements to `None` – MattR0se Mar 08 '21 at 10:25
  • the slope can be calculated from cov(x, y) / var(x) and applied in a rolling function as shown here(R example). https://stackoverflow.com/questions/49762128/rolling-regression-by-group-in-the-tidyverse/49764546#49764546 – Kresten Jun 09 '21 at 08:31