rolling regression with a simple apply in pandas

Question

Consider this simple example

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
                   'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})

I am trying to perform a rolling regression of a on b. I am trying to use the simplest pandas tool available: apply. I want to use apply because I want to keep the flexibility of returning any parameter of the regression.

However, the simple code below does not work

df.rolling(10).apply(lambda x: smf.ols('a ~ b', data = x).fit())

  File "<string>", line 1, in <module>

PatsyError: Error evaluating factor: NameError: name 'b' is not defined
    a ~ b
    ^

What is the issue? Thanks!

So rolling apply will only perform the apply function to 1 column at a time, hence being unable to refer to multiple columns. `rolling` objects are iterable so you _could_ do something like `[smf.ols('a ~ b', data=x).fit() for x in df.rolling(10)]` but it's unclear what you want your results to be since this will just give a list/column of `RegressionResultsWrapper` objects. (also does not address issues like the first rolling would only have 1 value which would cause a ValueError) — Henry Ecker, Nov 14 '21 at 20:20
Interesting, thanks @HenryEcker. Let's say I want to return the coefficient on `b` and its t-stat and put these as two separate columns in the original `df` dataframe. How would you do that then? — ℕʘʘḆḽḘ, Nov 14 '21 at 20:22
I really don't know. I don't know how many results you expect to get. Rolling is going to start with scaling 1,2,3,4,...10 row chunks until it can start rolling 10 rows at a time. Which of these do you want to process? Which values from the `fit` are you looking for? There are more than a few ways to extract and process the results. It would be helpful if you could show extracting the results for a single group, then explain generally what the results should look like after. — Henry Ecker, Nov 14 '21 at 20:30
Like do you want 9 NaN rows at top or use the smaller collections as well? How do you want the coef and t? Just the intercept or also the b? do you want that in an equation formula? etc. etc. — Henry Ecker, Nov 14 '21 at 20:30
Yes, I would have 9 NaN rows at the top of the dataframe, because we don't have enough observations (we need 10). I am looking for the coefficient of `b` and its confidence interval (upper and lower bound). These should be available in `reg.params` and `reg.conf_int()`, I believe. Thanks!!! — ℕʘʘḆḽḘ, Nov 14 '21 at 20:34

score 2 · Accepted Answer · answered Nov 14 '21 at 20:44

Rolling.apply cannot work with multiple columns at once, and it can only return numeric scalars. Instead, we can take advantage of the fact that rolling objects are iterable. We also need to handle min_periods ourselves, because iterating over a rolling object yields every window, including the short ones at the start, regardless of min_periods.
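Both behaviours are easy to check on a toy frame. The sketch below is illustrative only: the small `toy` frame, the `record` helper and the `seen` list are made up for the demonstration, and it assumes pandas 1.1 or later, where rolling objects became iterable.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 3, 5, 7], 'b': [3, 5, 6, 2]})

seen = []  # what Rolling.apply hands to the callback


def record(window):
    # With the default raw=False each window arrives as a 1-D Series
    # holding values from a single column, never a DataFrame.
    seen.append((type(window).__name__, window.ndim))
    return window.sum()  # apply expects a single number back


toy.rolling(2).apply(record)
print(set(seen))

# Iterating instead yields DataFrame windows containing every column,
# including the short one-row window at the start.
shapes = [w.shape for w in toy.rolling(2)]
print(shapes)
```

Because the callback never sees `a` and `b` together, a formula such as `'a ~ b'` cannot be evaluated inside it, which matches the NameError in the question.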

We can then write a function that builds one row of output from each window's regression results, something like:

def process(x):
    if len(x) >= 10:
        reg = smf.ols('a ~ b', data=x).fit()
        return [
            # b from params
            reg.params['b'],
            # b from tvalues
            reg.tvalues['b'],
            # Both lower and upper b from conf_int()
            *reg.conf_int().loc['b', :].tolist()
        ]
    # Return NaN in the same dimension as the results
    return [np.nan] * 4


df = df.join(
    # join new DataFrame back to original
    pd.DataFrame(
        (process(x) for x in df.rolling(10)),
        columns=['coef', 't', 'lower', 'upper']
    )
)

df:

    a  b      coef         t     lower     upper
0   1  3       NaN       NaN       NaN       NaN
1   3  5       NaN       NaN       NaN       NaN
2   5  6       NaN       NaN       NaN       NaN
3   7  2       NaN       NaN       NaN       NaN
4   4  4       NaN       NaN       NaN       NaN
5   5  6       NaN       NaN       NaN       NaN
6   6  2       NaN       NaN       NaN       NaN
7   4  5       NaN       NaN       NaN       NaN
8   7  7       NaN       NaN       NaN       NaN
9   8  1 -0.216802 -0.602168 -1.047047  0.613442
10  9  9  0.042781  0.156592 -0.587217  0.672778
11  1  5  0.032086  0.097763 -0.724742  0.788913
12  3  3  0.113475  0.329006 -0.681872  0.908822
13  5  2  0.198582  0.600297 -0.564258  0.961421
14  7  5  0.203540  0.611002 -0.564646  0.971726
15  4  4  0.236599  0.686744 -0.557872  1.031069
16  5  3  0.293651  0.835945 -0.516403  1.103704
17  6  6  0.314286  0.936382 -0.459698  1.088269
18  4  4  0.276316  0.760812 -0.561191  1.113823
19  7  1  0.346491  1.028220 -0.430590  1.123572
20  8  1 -0.492424 -1.234601 -1.412181  0.427332
21  9  9  0.235075  0.879433 -0.381326  0.851476

Setup:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
    'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]
})
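One caveat with the join above: it aligns on index labels, and the new frame gets a default 0..n-1 index. If `df` has some other index (dates, for example), passing the original index to the new object keeps the rows lined up. A minimal sketch of that variant follows. The date index, `df2` and the `slope` helper are hypothetical, and `np.polyfit` stands in for `smf.ols` only to keep the example light on dependencies (the slope of a degree-1 fit is the same OLS coefficient on `b`).

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
    'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9],
}, index=pd.date_range('2021-01-01', periods=22, freq='D'))  # hypothetical index


def slope(window, min_rows=10):
    # Stand-in for the regression step: slope of a on b for one window.
    if len(window) < min_rows:
        return np.nan
    return np.polyfit(window['b'].to_numpy(), window['a'].to_numpy(), 1)[0]


coef = pd.Series(
    [slope(w) for w in df2.rolling(10)],
    index=df2.index,  # reuse the original labels so the join lines up
    name='coef',
)
out = df2.join(coef)
print(out.tail(3))
```

Without `index=df2.index` the join would find no matching labels and the `coef` column would come back entirely NaN.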

super smooth solution. thanks!! — ℕʘʘḆḽḘ, Nov 14 '21 at 20:57

Rodalm · Answer 2 · score 2 · 2021-11-14T21:25:03.093

Rolling.apply applies the rolling operation to each column separately (Related question).

Following user3226167's answer in a related thread, the easiest way to accomplish what you want seems to be RollingOLS.from_formula from statsmodels.regression.rolling.

from statsmodels.regression.rolling import RollingOLS

df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
                   'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})

model = RollingOLS.from_formula('a ~ b', data = df, window=10)

reg_obj = model.fit()

# estimated coefficient
b_coeff = reg_obj.params['b'].rename('coef')

# b t-value 
b_t_val = reg_obj.tvalues['b'].rename('t')

# 95 % confidence interval of b
b_conf_int = reg_obj.conf_int(cols=[1]).droplevel(level=0, axis=1)

# join all the desired information to the original df
df = df.join([b_coeff, b_t_val, b_conf_int])

Here reg_obj is a RollingRegressionResults object, which holds much more information about the regression (see its attributes in the statsmodels docs).

Output

>>> type(reg_obj)
<class 'statsmodels.regression.rolling.RollingRegressionResults'>

>>> df

    a  b      coef         t     lower     upper
0   1  3       NaN       NaN       NaN       NaN
1   3  5       NaN       NaN       NaN       NaN
2   5  6       NaN       NaN       NaN       NaN
3   7  2       NaN       NaN       NaN       NaN
4   4  4       NaN       NaN       NaN       NaN
5   5  6       NaN       NaN       NaN       NaN
6   6  2       NaN       NaN       NaN       NaN
7   4  5       NaN       NaN       NaN       NaN
8   7  7       NaN       NaN       NaN       NaN
9   8  1 -0.216802 -0.602168 -0.922460  0.488856
10  9  9  0.042781  0.156592 -0.492679  0.578240
11  1  5  0.032086  0.097763 -0.611172  0.675343
12  3  3  0.113475  0.329006 -0.562521  0.789472
13  5  2  0.198582  0.600297 -0.449786  0.846949
14  7  5  0.203540  0.611002 -0.449372  0.856452
15  4  4  0.236599  0.686744 -0.438653  0.911851
16  5  3  0.293651  0.835945 -0.394846  0.982147
17  6  6  0.314286  0.936382 -0.343553  0.972125
18  4  4  0.276316  0.760812 -0.435514  0.988146
19  7  1  0.346491  1.028220 -0.313981  1.006963
20  8  1 -0.492424 -1.234601 -1.274162  0.289313
21  9  9  0.235075  0.879433 -0.288829  0.758978
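The `lower` and `upper` columns here differ from those in the accepted answer, even though `coef` and `t` agree. The numbers are consistent with the two approaches using different critical values: a Student t quantile (8 degrees of freedom for a 10-row window with two parameters) in the `smf.ols` results, and a normal quantile in the `RollingOLS` results. As far as I can tell, `RollingOLS.fit` takes a `use_t` argument that defaults to `False`, which would explain it; treat that as an assumption to check against your statsmodels version. The arithmetic for row 9 can be reproduced from the printed values alone (the t quantile 2.306004 is hard-coded to avoid extra dependencies):

```python
from statistics import NormalDist

coef, t_stat = -0.216802, -0.602168   # row 9, identical in both tables
se = coef / t_stat                    # standard error implied by coef and t

z_crit = NormalDist().inv_cdf(0.975)  # about 1.96
t_crit = 2.306004                     # 97.5% quantile of Student t, 8 d.o.f.

normal_ci = (coef - z_crit * se, coef + z_crit * se)
t_ci = (coef - t_crit * se, coef + t_crit * se)

print(normal_ci)  # close to (-0.922460, 0.488856) from the RollingOLS table
print(t_ci)       # close to (-1.047047, 0.613442) from the smf.ols table
```

With only 10 observations per window the t-based interval is noticeably wider, so it matters which of the two you report.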

edited Nov 14 '21 at 21:25

answered Nov 14 '21 at 20:54


thanks! but how can I get back the parameters as in the other solution? – ℕʘʘḆḽḘ Nov 14 '21 at 20:56
thanks harry, but I meant other parameters like the confidence interval. Still useful, thanks! – ℕʘʘḆḽḘ Nov 14 '21 at 21:04
I really like this answer, but I can't seem to get it to work in 0.13.1... What version are you using of statsmodels? – Henry Ecker Nov 14 '21 at 21:08
@ℕʘʘḆḽḘ I updated the answer with the same columns as in Henry's solution. – Rodalm Nov 14 '21 at 21:27
@HenryEcker I don't know much about `statsmodels` honestly. My version is 0.12.2. What error do you get? – Rodalm Nov 14 '21 at 21:28
`AttributeError: 'NoneType' object has no attribute 'f_locals'` idk. I've been really scratching my head on this for a bit of time. – Henry Ecker Nov 14 '21 at 21:58
@HenryEcker on which line? – Rodalm Nov 14 '21 at 22:02
`model = RollingOLS.from_formula('a ~ b', data = df, window=10)` – Henry Ecker Nov 14 '21 at 22:15
@HenryEcker did you manage to solve it? I could reproduce the error by updating to 0.13.1. I was looking at the source code and it has to do with the formula parsing, but I didn't look too deeply into it. Anyway, it works if you pass `eval_env=-1` to `RollingOLS.from_formula`, don't ask me why ;) – Rodalm Nov 15 '21 at 02:59
No I never was able to fix it I just downgraded... – Henry Ecker Nov 15 '21 at 03:26
