rolling regression by group in the tidyverse?

Question

There are many questions about rolling regression in R, but here I am specifically looking for something that uses dplyr, broom and (if needed) purrr.

This is what makes this question different. I want to be tidyverse consistent. Is is possible to do a proper running regression with tidy tools such as purrr:map and dplyr?

Please consider this simple example:

library(dplyr)
library(purrr)
library(broom)
library(zoo)
library(lubridate)

mydata = data_frame('group' = c('a','a', 'a','a','b', 'b', 'b', 'b'),
                     'y' = c(1,2,3,4,2,3,4,5),
                     'x' = c(2,4,6,8,6,9,12,15),
                     'date' = c(ymd('2016-06-01', '2016-06-02', '2016-06-03', '2016-06-04',
                                    '2016-06-03', '2016-06-04', '2016-06-05','2016-06-06')))

  group     y     x date      
  <chr> <dbl> <dbl> <date>    
1 a      1.00  2.00 2016-06-01
2 a      2.00  4.00 2016-06-02
3 a      3.00  6.00 2016-06-03
4 a      4.00  8.00 2016-06-04
5 b      2.00  6.00 2016-06-03
6 b      3.00  9.00 2016-06-04
7 b      4.00 12.0  2016-06-05
8 b      5.00 15.0  2016-06-06

For each group (in this example, a or b):

compute the rolling regression of y on x over the last 2 observations.
store the coefficient of that rolling regression in a column of the dataframe.

Of course, as you can see, the rolling regression can only be computed for the last 2 rows in each group.

I have tried to use the following, but without success.

data %>% group_by(group) %>% 
  mutate(rolling_coef = do(tidy(rollapply(. ,
                    width=2, 
                    FUN = function(df) {t = lm(formula=y ~ x, 
                                              data = as.data.frame(df), 
                                              na.rm=TRUE); 
                    return(t$coef) },
                    by.column=FALSE, align="right"))))
Error in mutate_impl(.data, dots) : 
  Evaluation error: subscript out of bounds.
In addition: There were 21 warnings (use warnings() to see them)

Any ideas?

Expected output for the last two rows of the first a group is 0.5 and 0.5 (there is indeed a perfect linear correlation between y and x in this example)

More specifically:

mydata_1 <- mydata %>% filter(group == 'a',
                  row_number() %in% c(1,2))
# A tibble: 2 x 3
  group     y     x
  <chr> <dbl> <dbl>
1 a      1.00  2.00
2 a      2.00  4.00
> tidy(lm(y ~ x, mydata_1))['estimate'][2,]
[1] 0.5

and also

mydata_2 <- mydata %>% filter(group == 'a',
                              row_number() %in% c(2,3)) 
# A tibble: 2 x 3
  group     y     x
  <chr> <dbl> <dbl>
1 a      2.00  4.00
2 a      3.00  6.00
> tidy(lm(y ~ x, mydata_2))['estimate'][2,]
[1] 0.5

EDIT:

interesting follow-up to this question here rolling regression with confidence interval (tidyverse)

Possible duplicate of [Using rollapply and lm over multiple columns of data](https://stackoverflow.com/questions/33813627/using-rollapply-and-lm-over-multiple-columns-of-data) — Maurits Evers, Apr 10 '18 at 22:17
no duplicate at all. I know that SO question. What I am looking for is a `purrr`, tidyverse solution which is not available on the link you provide. — ℕʘʘḆḽḘ, Apr 10 '18 at 22:24

G. Grothendieck · Accepted Answer · 2018-04-11T15:49:45.313

13

Define a function Coef whose argument is formed from cbind(y, x) and which regresses y on x with an intercept, returning the coefficients. Then apply rollapplyr using the current and prior rows over each group. If by last you meant the 2 prior rows to the current row, i.e. exclude the current row, then replace 2 with list(-seq(2)) as an argument to rollapplyr.

Coef <- . %>% as.data.frame %>% lm %>% coef

mydata %>% 
  group_by(group) %>% 
  do(cbind(reg_col = select(., y, x) %>% rollapplyr(2, Coef, by.column = FALSE, fill = NA),
           date_col = select(., date))) %>%
  ungroup

giving:

# A tibble: 8 x 4
  group `reg_col.(Intercept)` reg_col.x date      
  <chr>                 <dbl>     <dbl> <date>    
1 a      NA                      NA     2016-06-01
2 a       0                       0.500 2016-06-02
3 a       0                       0.500 2016-06-03
4 a       0                       0.500 2016-06-04
5 b      NA                      NA     2016-06-03
6 b       0.00000000000000126     0.333 2016-06-04
7 b     - 0.00000000000000251     0.333 2016-06-05
8 b       0                       0.333 2016-06-06

Variation

A variation of the above would be:

mydata %>% 
       group_by(group) %>% 
       do(select(., date, y, x) %>% 
          read.zoo %>% 
          rollapplyr(2, Coef, by.column = FALSE, fill = NA) %>%
          fortify.zoo(names = "date")
       ) %>% 
       ungroup

Slope Only

If only the slope is needed there are further simplifications possible. We use the fact that the slope equals cov(x, y) / var(x).

slope <- . %>% { cov(.[, 2], .[, 1]) / var(.[, 2])}
mydata %>%
       group_by(group) %>%
       mutate(slope = rollapplyr(cbind(y, x), 2, slope, by.column = FALSE, fill = NA)) %>%
       ungroup

edited Apr 11 '18 at 15:49

answered Apr 11 '18 at 00:12

G. Grothendieck

254,981
17
203
341

really cool but I am puzzled by the crazy small coefficients here? Also, is it possible to have a proper column naming? – ℕʘʘḆḽḘ Apr 11 '18 at 00:23
1

The small coefficients are basically 0 and result from floating point approximations -- nothing to do with rollapply or dplyr. I updated the code and output a few times and the names should be ok now. – G. Grothendieck Apr 11 '18 at 00:25
Thanks Grothendieck. I now realize there is some important missing piece here (my fault). As is, the code does not allow me to merge back the regression results to the original dataset because the row index information is lost (we only select(y,x) in your code. I have updated the example dataframe in my question. Do you see an easy way to update your nice answer so that both the group AND the date appear in the output dataframe? That way we can merge back if needed. thanks!! – ℕʘʘḆḽḘ Apr 11 '18 at 01:42
I tried this, but I wonder if the data will always be correctly aligned here ```mydata %>% group_by(group) %>% do(cbind(reg_col = select(., y, x) %>% rollapply(2, Coef, by.column = FALSE, align = 'right', fill = NA), date_col = select(., date))) %>% ungroup ``` – ℕʘʘḆḽḘ Apr 11 '18 at 02:14
1

Yes, the data will be aligned if you use `fill=NA`. That fills in the otherwise missing rows with NA values. (Also suggest using `rollapplyr` with an r at the end as in my answer instead of `align = "right"`to keep code shorter.) – G. Grothendieck Apr 11 '18 at 02:30
crazy stuff I wonder which one is faster and more memory efficient with a large dataset – ℕʘʘḆḽḘ Apr 11 '18 at 02:49

johnson-shuffle · Answer 2 · 2018-04-10T23:19:59.743

2

This is more of an idea than an answer but maybe instead of using group_by try using map and your list of groups:

FUN <- function(g, df = NULL) {
  tmp <- tidy(rollapply(
    zoo(filter(df, group == g)),
    width = 2,
    FUN = function(z) {
      t <- lm(y ~ x, data = as.data.frame(z)) ; return(t$coef)
    },
    by.column = FALSE,
    align = "right"
    ))
  tmp$series <- c(rep('intercept', nrow(tmp) / 2), rep('slope', nrow(tmp) / 2))
  spread(tmp, series, value) %>% mutate(group = g)
}

map_dfr(list('a', 'b'), FUN, df = data)

edited Apr 10 '18 at 23:19

answered Apr 10 '18 at 22:26

johnson-shuffle

1,023
5
11

that is interesting. can you please try to make it work in that context? – ℕʘʘḆḽḘ Apr 10 '18 at 22:27
@ℕʘʘḆḽḘ I added a few extra lines to return a cleaner result. I'm honestly not familiar with `zoo`. The results seem odd. Shouldn't the slopes be 0.5 and 0.33? – johnson-shuffle Apr 10 '18 at 23:10
please have a look at the updated question. thanks!! – ℕʘʘḆḽḘ Apr 10 '18 at 23:55

Luke C · Answer 3 · 2018-04-11T17:18:33.790

2

Does this do what you're after?

data %>% 
  group_by(group) %>% 
  do(data.frame(., rolling_coef = c(NA, rollapply(data = ., width = 2, FUN = function(df_) {
    d = data.frame(df_)
    d[, 2:3] <- apply(d[,2:3], MARGIN = 2, FUN = as.numeric)
    mod = lm(y ~ x, data = d)
    return(coef(mod)[2])
  }, by.column = FALSE, align = "right"))))

Giving:

# A tibble: 8 x 4
# Groups:   group [2]
  group     y     x rolling_coef
  <chr> <dbl> <dbl>        <dbl>
1 a        1.    2.       NA    
2 a        2.    4.        0.500
3 a        3.    6.        0.500
4 a        4.    8.        0.500
5 b        2.    6.       NA    
6 b        3.    9.        0.333
7 b        4.   12.        0.333
8 b        5.   15.        0.333

Edit: Slightly modified code, but data_frame will not accept the . group placeholder as an argument- not sure how to fix that.

data %>% 
  group_by(group) %>% 
  do(data.frame(., rolling_coef = c(NA, rollapplyr(data = ., width = 2, FUN = function(df_) {
    mod = lm(y ~ x, data = .)
    return(coef(mod)[2])
  }, by.column = FALSE))))

Edit 2: Using fill = NA rather than using c(NA, ...) achieves the same result.

data %>% 
  group_by(group) %>% 
  do(data.frame(., rolling_coef = rollapplyr(data = ., width = 2, FUN = function(df_) {
    mod = lm(y ~ x, data = .)
    return(coef(mod)[2])
  }, by.column = FALSE, fill = NA)))

edited Apr 11 '18 at 17:18

answered Apr 11 '18 at 00:11

Luke C

10,081
1
14
21

I just have a few questions. what is the purpose of the `NA` here? why are you using `margin`? cant you use `dplyr::data_frame` and `purrr:map` instead of `data.frame` and `apply` in your code? – ℕʘʘḆḽḘ Apr 11 '18 at 01:41
@ℕʘʘḆḽḘ - The `NA` is present because the first observation has no "previous 2" observations. I had assumed you wanted the regression to be on rows 1 and 2, then on 2 and 3, then on 3 and 4. But, there is no such regression for the first row since the regression is not performed for row 0 and 1. I can't figure out how to get `data_frame` to work, but actually the `apply` is unnecessary (I forgot to drop it from my first attempt). See edit. – Luke C Apr 11 '18 at 03:05
ahh that's too bad... i dont think its safe to manually add this NA. I was hoping this could be automated. see Grothendieck s answer maybe? thanks anyway!! – ℕʘʘḆḽḘ Apr 11 '18 at 10:25
@ℕʘʘḆḽḘ I really like Grothendieck's answer, it's slick and I think provides more easily flexible output! As far as I understand it, using `c(NA, ...)` is functionally the same as using `fill = NA` in any case where you are rolling from the first row- shown in edit 2 above. – Luke C Apr 11 '18 at 17:22

score 2 · Answer 4 · answered Jul 08 '18 at 12:05

Here is a solution similar to G. Grothendieck's answer but using the rollRegres package. I have to increase the width argument to 3 to avoid an error (by the way, why do you want a regression with so few observations?)

library(rollRegres)
Coef <- . %>% { roll_regres.fit(x = cbind(1, .$x), y = .$y, width = 2L)$coefs }

mydata %>%
  group_by(group) %>%
  do(cbind(reg_col = select(., y, x) %>% Coef,
           date_col = select(., date))) %>%
  ungroup
#R  Error in mydata %>% group_by(group) %>% do(cbind(reg_col = select(., y,  :
#R    Assertion on 'width' failed: All elements must be >= 3.

# change width to avoid error
Coef <- . %>% { roll_regres.fit(x = cbind(1, .$x), y = .$y, width = 3L)$coefs }
mydata %>%
  group_by(group) %>%
  do(cbind(reg_col = select(., y, x) %>% Coef,
           date_col = select(., date))) %>%
    ungroup
#R # A tibble: 8 x 4
#R group  reg_col.1 reg_col.2 date
#R <chr>      <dbl>     <dbl> <date>
#R   1 a      NA           NA     2016-06-01
#R 2 a      NA           NA     2016-06-02
#R 3 a       1.54e-15     0.500 2016-06-03
#R 4 a      -5.13e-15     0.5   2016-06-04
#R 5 b      NA           NA     2016-06-03
#R 6 b      NA           NA     2016-06-04
#R 7 b      -3.08e-15     0.333 2016-06-05
#R 8 b      -4.62e-15     0.333 2016-06-06
#R Warning messages:
#R 1: In evalq((function (..., call. = TRUE, immediate. = FALSE, noBreaks. = FALSE,  :
#R    low sample size relative to number of parameters
#R 2: In evalq((function (..., call. = TRUE, immediate. = FALSE, noBreaks. = FALSE,  :
#R    low sample size relative to number of parameters

Have you figured out how to parallelize this? I got this to work for a dataframe with 68k observations but it's going to take an hour to run! — philiporlando, Sep 12 '18 at 23:56
You can use the `parallel` package and particularly `clusterCall` and `clusterEvalQ` kinda like [here](https://stackoverflow.com/a/31075231/5861244) to split the data sets into unique sets of `groups` on each process. — Benjamin Christoffersen, Sep 13 '18 at 07:03

rolling regression by group in the tidyverse?

4 Answers4

Variation

Slope Only

Linked