How to Loop linear regression repeatedly for n rows in R?

Question

I am currently working on a dataset that looks like this

Streamflow	x1	x2	x3	x4
145	24	25	43	44
198	30	45	66	54
285	32	39	54	47
...	...	...	...	..

all the way down 4408 rows in total. What I want to do is conduct linear regression of streamflow ~ x1 + x2 + x3 + x4 from row 1 to row 20, then from row 2 to row 21, and the from row 3 to row 22, till the end so that I can get a set of coefficient every run. I know that I might need forto start the loop, but I just cannot figure out how to have it work on every 20th rows. Any suggestion will be really appreciated. Thank you in advance.

for(i in 1:nrow(CFbasin)) {
  y <- CFbasin[i:(i+20), 2]
  x1 <- CFbasin[i:(i+20), 3]
  x2 <- CFbasin[i:(i+20), 4]
  mod_coef[i] <- coef(lm(y ~ x1 + x2))
}

So this is what I wrote and it doesn't give me the ideal results

You probably do not need a loop, but you should provide enough of your data using `dput()` that we can demonstrate how to do what you want. Also 4408 is not divisible by 20 so what happens to the leftover 8 rows of data? As a start, `idx <- rep(1:220, each=20)` would identify 220 groups of 20 lines each. — dcarlson, Nov 07 '21 at 19:44
Maybe I didn't say that clear enough, my bad, the reason why I think I need `for` loop is that I need 1:20 to regression, then 2:21, and then 3:22 and repeat this process so on. — Rua Jing, Nov 07 '21 at 19:47
see rolling regression https://stackoverflow.com/questions/23162937/the-rolling-regression-in-r-using-roll-apply/52457815 — Iman, Nov 07 '21 at 19:49

score 1 · Answer 1 · answered Nov 07 '21 at 20:02

You could do something like this, if I understand correctly what you mean:

Until this line mutate(group20rows = as.integer(gl(n(), 20, n())), .before=1) %>% it is only preparation of fake data
Create a sequence of 20 with your desired column.
group and apply group_split, then you have these groups of data each 20
apply map_dfr to iterate over each group your regression
use glance() from broom package to show all in one dataframe in a tidy from

library(tidyverse)
library(broom)

colnames1 <- c("streamflow", "x1", "x2", "x3", "x4")

iris %>% 
  select(-5) %>% 
  mutate(Sepal.Length1 = Sepal.Length) %>% 
  rename_with(~colnames1) %>% 
  mutate(streamflow=streamflow*100) %>% 
  mutate(group20rows = as.integer(gl(n(), 20, n())), .before=1) %>% 
  mutate(group20rows = as_factor(group20rows)) %>% 
  group_by(group20rows) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(streamflow ~ x1+x2+x3, data = df) %>% 
      glance() %>% 
      add_column(group20rows = unique(df$group20rows), .before=1)
  })

output:

  group20rows r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <fct>           <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 1               0.808         0.772  20.4     22.4  0.00000567      3  -86.5 183.   188.    6662.          16    20
2 2               0.335         0.210  26.2      2.69 0.0815          3  -91.4 193.   198.   10961.          16    20
3 3               0.873         0.850  32.0     36.8  0.000000207     3  -95.5 201.   206.   16420.          16    20
4 4               0.511         0.419  34.4      5.57 0.00819         3  -96.9 204.   209.   18919.          16    20
5 5               0.530         0.442  30.0      6.01 0.00609         3  -94.2 198.   203.   14417.          16    20
6 6               0.877         0.854  28.0     38.0  0.000000165     3  -92.8 196.   200.   12502.          16    20
7 7               0.787         0.747  32.5     19.7  0.0000128       3  -95.8 202.   207.   16895.          16    20
8 8               0.644         0.466  28.0      3.62 0.0845          3  -45.0  99.9  101.    4719.           6    10

score 1 · Accepted Answer · answered Nov 07 '21 at 20:28

Here is a base R approach using the iris data set also:

data(iris)
str(iris)
# 'data.frame': 150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Since there are 150 rows, we compute the number of rolling groups of 20 and use that to create a matrix with 20 rows and 131 columns listing the row numbers to be used in each regression:

rows <- nrow(iris)
last <- rows + 1 - 20
idx <- sapply(1:last, seq, length.out=20)
str(idx)
#  num [1:20, 1:131] 1 2 3 4 5 6 7 8 9 10 ...

So we have 131 columns and each column identifies a group of 20 rows for a regression. Now compute the 131 regressions and save the coefficients:

results <- lapply(1:131, function(x) lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris[idx[, x], ]))
coeffs <- t(sapply(results, coef))
head(coeffs)
#      (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,]  0.88165253   1.1027541    0.4335847  -1.3039612
# [2,]  0.64094220   1.1111668    0.6186075  -1.4860753
# [3,]  0.28030724   1.2120241    0.6477881  -1.7022181
# [4,] -0.01943516   1.1879500    0.8971728  -1.6773764
# [5,]  0.46106345   0.9888293    0.9230228  -0.8457783
# [6,]  0.92206667   0.9734378    0.5716684  -0.5058189

Each regression is stored as a list in results so that the first regression is results[[1]].

summary(results[[1]])
# 
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
#     data = iris[idx[, x], ])
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.26396 -0.17137 -0.00562  0.13582  0.36386 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.8817     0.6730   1.310    0.209    
# Sepal.Width    1.1028     0.1748   6.309 1.04e-05 ***
# Petal.Length   0.4336     0.3448   1.257    0.227    
# Petal.Width   -1.3040     0.7924  -1.646    0.119    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.204 on 16 degrees of freedom
# Multiple R-squared:  0.8078,  Adjusted R-squared:  0.7717 
# F-statistic: 22.41 on 3 and 16 DF,  p-value: 5.666e-06

Getting a statistic computed by summary is slightly more involved:

Rsq <- sapply(results, function(x) summary(x)$adj.r.squared)
# quantile(Rsq)
#        0%       25%       50%       75%      100% 
# 0.1635166 0.4471409 0.6298927 0.8417655 0.9278258

How to Loop linear regression repeatedly for n rows in R?

2 Answers2