Running several year linear regressions from a single dataframe in R and storing coefficients into new dataframe

Question

i have the following df:

 df <- data.frame(
      Company = rep(LETTERS[1:10], each = 10),
      Year = rep(2010:2019, times = 10),
      Var1 = runif(100, -1, 1),
      Var2 = runif(100, -1, 1),
      Var3 = runif(100, -1, 1),
      Var4 = runif(100, -1, 1),
      Var5 = runif(100, -1, 1)
    )

I am trying to run 10 ols regressions per year and adding all the coefficients, standard errors, t-statistic and p value into one big new dataframe. I came across this link and tried the following: Running several linear regressions from a single dataframe in R

coef_matrix <- df %>% group_by(Year) %>% do(tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df)))

However, this gives me this output:

head(coef_matrix,15)
# A tibble: 15 x 6
# Groups:   Year [3]
    Year term        estimate std.error statistic p.value
   <int> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1  2010 (Intercept)   0.0274    0.0591     0.463  0.644 
 2  2010 Var2         -0.0339    0.113     -0.301  0.764 
 3  2010 Var3         -0.0705    0.112     -0.628  0.532 
 4  2010 Var4          0.0663    0.0973     0.682  0.497 
 5  2010 Var5          0.201     0.0983     2.04   0.0440
 6  2011 (Intercept)   0.0274    0.0591     0.463  0.644 
 7  2011 Var2         -0.0339    0.113     -0.301  0.764 
 8  2011 Var3         -0.0705    0.112     -0.628  0.532 
 9  2011 Var4          0.0663    0.0973     0.682  0.497 
10  2011 Var5          0.201     0.0983     2.04   0.0440
11  2012 (Intercept)   0.0274    0.0591     0.463  0.644 
12  2012 Var2         -0.0339    0.113     -0.301  0.764 
13  2012 Var3         -0.0705    0.112     -0.628  0.532 
14  2012 Var4          0.0663    0.0973     0.682  0.497 
15  2012 Var5          0.201     0.0983     2.04   0.0440
>

I want to try and run 10 annual regressions for every year and then store all their coefficients, standard errors etcetera into a dataframe. Right now the coefficients are the same as if i would run a normal ols regression for all years together:

summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df))

Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.10859 -0.51344 -0.05597  0.49481  1.13573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.02736    0.05909   0.463    0.644  
Var2        -0.03392    0.11252  -0.301    0.764  
Var3        -0.07045    0.11223  -0.628    0.532  
Var4         0.06633    0.09730   0.682    0.497  
Var5         0.20064    0.09828   2.041    0.044 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5851 on 95 degrees of freedom
Multiple R-squared:  0.05331,   Adjusted R-squared:  0.01345 
F-statistic: 1.338 on 4 and 95 DF,  p-value: 0.2617

How do i change this? I thought the group_by function was supposed to run the annual regression, but it seems like i am making a mistake. So for every year, the coefficients, standard errors etcetera should be different (of course). So for 2010, the output should be this in the newly created df:

summary(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data=df[df$Year=="2010", ]))
Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df[df$Year == 
    "2010", ])

Residuals:
       1       11       21       31       41       51       61       71       81       91 
-0.04871 -0.08103 -0.79737  0.10159  0.18243  0.41049 -0.05951 -0.12389  0.18528  0.23072 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -0.3293     0.2223  -1.481   0.1986  
Var2          0.8657     0.3569   2.426   0.0597 .
Var3          0.5305     0.4735   1.120   0.3134  
Var4         -0.2685     0.2446  -1.097   0.3225  
Var5         -0.1988     0.3729  -0.533   0.6167  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4389 on 5 degrees of freedom
Multiple R-squared:  0.612, Adjusted R-squared:  0.3017 
F-statistic: 1.972 on 4 and 5 DF,  p-value: 0.2372

And this for 2011, and so on:

Call:
lm(formula = Var1 ~ Var2 + Var3 + Var4 + Var5, data = df[df$Year == 
    "2011", ])

Residuals:
        2        12        22        32        42        52        62        72        82        92 
-0.541020 -0.626232  0.627003  0.326570  0.450827 -0.505888 -0.008373  0.264653  0.190374 -0.177914 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.23500    0.27901   0.842    0.438
Var2        -0.49973    0.59313  -0.843    0.438
Var3         0.21137    0.55453   0.381    0.719
Var4         0.34993    0.40960   0.854    0.432
Var5         0.09965    0.37772   0.264    0.802

Residual standard error: 0.597 on 5 degrees of freedom
Multiple R-squared:  0.2734,    Adjusted R-squared:  -0.3079 
F-statistic: 0.4703 on 4 and 5 DF,  p-value: 0.7577

Just change `data=df` to `data=.` in your creation of `coef_matrix` — langtang, May 29 '23 at 17:58
@langtang sometimes the use of `.` is discouraged whenever you are running a model. It can lead to inacurate results. eg try `lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, iris, Species == "setosa")` then try `iris %>% group_by(Species) %>% summarise(broom::tidy(lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, data= .))) %>% filter(Species == 'setosa')` Note that the coefficients are not the same. Now change the `.` to `cur_data()` and you will note that the coefficients are the same. Meaning the use of `.` leads to incorrect results — Onyambu, May 29 '23 at 18:05

score 2 · Answer 1 · answered May 29 '23 at 17:59

df %>%
  group_by(Year) %>% 
  summarise(broom::tidy(lm(Var1 ~ Var2 + Var3 + Var4 + Var5, data= cur_data_all())))

# A tibble: 50 × 6
# Groups:   Year [10]
    Year term        estimate std.error statistic p.value
   <int> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1  2010 (Intercept)    0.224     0.383     0.584   0.585
 2  2010 Var2           0.643     0.689     0.933   0.394
 3  2010 Var3          -0.251     0.512    -0.490   0.645
 4  2010 Var4          -0.386     0.653    -0.592   0.580
 5  2010 Var5          -0.222     0.388    -0.574   0.591
 6  2011 (Intercept)    0.243     0.270     0.898   0.410
 7  2011 Var2          -0.300     0.446    -0.673   0.531
 8  2011 Var3           0.401     0.418     0.960   0.381
 9  2011 Var4           0.424     0.330     1.29    0.255
10  2011 Var5          -0.465     0.291    -1.60    0.171

If you havedplyr>=1.1 then consider using reframe instead of summarise and use pick(Var1:Var5) instead of cur_data()

score 0 · Answer 2 · answered May 29 '23 at 19:23

Here is a data.table implementation:

library(data.table)

setDT(df)[, as.data.table(
  summary(lm(Var1~Var2+Var3+Var4+Var5, data=.SD))$coef,
  keep.rownames = T),
  Year
]

Output:

     Year          rn      Estimate Std. Error      t value   Pr(>|t|)
    <int>      <char>         <num>      <num>        <num>      <num>
 1:  2010 (Intercept) -0.0185139123  0.1745433 -0.106070604 0.91965050
 2:  2010        Var2  0.8925268150  0.3074344  2.903145262 0.03366718
 3:  2010        Var3  0.2056756465  0.2588893  0.794453941 0.46295473
 4:  2010        Var4  0.1415611201  0.2333638  0.606611315 0.57059491
 5:  2010        Var5  0.5847579888  0.2458589  2.378429109 0.06328800
 6:  2011 (Intercept)  0.0226610822  0.2697575  0.084005375 0.93631186
 7:  2011        Var2  0.1713025466  0.4501861  0.380514940 0.71919278
 8:  2011        Var3 -0.0365874385  0.4618825 -0.079213730 0.93993524
 9:  2011        Var4  0.4523682769  0.3970092  1.139440262 0.30614687
10:  2011        Var5  0.0754535192  0.4244909  0.177750631 0.86589569
11:  2012 (Intercept) -0.0959599437  0.2731497 -0.351308981 0.73967566
12:  2012        Var2  0.3776595156  0.6546532  0.576884860 0.58901564
13:  2012        Var3 -0.0061169165  0.3507799 -0.017438049 0.98676160
14:  2012        Var4  0.0859049128  0.5806072  0.147957021 0.88815829
15:  2012        Var5 -0.0853547099  0.4864285 -0.175472270 0.86759347
16:  2013 (Intercept) -0.1408782361  0.1712474 -0.822659140 0.44814357
17:  2013        Var2  0.1446858677  0.3215907  0.449906923 0.67161187
18:  2013        Var3 -0.2353407216  0.3858692 -0.609897700 0.56858084
19:  2013        Var4 -0.1236695218  0.3080081 -0.401513870 0.70462777
20:  2013        Var5 -0.6461539283  0.2538933 -2.544982676 0.05157997
21:  2014 (Intercept)  0.1457589874  0.2361635  0.617195160 0.56412468
22:  2014        Var2 -0.1902867534  0.6787563 -0.280346220 0.79044108
23:  2014        Var3 -0.0664099633  0.5026280 -0.132125479 0.90003734
24:  2014        Var4  0.3274313087  0.4057963  0.806885835 0.45638230
25:  2014        Var5  0.0009242792  0.5080055  0.001819428 0.99861867
26:  2015 (Intercept) -0.2835090756  0.1837970 -1.542512028 0.18359216
27:  2015        Var2 -0.7611936157  0.3350701 -2.271744145 0.07228084
28:  2015        Var3 -0.0460680232  0.3194645 -0.144203885 0.89097155
29:  2015        Var4 -0.4157749562  0.2315036 -1.795976121 0.13244194
30:  2015        Var5  0.0744453929  0.2760067  0.269723121 0.79815100
31:  2016 (Intercept)  0.3445679145  0.2215998  1.554910686 0.18068667
32:  2016        Var2  0.2026557534  0.4119664  0.491923044 0.64360822
33:  2016        Var3  0.1997847174  0.3087333  0.647110955 0.54609126
34:  2016        Var4  0.0680401474  0.2937252  0.231645560 0.82599500
35:  2016        Var5 -0.0948129954  0.3593450 -0.263849455 0.80242521
36:  2017 (Intercept)  0.1813009784  0.1807830  1.002865311 0.36196037
37:  2017        Var2 -0.1404884248  0.1961502 -0.716228893 0.50589908
38:  2017        Var3  0.3181799345  0.2697474  1.179547527 0.29124163
39:  2017        Var4  0.7311305687  0.3620538  2.019397582 0.09944668
40:  2017        Var5  0.1843115867  0.5512637  0.334343762 0.75168854
41:  2018 (Intercept)  0.3315016731  0.1445671  2.293063654 0.07037946
42:  2018        Var2 -0.5653688206  0.3228250 -1.751316579 0.14028416
43:  2018        Var3  0.2804616367  0.2291430  1.223958846 0.27549148
44:  2018        Var4 -0.1027345987  0.2652794 -0.387269471 0.71449263
45:  2018        Var5 -0.4464118180  0.2198853 -2.030203222 0.09808576
46:  2019 (Intercept)  0.1793316586  0.3700288  0.484642375 0.64841488
47:  2019        Var2 -0.1862647333  0.5215482 -0.357138118 0.73556723
48:  2019        Var3  0.3001795075  0.4801942  0.625121014 0.55931008
49:  2019        Var4  0.3334414516  0.6686747  0.498660191 0.63917790
50:  2019        Var5 -0.5558946009  0.5209731 -1.067031261 0.33474572
     Year          rn      Estimate Std. Error      t value   Pr(>|t|)

Running several year linear regressions from a single dataframe in R and storing coefficients into new dataframe

2 Answers2