Regression in R with Groups

Question

I have imported a CSV with three columns: two for Y and X, and a third that identifies the category of each observation (I have 20 groups/categories). I can run a regression at the overall level, but I want to run a separate regression for each of the 20 categories and store the coefficients.

I tried the following:

list2env(split(sample, sample$CATEGORY_DESC), envir = .GlobalEnv)

Now I have 20 data frames. How do I run a regression on each of them and store the coefficients somewhere?

Is it absolutely necessary to deal with 20 separate data frames? Consider keeping them in a single data frame (with a column that identifies the group) and using an available function such as `lmList` (in package nlme) to fit all the models at once. It is much easier to get the coefficients that way. – Adam Quek, May 18 '16 at 00:59

score 3 · Answer 1 · answered May 18 '16 at 01:05

Since no data was provided, I am generating some sample data to show how you can run multiple regressions and store output using dplyr and broom packages.

In the following, there are 20 groups and different x/y values per group. 20 regressions are run and output of these regressions is provided as a data frame:

library(dplyr)
library(broom)
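# Simulated data: 20 groups with 10 observations each
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))
# Fit one lm per group; tidy() turns each fit's coefficient table into rows
df %>% group_by(group) %>% do(tidy(lm(x ~ y, data = .)))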

Sample output:

Source: local data frame [40 x 6]
Groups: group [20]

   group        term    estimate std.error  statistic     p.value
   <int>       <chr>       <dbl>     <dbl>      <dbl>       <dbl>
1      1 (Intercept)  0.42679228 1.0110422  0.4221310 0.684045203
2      1           y  0.45625124 0.7913256  0.5765657 0.580089051
3      2 (Intercept)  1.99367392 0.4731639  4.2134955 0.002941805
4      2           y  0.05101438 0.1909607  0.2671460 0.796114398
5      3 (Intercept)  3.14391308 0.8417638  3.7349114 0.005747126
6      3           y  0.08418715 0.2453441  0.3431391 0.740336702
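For readers who do not have dplyr and broom available, a comparable long-format table can be sketched in base R. This is only a minimal sketch, not the method used in this answer: the helper name `coef_table`, the object `long_coefs`, and the fixed seed are assumptions introduced for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))

# Hypothetical helper: fit one model on a group's rows and return its
# coefficient table as a data frame with one row per term.
coef_table <- function(d) {
  cf <- summary(lm(x ~ y, data = d))$coefficients
  data.frame(group     = d$group[1],
             term      = rownames(cf),
             estimate  = cf[, "Estimate"],
             std.error = cf[, "Std. Error"],
             statistic = cf[, "t value"],
             p.value   = cf[, "Pr(>|t|)"],
             row.names = NULL)
}

# One data frame per group, stacked into a single table (2 rows per group)
long_coefs <- do.call(rbind, lapply(split(df, df$group), coef_table))
rownames(long_coefs) <- NULL
head(long_coefs)
```

The result has the same columns as the tidy() output, so it can be filtered on `term` to keep only the slopes.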

I am getting the following error: `Error: invalid subscript type 'integer'` – sai, May 18 '16 at 01:19
`> packageVersion('dplyr') [1] ‘0.4.3.9001’ > packageVersion('broom') [1] ‘0.4.0’` – Gopala, May 18 '16 at 01:32
If you are using an older version of `dplyr`, you may want to upgrade to the dev version, as the older release has some bugs. – Gopala, May 18 '16 at 01:33

score 2 · Accepted Answer · answered May 18 '16 at 01:23

Quick solution with lmList (package nlme):

library(nlme)
lmList(x ~ y | group, data=df)

Call:
  Model: x ~ y | group 
   Data: df 

Coefficients:
   (Intercept)           y
1    0.4786373  0.04978624
2    3.5125369 -0.94751894
3    2.7429958 -0.01208329
4   -5.2231576  2.24589181
5    5.6370824 -0.24223131
6    7.1785581 -0.08077726
7    8.2060808 -0.18283134
8    8.9072851 -0.13090764
9   10.1974577 -0.18514527
10   6.0687105  0.37396911
11   9.0682622  0.23469187
12  15.1081915 -0.29234452
13  17.3147636 -0.30306692
14  13.1352411  0.05873189
15   6.4006623  0.57619151
16  25.4454182 -0.59535396
17  22.0231916 -0.30073768
18  27.7317267 -0.54651597
19  10.9689733  0.45280604
20  23.3495704 -0.14488522

Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226

The data frame df is borrowed from @Gopala's answer.
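To store the coefficients rather than only print them, `coef()` can be applied to the fitted lmList object. The following is a minimal sketch on the same kind of simulated data; the names `fits` and `coefs` and the fixed seed are illustrative assumptions, and `group` is converted to a factor here purely as a precaution.

```r
library(nlme)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(group = factor(rep(1:20, 10)),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))

# One lm per level of group
fits <- lmList(x ~ y | group, data = df)

# coef() returns one row per group and one column per model term
coefs <- as.data.frame(coef(fits))

# Keep the group label as an ordinary column so it survives export
coefs$group <- rownames(coefs)
head(coefs)
```

From here the table can be saved with `write.csv()` or merged back onto other per-group data.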

I am seeing the opposite effect within the groups compared with the overall results. What is that called? I am blanking on the name of the effect! – sai, May 19 '16 at 07:08

score 0 · Answer 3 · Parfait

Consider also a base solution with lapply():

# g is the group id; it is named g so it is not confused with the column x
regressionList <- lapply(unique(df$group),
                         function(g) lm(x ~ y, data = df[df$group == g, ]))

And only the coefficients:

coeffList <- lapply(unique(df$group),
                    function(g) lm(x ~ y, data = df[df$group == g, ])$coefficients)

Even list of summaries:

summaryList <- lapply(unique(df$group),
                      function(g) summary(lm(x ~ y, data = df[df$group == g, ])))
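The three lists can also be derived from a single set of fits, and splitting the data first gives every list element its group label as a name. The following is a minimal sketch under those assumptions; `by_group`, `coeffMatrix`, `r2`, and the fixed seed are names and choices introduced here for illustration.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))

# split() returns a list of data frames named by group
by_group <- split(df, df$group)

# Fit each model once
regressionList <- lapply(by_group, function(d) lm(x ~ y, data = d))

# Reuse the fits: a 20 x 2 matrix of coefficients, one row per group
coeffMatrix <- t(sapply(regressionList, coef))

# Summaries and R-squared values from the same fits
summaryList <- lapply(regressionList, summary)
r2 <- sapply(summaryList, function(s) s$r.squared)

coeffMatrix["7", ]  # intercept and slope for group 7
```

Because the list is named, results can be looked up by group label instead of by position.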

answered May 18 '16 at 03:40 · edited May 18 '16 at 03:45

It would be faster to run `coeffList <- lapply(regressionList, coef)` and `summaryList <- lapply(regressionList, summary)` instead. – Adam Quek May 18 '16 at 09:43
Indeed @AdamQuek. No need to re-run the models. All options available to OP! – Parfait May 18 '16 at 13:30
