I have imported a CSV with three columns: two hold Y and X, and the third identifies the category for X (I have 20 groups/categories). I can run a regression on the whole data set, but I want to run a separate regression for each of the 20 categories and store the coefficients.

I tried the following:

list2env(split(sample, sample$CATEGORY_DESC), envir = .GlobalEnv)

Now I have 20 data frames, one per category. How do I run a regression on each of them and store the coefficients somewhere?

sai
  • Is it absolutely necessary to deal with 20 different files? Consider merging them into a single data-frame (each with a unique identifier) and use available functions like `lmList` (in package nlme) to run all the lm simultaneously. Way easier to get the coeff that way. – Adam Quek May 18 '16 at 00:59

3 Answers

Since no data was provided, I am generating some sample data to show how you can run multiple regressions and store the output using the dplyr and broom packages.

In the following, there are 20 groups, each with its own x/y values. One regression is run per group (20 in total), and the results are returned as a single data frame:

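library(dplyr)
library(broom)

# Sample data: 20 groups with 10 observations each
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))

# Fit one model per group and stack the tidied coefficient tables.
# Note: x ~ y regresses x on y; use y ~ x to regress y on x.
df %>% group_by(group) %>% do(tidy(lm(x ~ y, data = .)))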

Sample output:

Source: local data frame [40 x 6]
Groups: group [20]

   group        term    estimate std.error  statistic     p.value
   <int>       <chr>       <dbl>     <dbl>      <dbl>       <dbl>
1      1 (Intercept)  0.42679228 1.0110422  0.4221310 0.684045203
2      1           y  0.45625124 0.7913256  0.5765657 0.580089051
3      2 (Intercept)  1.99367392 0.4731639  4.2134955 0.002941805
4      2           y  0.05101438 0.1909607  0.2671460 0.796114398
5      3 (Intercept)  3.14391308 0.8417638  3.7349114 0.005747126
6      3           y  0.08418715 0.2453441  0.3431391 0.740336702
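
If dplyr or broom is unavailable, or errors out on an older install, a similar long table can be assembled in base R from `summary(fit)$coefficients`. This is only a sketch under assumptions: the names `coef_rows` and `coef_table` are made up for illustration, and `set.seed(1)` is added purely for reproducibility, so the numbers will not match the sample output above.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))

# Fit one model per group and turn each coefficient matrix into rows
coef_rows <- lapply(split(df, df$group), function(d) {
  tab <- summary(lm(x ~ y, data = d))$coefficients
  data.frame(group     = d$group[1],
             term      = rownames(tab),
             estimate  = tab[, "Estimate"],
             std.error = tab[, "Std. Error"],
             statistic = tab[, "t value"],
             p.value   = tab[, "Pr(>|t|)"],
             row.names = NULL)
})

# Stack the per-group pieces into one data frame (two rows per group)
coef_table <- do.call(rbind, coef_rows)
rownames(coef_table) <- NULL
head(coef_table)
```

The resulting data frame has the same columns as the tidied output, so it can be filtered or written out in the same way.
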
Gopala
  • I am getting the following error- Error: invalid subscript type 'integer' – sai May 18 '16 at 01:19
  • `> packageVersion('dplyr') [1] ‘0.4.3.9001’ > packageVersion('broom') [1] ‘0.4.0’` – Gopala May 18 '16 at 01:32
  • If you are using an older version of `dplyr`, you may want to upgrade to the dev version, as the older release has some bugs. – Gopala May 18 '16 at 01:33

Quick solution with lmList (package nlme):

library(nlme)
lmList(x ~ y | group, data=df)

    Call:
  Model: x ~ y | group 
   Data: df 

Coefficients:
   (Intercept)           y
1    0.4786373  0.04978624
2    3.5125369 -0.94751894
3    2.7429958 -0.01208329
4   -5.2231576  2.24589181
5    5.6370824 -0.24223131
6    7.1785581 -0.08077726
7    8.2060808 -0.18283134
8    8.9072851 -0.13090764
9   10.1974577 -0.18514527
10   6.0687105  0.37396911
11   9.0682622  0.23469187
12  15.1081915 -0.29234452
13  17.3147636 -0.30306692
14  13.1352411  0.05873189
15   6.4006623  0.57619151
16  25.4454182 -0.59535396
17  22.0231916 -0.30073768
18  27.7317267 -0.54651597
19  10.9689733  0.45280604
20  23.3495704 -0.14488522

Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226

The data frame df is borrowed from @Gopala's answer.
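
To store the coefficients rather than just print them, `coef()` can be applied to the fitted `lmList` object. The following is a minimal sketch, assuming the same simulated data; the names `fits` and `coef_df` are illustrative, and the seed is added only for reproducibility.

```r
library(nlme)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))

# One lm per group, fitted in a single call
fits <- lmList(x ~ y | group, data = df)

# coef() returns one row per group; keep it as a plain data frame
coef_df <- as.data.frame(coef(fits))
coef_df$group <- rownames(coef_df)  # group labels are stored as row names
head(coef_df)
```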

Adam Quek
  • I am seeing the opposite effect when I dig deeper compared with the overall results. What is that called? I am blanking on the name of the effect! – sai May 19 '16 at 07:08

Consider also a base solution with lapply():

# g is the group value; calling it x would shadow the column name used in the formula
regressionList <- lapply(unique(df$group),
                         function(g) lm(x ~ y, data = df[df$group == g, ]))

And only the coefficients:

coeffList <- lapply(unique(df$group),
                    function(g) coef(lm(x ~ y, data = df[df$group == g, ])))

Or even a list of summaries:

summaryList <- lapply(unique(df$group),
                      function(g) summary(lm(x ~ y, data = df[df$group == g, ])))
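
The lists above are unnamed, so it is not obvious which element belongs to which group. One possible way to store the coefficients in a single table is sketched below; `groups`, `coefMat`, and `coefDF` are illustrative names, and the sample data and seed are assumptions added so the snippet runs on its own.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(group = rep(1:20, 10),
                 x = rep(1:20, 10) + rnorm(200),
                 y = rep(1:20, 10) + rnorm(200))

groups <- unique(df$group)

# One coefficient vector per group, named by group value
coeffList <- lapply(groups,
                    function(g) coef(lm(x ~ y, data = df[df$group == g, ])))
names(coeffList) <- groups

# Bind the vectors into a 20 x 2 matrix, then into a data frame
coefMat <- do.call(rbind, coeffList)
coefDF <- data.frame(group = groups, coefMat,
                     check.names = FALSE, row.names = NULL)
head(coefDF)
```

With `check.names = FALSE` the `(Intercept)` column keeps its original name instead of being converted to a syntactic one.
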
Parfait