2

I am new to R and am trying to run a linear regression on multiple subsets ("Cases") of data in a single file. I have 50 different cases, so I don't want to have to run 50 different regressions...be nice to automate this. I have found and experimented with the ddply method, but this, for some reason, returns the same coefficients to me for each case. Code I'm using is as follows:

ddply(MyData, "Case", function(x) coefficients(lm(Y~X1+X2+X3, MyData)))

Results I get, again, are the same coefficients for each "Case". Any ideas on how I can improve my code so that the regression runs once for each case and gives me unique coefficients for each case?

Carter
  • 179
  • 3
  • 8

3 Answers3

7

ddply passes data.frames (from splitting the input data.frame) to the function. You probably want this:

ddply(MyData, "Case", function(df) coefficients(lm(Y~X1+X2+X3, data=df)))

(Not tested since you don't provide a reproducible example.)

You passed the whole input data.frame to lm for each group..

Roland
  • 127,288
  • 10
  • 191
  • 288
  • Awesome! It worked beautifully. Now...what just happened?! What does `df` mean when inside `function()`? I'm guessing it represents just the data that corresponds to each case? – Carter Aug 02 '13 at 20:35
  • I just named the function argument `df` to highlight that it is a data.frame. You could call it `x` if you prefer that. – Roland Aug 02 '13 at 20:36
  • 1
    I think you should give credit to @Roland in your identical question here http://stats.stackexchange.com/questions/66390/regression-by-subset-in-r, not just 'Found the solution', and link to this answer. – Henrik Aug 02 '13 at 20:59
  • Thanks for keeping me honest, Henrik. Credit is attributed to Roland on the Cross Validated post. – Carter Aug 02 '13 at 23:14
6

You can also do it using only base functions:

# load data
data(warpbreaks)

# fit linear model to each subset
fits <- by(warpbreaks, warpbreaks[,"tension"],
           function(x) lm(breaks ~ wool, data = x))

# Combine coefficients from each model
do.call("rbind", lapply(fits, coef))
Michael Malick
  • 1,838
  • 12
  • 6
  • Thanks, Michael. This also did a great job. Now I'm attempting to append the coefficients from each subset to the original table. (Obviously, these would be duplicates within each case) Any tips? – Carter Aug 03 '13 at 00:00
  • If by append you mean add a column to the original data frame giving the coefficients of each subset you can use the `merge()` function in base R or the `join()` function in the plyr package. – Michael Malick Aug 03 '13 at 01:31
  • 1
    Correct, Michael. I was looking to add a column to the original data frame with the model coefficients. Both functions worked perfectly. I'm slowly learning the in's and out's of this language! – Carter Aug 06 '13 at 13:56
0

Though you can get pretty far with (d)plyr, I get the most flexibility with a simple for loop (as I’m not too confident with the do() when it comes to wanting more than just the output of coefficients(), for example)

Let’s start by loading the data and loading some packages:

library(dplyr)
library(broom) # get lm output (coefficients) as a dataframe
library(reshape2) # Melt / DCAST for swapping between wide / long format
data(warpbreaks)  

Next, create a list of groups we’re going to be iterating over.

GROUPS <- unique(warpbreaks$tension) # Specify groups, one unique model per group.

The code to answer your question, in this case, would be something of the sorts:

for (i in 1:length(GROUPS)){
  CURRENT_GROUP <- GROUPS[i]
  df <- filter(warpbreaks, tension==CURRENT_GROUP) # subset the dataframe
  fit <- lm(breaks ~ wool, data = df) # Build a model
  coeff <- tidy(fit) # Get a pretty data frame of the coefficients & p values
  coeff <- coeff[,c(1,2,5)] # Extract P.Value & Estimate

  # Rename (intercept) to INT for pretty column names
  coeff[coeff$term=="(Intercept)", ]$term <- "INT"

  # Make it into wide format with reshape2 package.
  coeff <- coeff %>% melt(id.vars=c("term"))

  # Defactor the resulting data.frame
  coeff <- mutate(coeff,
                  variable=as.character(variable),
                  term=as.character(term))

  # Rename for prettier column names later
  coeff[coeff$variable=="estimate", ]$variable <- "Beta"
  coeff[coeff$variable=="p.value", ]$variable <- "P"

  coeff <- dcast(coeff, .~term+variable)[,-1]

  rsquared <- summary(fit)$r.squared

  # Create a df out of what we just did.
  row <- cbind(
    data.frame(
      group=CURRENT_GROUP,
      rsquared=rsquared),
    coeff
  )

  # If first iteration, create data.frame -- otherwise: rowbind
  if (i==1){
    RESULT_ROW = row
  }
  else{
    RESULT_ROW = rbind(RESULT_ROW, row)
  } # End if.
} # End for loop

When writing the inner for loop code, just test something out by simulating the loop (first run i <- 1, run the inner loop code step by step, then run i <- 2 etc.). The resulting dataframe:

RESULT_ROW  
##   group   rsquared INT_Beta        INT_P woolB_Beta    woolB_P
## 1     L 0.26107601 44.55556 9.010046e-08 -16.333333 0.03023435
## 2     M 0.07263228 24.00000 5.991177e-07   4.777778 0.27947884
## 3     H 0.12666292 24.55556 9.234773e-08  -5.777778 0.14719571

I’m not saying this is better than the answers described above, I just like the fact that this gives me the flexibility to extract anything from any model (not just lm models that have a pretty coefficients() function).

MattV
  • 1,353
  • 18
  • 42
  • The for loop approach is *extremely* slow for large data frames with many groups, because a) you're running a separate filter on every group (try `library(hflights); for (g in unique(hflights$TailNum)) filter(hflights, TailNum == g)`: ~35 seconds and it's not even doing anything but filter), and b) a series of sequential `rbind`s is slow because it has to copy the contents of the data frame to a new one every time. Thus this is not a good habit to get into. – David Robinson Apr 15 '15 at 12:32
  • 1
    Secondly, it isn't necessary. Why not just write your own function and use `do`? That could be done with `func <- function(df) {`, then everything between your `fit <- lm(...` line to `row <- cbind(`, then `return(row) }`. Then you could do `warpbreaks %>% group_by(tension) %>% do(func(.))` – David Robinson Apr 15 '15 at 12:34
  • I can definitely see how it's not efficient at all. Could you provide an example of how you'd do the same as what I just posted, without the for loop? Would appreciate the learning experience. – MattV Apr 15 '15 at 12:35
  • certainly, please see my second comment :) – David Robinson Apr 15 '15 at 12:38
  • @David Robinson how does your suggestion work with the`group=CURRENT_GROUP` argument of the code you suggest to include in `row <- cbind(data.frame( group=CURRENT_GROUP,rsquared=rsquared), coeff)` – Danielle Jul 19 '17 at 04:52