I am working on a regression script. I have a data.frame with roughly 130 columns, and I need to regress one column (let's call it the X column) against each of the other ~100 numeric columns.

Before the regression is calculated, I need to group the data by 4 factors: myDat$Recipe, myDat$Step, myDat$Stage, and myDat$Prod, while still keeping the other ~100 columns and their row data attached for the regression. Then I need to fit a regression of each column on the X column (column ~ X) and print the R^2 value with the column name. This is what I've tried so far, but it is getting overly complicated and I know there has to be a better way.

rm(list=ls())
myDat <- read.csv(file="C:/Users/Documents/myDat.csv", header=TRUE, sep=",")

for(j in myDat$Recipe)
{
  myDatj <- subset(myDat, myDat$Recipe == j) 
  for(k in myDatj$Step)
  {
    myDatk <- subset(myDatj, myDatj$Step == k) 
    for(i in myDatk$Stage)
    {
      myDati <- subset(myDatk, myDatk$Stage == i)
      for(m in myDati$Prod)
      {
        myDatm <- subset(myDati, myDati$Prod == m)
        if(is.numeric(myDatm[3,i]))
        {
          fit <- lm(myDatk[,i] ~ X, data=myDatm)
          rsq <- summary(fit)$r.squared
          writeLines(paste(rsq,i,"\n"))
        }
      }
    }
  }  
}      
josliber
Jacob Odom
  • [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), also read about [apply](http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm). – zx8754 Jun 03 '15 at 14:44

1 Answer

You can do this by combining dplyr, tidyr and my broom package (you can install them with install.packages). First you need to gather all the numeric columns into a single column:

library(dplyr)
library(tidyr)
tidied <- myDat %>%
    gather(column, value, -X, -Recipe, -Step, -Stage, -Prod)

To understand what this does, you can read up on tidyr's gather operation. (This assumes that all columns besides X, Recipe, Step, Stage, and Prod are numeric and therefore should be predicted in your regression. If that's not the case, you need to remove them beforehand. You'll need to produce a reproducible example of the problem if you need a more customized solution).
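As a rough illustration of the reshaping step, here is a minimal sketch on made-up data. The toy data frame, its temp, flow and note columns, and the keys vector are all hypothetical stand-ins for myDat, and the where() helper assumes dplyr 1.0 or later. It first drops non-numeric columns other than the five key columns, then gathers the rest into long format:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for myDat (not the asker's real data)
toy <- data.frame(
  Recipe = c("A", "A", "B", "B"),
  Step   = c("s1", "s1", "s1", "s1"),
  Stage  = c("g1", "g1", "g1", "g1"),
  Prod   = c("p1", "p1", "p1", "p1"),
  X      = c(1.0, 2.0, 3.0, 4.0),
  temp   = c(10.1, 12.3, 9.8, 11.0),
  flow   = c(0.5, 0.7, 0.6, 0.9),
  note   = c("ok", "ok", "check", "ok"),  # non-numeric, should be dropped
  stringsAsFactors = FALSE
)

# Columns that must survive the reshape untouched
keys <- c("X", "Recipe", "Step", "Stage", "Prod")

tidied_toy <- toy %>%
  # keep the key columns plus every numeric column; "note" is removed here
  select(all_of(keys), where(is.numeric)) %>%
  # stack the remaining numeric columns into (column, value) pairs
  gather(column, value, -X, -Recipe, -Step, -Stage, -Prod)

head(tidied_toy)
```

Each original row now appears once per gathered column, with the column name stored in column and its number in value, so a single formula value ~ X can be fitted per group.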

Then perform each regression, while grouping by the column and the four grouping variables.

library(broom)

regressions <- tidied %>%
    group_by(column, Recipe, Step, Stage, Prod) %>%
    do(mod = lm(value ~ X, data = .))

glances <- regressions %>% glance(mod)

The resulting glances data frame will have one row for each combination of column, Recipe, Step, Stage, and Prod, along with an r.squared column containing the R-squared from each model. (It will also contain adj.r.squared, along with other columns such as the F-test p-value: see the help page ?glance.lm for more.) Running coefs <- regressions %>% tidy(mod) will probably also be useful for you, as it will get the coefficient estimates and p-values from each regression.
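As far as I know, newer broom releases (0.7.0 and later) no longer support calling glance() or tidy() directly on the rowwise result of do(). A minimal sketch of one alternative, assuming dplyr 0.8.1 or later for group_modify(), fits and summarises each model in one step. The toy data and its temp and flow columns are hypothetical:

```r
library(dplyr)
library(tidyr)
library(broom)

# Hypothetical toy data: two recipes, one Step/Stage/Prod combination each
set.seed(1)
toy <- data.frame(
  Recipe = rep(c("A", "B"), each = 6),
  Step = "s1", Stage = "g1", Prod = "p1",
  X = rnorm(12),
  stringsAsFactors = FALSE
)
toy$temp <- 2 * toy$X + rnorm(12, sd = 0.1)  # strongly related to X
toy$flow <- rnorm(12)                        # unrelated to X

glances_toy <- toy %>%
  gather(column, value, -X, -Recipe, -Step, -Stage, -Prod) %>%
  group_by(column, Recipe, Step, Stage, Prod) %>%
  # .x is the data for one group; glance() returns a one-row summary of the fit
  group_modify(~ glance(lm(value ~ X, data = .x))) %>%
  ungroup()

# One row per (column, Recipe, Step, Stage, Prod) combination
glances_toy %>% select(column, Recipe, r.squared)
```

Swapping glance() for tidy() inside group_modify() should give the per-coefficient estimates and p-values instead.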

A similar use case is described in the "broom and dplyr" vignette, and in Section 3.1 of the broom manuscript.

David Robinson
  • Have you ever run across the following error when using the glance line? "Error in data.frame(r.squared = r.squared, adj.r.squared = adj.r.squared, : object 'fstatistic' not found". augment and tidy work on the S3 object, just not glance. – Jacob Odom Jul 27 '15 at 17:51
  • @JacobOdom I haven't run across that. Would it be possible to create a reproducible example, and ask it as a new question? Please ping me when you do. – David Robinson Jul 27 '15 at 21:55
  • Just posted it [Here](http://stackoverflow.com/questions/31818475/broom-dplyr-error-with-glance-when-using-lm-instead-of-biglm)! – Jacob Odom Aug 04 '15 at 20:37