2

I have a data frame that looks like this. The names and number of columns will NOT be consistent (sometimes 'C' will not be present, other times 'D', 'E', 'F' may be present, etc.). The only column that is always present is Y, and I want to regress Y on each of the other columns.

# name and number of columns varies...so need flexible process
Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
YABC <- data.frame(Y, A, B, C)

I want to loop through each variable and collect the output from each regression model.

This process creates the desired output, but only for this specific iteration.

model_A <- lm(Y ~ A, YABC)

ID <- 'A'
rsq <- summary(model_A)$r.squared
adj_rsq <- summary(model_A)$adj.r.squared
sig <- summary(model_A)$sigma

datA <- data.frame(ID, rsq, adj_rsq, sig)

model_B <- lm(Y ~ B, YABC)

ID <- 'B'
rsq <- summary(model_B)$r.squared
adj_rsq <- summary(model_B)$adj.r.squared
sig <- summary(model_B)$sigma

datB <- data.frame(ID, rsq, adj_rsq, sig)

model_C <- lm(Y ~ C, YABC)

ID <- 'C'
rsq <- summary(model_C)$r.squared
adj_rsq <- summary(model_C)$adj.r.squared
sig <- summary(model_C)$sigma

datC <- data.frame(ID, rsq, adj_rsq, sig)

output <- rbind(datA, datB, datC)

How can I wrap this in a loop or some other process that will account for varied number and name of columns? Here is my attempt...yes I know it's not right, just me conceptualizing the kind of capability I'd like.

# initialize data frame
output__ <- data.frame(ID__ = as.character(),
                     rsq__ = as.numeric(),
                     adj_rsq__ = as.numeric(),
                     sig__ = as.numeric())

# loop through A, then B, then C
for(i in A:C) {
  model_[i] <- lm(Y ~ [i], YABC)

  ID <- '[i]'
  rsq <- summary(model_[i])$r.squared
  adj_rsq <- summary(model_[i])$adj.r.squared
  sig <- summary(model_[i])$sigma
  data__temp <- (ID__, rsq__, adj_rsq__, sig__)
  data__ <- rbind(data__, data__temp)
}

Using @BigDataScientist's approach, here is the solution I went with.

# initialize data frame
data__ <- data.frame(ID__ = as.character(),
                     rsq__ = as.numeric(),
                     adj_rsq__ = as.numeric(),
                     sig__ = as.numeric())

# loop through every column except Y (assumed to be the first column)
for(char in names(YABC)[-1]){
  model <- lm(as.formula(paste("Y ~ ", char)), YABC)
  ID__ <- char
  rsq__ <- summary(model)$r.squared
  adj_rsq__ <- summary(model)$adj.r.squared
  sig__ <- summary(model)$sigma
  data__temp <- data.frame(ID__, rsq__, adj_rsq__, sig__)
  data__ <- rbind(data__, data__temp)

}
pyll

3 Answers

3

Here is a solution using *apply:

Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9 , 67, 6)
YABC <- data.frame(Y, A, B, C)

predictors <- colnames(YABC[-1])  # every column except Y (assumed to be first)

formulae <- sapply(predictors, function(x) as.formula(paste('Y ~', x)))

lapply(formulae, function(x) lm(x, data = YABC))

Of course you can also call summary:

lapply(formulae, function(x) summary(lm(x, data = YABC)))

If you want to extract variables from a specific model do as follows:

results <- lapply(formulae, function(x) lm(x, data = YABC))
results$A$coefficients

This gives the coefficients from the model that uses A as the explanatory variable.
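To get the one-row-per-variable table asked for in the question (ID, rsq, adj_rsq, sig), the same list of formulae can be fed through lapply and the rows stacked at the end. This is only a sketch: the column names are copied from the question, and the names `rows` and `output` are made up for illustration.

```r
# Same example data as in the question
Y <- c(4, 4, 3, 4, 3, 2, 3, 2, 2, 3, 4, 4, 3, 4, 8, 6, 5, 4, 3, 6)
A <- c(1, 2, 1, 2, 3, 2, 1, 1, 1, 2, 1, 4, 3, 1, 2, 2, 1, 2, 4, 8)
B <- c(5, 6, 6, 5, 3, 7, 2, 1, 1, 2, 7, 4, 7, 8, 5, 7, 6, 6, 4, 7)
C <- c(9, 1, 2, 2, 1, 4, 5, 6, 7, 8, 89, 9, 7, 6, 5, 6, 8, 9, 67, 6)
YABC <- data.frame(Y, A, B, C)

# One formula per predictor; sapply keeps the predictor names on the list
predictors <- colnames(YABC[-1])
formulae <- sapply(predictors, function(x) as.formula(paste("Y ~", x)))

# Fit each model, call summary() once, and keep only the fit statistics
rows <- lapply(names(formulae), function(id) {
  s <- summary(lm(formulae[[id]], data = YABC))
  data.frame(ID = id,
             rsq = s$r.squared,
             adj_rsq = s$adj.r.squared,
             sig = s$sigma,
             stringsAsFactors = FALSE)
})

# Stack the one-row data frames into a single table
output <- do.call(rbind, rows)
output
```

Building the rows in a list and calling rbind once at the end avoids growing a data frame inside a loop, which matters little for three columns but helps when there are many.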

Daniel Winkler
  • I didn't even think to use lapply, I need to learn more about that function. Thanks! – pyll May 12 '17 at 16:28
  • Have a look at this question, it got me started: http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega – Daniel Winkler May 12 '17 at 16:42
1

As written in the comment, as.formula() (see ?as.formula) is one solution. You could do something like:

model <- list()
for(char in names(YABC)[-1]) {
  model[[char]] <- lm(as.formula(paste("Y ~ ", char)), YABC)
}
model
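
A possible variation, not part of the answer as originally posted: if Y is not guaranteed to be the first column, or the column names are not syntactically valid R names, the predictors can be picked by name with setdiff() and the formula built with reformulate(). The data frame `dat` and its column names below are invented purely for illustration.

```r
# Hypothetical data: Y sits in the middle and one column name contains spaces
dat <- data.frame(
  weight        = c(60, 72, 55, 80, 68, 90, 77, 63),
  Y             = c(3, 4, 2, 5, 4, 6, 5, 3),
  height        = c(160, 175, 158, 182, 170, 185, 178, 165),
  `age (years)` = c(25, 31, 22, 40, 35, 52, 44, 28),
  check.names   = FALSE
)

# Exclude the response by name rather than by position
predictors <- setdiff(names(dat), "Y")

# Backticks protect names that are not syntactically valid;
# setNames() makes lapply return a list named after the predictors
models <- lapply(setNames(predictors, predictors), function(v) {
  lm(reformulate(sprintf("`%s`", v), response = "Y"), data = dat)
})

names(models)
coef(models[["age (years)"]])
```

Excluding Y by name means the loop does not depend on column order, so Y is never regressed on itself even if the data frame is rearranged.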
Tonio Liebrand
  • what if my variable names are weight, height, age, etc....would i just change to something like for(varname in names(YABC))? The piece I'm really struggling with is how to tell R to iterate through each variable...my variable names will not always be A, B C...they might be ZZZZZ, DF123, etc... – pyll May 12 '17 at 15:04
  • yes, well you would have to pay attention not to regress `Y` on itself :) So exclude Y with `names(YABC)[-1]`. Good suggestion though, i made an edit. – Tonio Liebrand May 12 '17 at 15:07
1

This is how I do this kind of modeling. The following example assumes I am varying the outcomes and the exposures for a given set of covariates.

I first define my outcomes and exposures I want to test (I think in terms of epidemiology but you can extend).

outcomes <- c("a","b","c","d")

exposures <- c("exp1","exp2","exp3")

The assumption is that each element specified in those vectors exists as a column name in your dataset (as do the covariates listed below after the "~").

final_lm_data <- data.frame() # initialize an empty data frame to hold the results
for (j in seq_along(exposures)) {
  for (i in seq_along(outcomes)) {
    # build the formula as a string: outcome ~ covariates + exposure
    mylm <- lm(formula(paste(outcomes[i], "~",
                             "continuous.cov.1 + continuous.cov.2 + factor(categorical.variable.1)",
                             "+", exposures[j])), data = mydata)

    # coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
    ctable <- as.data.frame(coef(summary(mylm)))

    mylm_data <- as.data.frame(cbind(ctable, Variable = rownames(ctable),
                                     Outcome = outcomes[i],
                                     Exposure = exposures[j],
                                     Model_N = paste(length(mylm$residuals))))
    names(mylm_data)[4] <- "Pvalue"  # renaming the "Pr(>|t|)"
    rownames(mylm_data) <- NULL # important because we are creating a stacked output dataset
    final_lm_data <- rbind(final_lm_data, mylm_data)
  }
}

This will give you a final_lm_data that contains the estimates, standard errors, t statistics, and p-values for each variable in your model, and it also keeps track of the Outcome and Exposure used in each iteration (the first and last elements of the model formula). Lastly, it has the N used after records with missing values are dropped. You can modify the mylm_data creation to capture more information from the model (such as R-squared).
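
As a rough sketch of that last point, here is the same loop with R-squared, adjusted R-squared, and sigma added to every row. It runs on the built-in mtcars data only so that it is self-contained; the choice of outcomes, exposures, and covariates is arbitrary, and the added column names (Rsq, Adj_Rsq, Sigma) are just suggestions.

```r
# Illustration only: mtcars stands in for your own dataset
mydata    <- mtcars
outcomes  <- c("mpg", "qsec")
exposures <- c("hp", "wt")

final_lm_data <- data.frame()
for (j in seq_along(exposures)) {
  for (i in seq_along(outcomes)) {
    mylm <- lm(formula(paste(outcomes[i], "~ disp + factor(cyl) +", exposures[j])),
               data = mydata)
    s <- summary(mylm)                  # summarize once, reuse below
    ctable <- as.data.frame(coef(s))    # Estimate, Std. Error, t value, Pr(>|t|)

    # model-level statistics are scalars, so they repeat on every coefficient row
    mylm_data <- cbind(ctable,
                       Variable = rownames(ctable),
                       Outcome  = outcomes[i],
                       Exposure = exposures[j],
                       Model_N  = nobs(mylm),
                       Rsq      = s$r.squared,
                       Adj_Rsq  = s$adj.r.squared,
                       Sigma    = s$sigma)
    names(mylm_data)[4] <- "Pvalue"
    rownames(mylm_data) <- NULL
    final_lm_data <- rbind(final_lm_data, mylm_data)
  }
}

head(final_lm_data)
```

Because the model-level values are repeated on each coefficient row, you can filter on Variable afterwards (for example, keeping only the exposure rows) without losing them.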

Finally, if covariates also vary from run to run, I am not sure how to automate that part.

akaDrHouse