I have a data frame with numeric columns A, B, C and D, plus a dependent variable. I want to fit a linear regression model for every possible combination of the independent variables (A, B, A+B, A+C, B+C, ...), but I am having trouble generating those combinations from the data frame.

Data frame columns:
DependentVar A B C D

I am trying to loop over the combinations of independent variables, something like this:

var <- A,B,C,D,A+B,A+C,A+D,B+C,B+D,C+D,A+B+C,A+B+D and so on..
for (v in var){
models <- lm (DependentVar ~ eval(parse(text=v)), data=data)
r2 <- append(summary(models)$r.squared)
}

Desired output, as a data frame:

Variable combination  Model R2    
A                      0.8
B                      0.7
.
.

and so on
Any help will be greatly appreciated!

rkg

2 Answers

You have the right idea, but it is cleaner to 1) build each model formula with as.formula() and 2) fit the models with lapply():

set.seed(1)
d <- data.frame(DV = rnorm(100, mean = 100, sd = 10),
                A  = rnorm(100, mean = 100, sd = 10),
                B  = rnorm(100, mean = 100, sd = 10))

formula_list <- list(as.formula('DV ~ A'),
                     as.formula('DV ~ B'),
                     as.formula('DV ~ A + B'))

lapply(formula_list, FUN = lm, data = d)

To get the output data frame, you can use this same machinery, but instead of FUN=lm, set FUN= to be a wrapper for lm that will do the post-regression processing.

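# Fit one model and return its right-hand side and R-squared as a one-row data frame
lm_wrapper <- function(formula, data) {
  reg_res <- lm(formula, data = data)
  rsq <- summary(reg_res)$r.squared
  # as.character() on a two-sided formula gives c("~", lhs, rhs); [3] is the rhs
  data.frame(formula = as.character(formula)[3], rsq = rsq)
}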

all_res<-lapply(formula_list, FUN = lm_wrapper, data=d)

all_res_stack<-do.call('rbind',all_res)

Here is what all_res_stack looks like:

> all_res_stack
  formula         rsq
1       A 0.004809535
2       B 0.026144428
3   A + B 0.026821577
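
The formula_list above is typed out by hand, which stops being practical beyond a few predictors. Here is a minimal sketch of generating it instead, using base R's combn() and reformulate(). It assumes every column other than DV is a predictor; the name `predictors` is an assumption introduced for illustration.

```r
set.seed(1)
d <- data.frame(DV = rnorm(100, mean = 100, sd = 10),
                A  = rnorm(100, mean = 100, sd = 10),
                B  = rnorm(100, mean = 100, sd = 10))

# Hypothetical name: treat every column except the response as a predictor.
predictors <- setdiff(names(d), "DV")

# For each subset size k, combn() enumerates all k-element subsets and
# reformulate() turns each subset into a formula such as DV ~ A + B.
formula_list <- unlist(
  lapply(seq_along(predictors), function(k) {
    combn(predictors, k, simplify = FALSE,
          FUN = function(vars) reformulate(vars, response = "DV"))
  }),
  recursive = FALSE
)

length(formula_list)  # 2^length(predictors) - 1 formulas
```

The resulting list can be passed to lapply(formula_list, FUN = lm_wrapper, data = d) exactly as before. Note that the count grows quickly: 20 predictors already give 2^20 - 1 = 1,048,575 formulas, so it may be worth capping the subset size k.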
AOGSTA
  • Thank you for your comment. One question: how can I generate formula_list automatically instead of writing it out by hand? My actual data frame has more than 20 columns, so writing every combination would be nearly impossible. Could you suggest some code to generate the list of variable combinations? Thanks – rkg Jun 23 '16 at 19:40
  • Agree with @ZheyuanLi. I got halfway through this post before it was marked as a dup, but the original answer has ways to generate the formulas automatically. As an aside, are you sure what you're doing is a good idea? I have very rarely seen people estimate regressions on an industrial scale, looking only at R^2 statistics, in a statistically sound way. – AOGSTA Jun 23 '16 at 20:10
0
set.seed(123)

mydata <- data.frame(A = rnorm(10, mean = 5),
                     B = rnorm(10, mean = 10),
                     C = rnorm(10, sd = 2),
                     D = rnorm(10, sd = 5))
mydata$DependentVar <- with(mydata, A + B + C + D + rnorm(10))

# expand.grid makes a data.frame, where each possible combination of values is
# given a row. Here, each row states which variables to use in a model. Remove
# the row where no variables are used.
independent_vars <- c('A', 'B', 'C', 'D')
include_choices <- lapply(independent_vars, function(x) c(TRUE, FALSE))
names(include_choices) <- independent_vars

combos <- do.call('expand.grid', args = include_choices)

combos <- combos[apply(combos, 1, any), ]

# Use combos to construct each model
predict_some_cols <- function(which_cols) {
  model_vars <- c('DependentVar', colnames(combos)[which_cols])
  lm(DependentVar ~ ., data = mydata[, model_vars])
}

model_list <- apply(combos, 1L, predict_some_cols)

# Name each model after the variables it uses, e.g. 'A + B'. This is a
# clunky way to build the names; improvements welcome.
names(model_list) <- apply(combos, 1,
                           FUN = function(which_cols) {
                             paste0(colnames(combos)[which_cols],
                                    collapse = ' + ')
                           })

# Now go through the models and get the desired data.
rsquared <- vapply(model_list,
                   function(model) summary(model)$r.squared,
                   numeric(1))
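
To get the table asked for in the question, the named rsquared vector can be reshaped into a data frame. This is a small sketch that uses a stand-in vector so the snippet runs on its own: the values are placeholders for illustration, not results from the models above, and the column names `combination` and `r_squared` are assumptions.

```r
# Placeholder values for illustration only; in practice use the rsquared
# vector computed above, whose names are the variable combinations.
rsquared <- c("A" = 0.10, "B" = 0.20, "A + B" = 0.35)

results <- data.frame(combination = names(rsquared),
                      r_squared   = unname(rsquared),
                      stringsAsFactors = FALSE)

# Sort so the combinations with the highest R-squared come first.
results <- results[order(results$r_squared, decreasing = TRUE), ]
rownames(results) <- NULL
results
```

Since R-squared never decreases when a variable is added to a linear model, the largest model will always sort to the top of a table like this.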
Nathan Werth