I am trying to build a regression model with lm(...). My dataset has many features (> 50), and I do not want to write my code as:

lm(output ~ feature1 + feature2 + feature3 + ... + feature70)

Is there a shorthand notation for writing this?

Karolis Koncevičius
iinception
  • The first result of the search "[r] formula many variables" answers your question. – Joshua Ulrich Apr 25 '11 at 16:32
  • Duplicate of [How do I fit a model without specifying the number of variables?](http://stackoverflow.com/q/3384567/271616) and [how to succinctly write a formula with many variables from a data frame?](http://stackoverflow.com/q/5251507/271616) and [Specifying formula in R with glm without explicit declaration of each covariate](http://stackoverflow.com/q/3588961/271616). – Joshua Ulrich Apr 25 '11 at 16:32
  • See also: http://stackoverflow.com/questions/4951442/formula-with-dynamic-number-of-variables – landroni Sep 22 '14 at 13:27

2 Answers

You can use . as described in the help page for formula. The . stands for "all columns not otherwise in the formula".

lm(output ~ ., data = myData)
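
As a minimal sketch of the dot notation, assuming a small simulated data frame (the column names and sizes here are made up for illustration), the following shows . expanding to every column other than the response, and a term being removed again with -:

```r
# Simulated data; names and sizes are illustrative assumptions
set.seed(1)
myData <- data.frame(
  output   = rnorm(20),
  feature1 = rnorm(20),
  feature2 = rnorm(20),
  feature3 = rnorm(20)
)

# "." expands to every column of myData except the response
fit_all <- lm(output ~ ., data = myData)
names(coef(fit_all))

# A column can be excluded again by subtracting it
fit_drop <- lm(output ~ . - feature2, data = myData)
names(coef(fit_drop))
```

Note that . picks up every remaining column, so identifier or date columns need to be dropped from the data or subtracted in the formula.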

Alternatively, construct the formula manually with paste. This example is from the as.formula() help page:

xnam <- paste("x", 1:25, sep="")
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))

You can then pass this object to a regression function: lm(fmla, data = myData).
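
A related option, sketched here under assumed column names, is reformulate() from the stats package, which builds the same kind of formula from a character vector without pasting strings by hand. The id column and the simulated data are hypothetical:

```r
# Simulated data; the "id" column is a hypothetical non-predictor
set.seed(1)
myData <- data.frame(
  id       = 1:20,
  output   = rnorm(20),
  feature1 = rnorm(20),
  feature2 = rnorm(20)
)

# Take the predictor names from the data, dropping the columns we do not want
predictors <- setdiff(names(myData), c("id", "output"))

# reformulate() joins the term labels with "+" and attaches the response
fmla <- reformulate(predictors, response = "output")
fit  <- lm(fmla, data = myData)
```

Deriving the names from names(myData) avoids hard-coding the number of features.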

Chase

Could also try things like:

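lm(output ~ as.matrix(myData[, 2:71]), data = myData)

This assumes output is the first column and feature1 through feature70 are the next 70 columns, all numeric. The as.matrix() call is needed because a data frame on the right-hand side of a formula is rejected by lm() as an invalid variable type.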

Or

features <- paste("feature", 1:70, sep = "")
lm(output ~ as.matrix(myData[, features]), data = myData)

This is probably smarter, as it doesn't matter where in your data the columns are.

It might cause issues if rows are removed for NAs, though.
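
One possible variation, offered as a sketch rather than as part of the original answer, is to subset the data frame to the wanted columns and let . pick them up. This keeps plain coefficient names and leaves NA handling to lm(). The simulated data and the id column are assumptions:

```r
# Simulated data; names are illustrative assumptions
set.seed(1)
myData <- data.frame(
  id       = 1:20,
  output   = rnorm(20),
  feature1 = rnorm(20),
  feature2 = rnorm(20)
)

features <- paste0("feature", 1:2)

# Keep only the response and the chosen predictors, then use "."
fit <- lm(output ~ ., data = myData[, c("output", features)])
names(coef(fit))
```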

Gavin Simpson
nzcoops