
Possible Duplicate:
short formula call for many variables when building a model

I have a biggish data frame (112 variables) that I'd like to do a stepwise logistic regression on using R. I know how to set up the glm model and the stepAIC model, but I'd rather not type in all the column names to specify the independent variables. Is there a fast way to give the glm model an entire data frame as independent variables, so that it will recognize each column as an x variable to be included in the model? I tried:

ft<-glm(MFDUdep~MFDUind, family=binomial)

But it didn't work (wrong data types). MFDUdep and MFDUind are both data frames, with MFDUind containing 111 'x' variables and MFDUdep containing a single 'y'.

TomR
  • May I make a rather strong suggestion that unless you have a **really** good reason to the contrary you use some form of penalized regression (e.g. the `glmnet` package) rather than stepwise regression? – Ben Bolker Dec 27 '12 at 18:19
  • I'll take that into consideration, but it doesn't really answer my question. – TomR Dec 27 '12 at 18:21
  • It wasn't meant to answer your question, but if someone asks me for directions to the airport, and I notice that their hair is on fire, I'm going to mention it anyway. – joran Dec 27 '12 at 18:24
  • Using stepwise regression is equivalent to a hair fire, huh? I've used best subsets on smaller data samples and typically haven't found that much difference in practice (between that and f-b stepwise). – TomR Dec 27 '12 at 18:27
  • Hi @TomR I've noticed you haven't accepted any answers to your 5 [so] questions despite thanking people for a good response. See the [ask] section of the [faq] for details on how to accept answers and why this is useful. Do note that you aren't obliged to do this and you should only accept an answer if it does answer your Q. – Gavin Simpson Dec 27 '12 at 18:28
  • ... and I saw that @Didzis had already answered the question (and now GavinSimpson has given a more thorough answer), so I didn't feel the need to give an answer along with my warning ... you can search on google or http://stats.stackexchange.com for "stepwise regression problems" or some such to see why I and others are giving you this warning ... – Ben Bolker Dec 27 '12 at 18:29
  • @TomR Ben's suggestion wasn't to use best subsets which **isn't** a penalised regression method. He suggested you use a shrinkage technique such as the elastic net. Those techniques have been designed for regression problems such as yours, whereas step-wise regression has been shown to be quite deficient. – Gavin Simpson Dec 27 '12 at 18:31
  • TomR: best subsets is subject to most of the same issues as stepwise. Stepwise has some additional arbitrariness built in, but the real problem is with inducing bias and inflating type I errors. The "right" technique depends on your precise goal, but penalized regression generally dominates older/more naive approaches. (What @GavinSimpson said too.) – Ben Bolker Dec 27 '12 at 18:31
  • also: http://stackoverflow.com/questions/5774813/short-formula-call-for-many-variables-when-building-a-model , http://stackoverflow.com/questions/3588961/specifying-formula-in-r-with-glm-without-explicit-declaration-of-each-covariate (voting to close as duplicate) – Ben Bolker Dec 27 '12 at 18:35

1 Answer


You want the `.` special symbol in the formula notation. It is also probably better to have the response and predictors in a single data frame.

Try:

MFDU <- cbind(MFDUdep, MFDUind)
ft <- glm(y ~ ., data = MFDU, family = binomial)

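If you would rather keep the two frames separate, the same formula can also be built programmatically from the column names with base R's `reformulate()`. A minimal sketch with stand-in data of the same shape as the question's frames (the response column being named `y` is an assumption):

```r
# Stand-in data: a one-column response frame and a predictor frame (assumed shapes)
MFDUdep <- data.frame(y = rbinom(50, 1, 0.5))
MFDUind <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
MFDU <- cbind(MFDUdep, MFDUind)

# reformulate() builds y ~ x1 + x2 + x3 from the predictor names,
# which is exactly what y ~ . expands to in the combined frame
f <- reformulate(names(MFDUind), response = "y")
ft <- glm(f, data = MFDU, family = binomial)
```

Either way, `glm` ends up fitting one term per predictor column without you typing any of the 111 names.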
Now that I have given you the rope, I am obliged to at least warn you about the potential for hanging...

The approach you are taking is usually not the recommended one, unless perhaps prediction is the purpose of the model. Regression coefficients for the selected variables are likely to be strongly biased, so if you are using this for enlightenment, then rethink your approach.

You will also need a lot of observations to allow 100+ terms in a model.

Better alternatives exist; e.g. see the glmnet package for one such approach, which allows ridge, lasso, or both (elastic net) constraints on the set of coefficients. This lets you minimise model error at the expense of a small amount of additional bias.
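For concreteness, a minimal sketch of what the glmnet route might look like on simulated data (the use of `cv.glmnet` and `alpha = 0.5` here are illustrative choices, not something from the question):

```r
library(glmnet)  # install.packages("glmnet") if needed

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)  # stands in for as.matrix(MFDUind)
y <- rbinom(100, 1, 0.5)               # stands in for MFDUdep$y

# Cross-validated elastic net; alpha mixes ridge (0) and lasso (1) penalties
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
coef(cvfit, s = "lambda.min")          # coefficients at the CV-chosen penalty
```

Note that glmnet takes a numeric matrix and a response vector rather than a formula, so factor predictors would first need expanding, e.g. with `model.matrix()`.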

Gavin Simpson
  • Thanks for the answer. I'm using it to identify significant variables for differences between two population groups, along with several types of decision trees. I'll look further into glmnet. – TomR Dec 27 '12 at 18:28
  • In that case don't believe anything `stepAIC` tells you in that regard; many statisticians would consider it a random significant variable selector. – Gavin Simpson Dec 27 '12 at 18:32
  • thanks for the advice...can you recommend a reference on why that is? From what I've read in stats and information theory AIC is a useful criterion. – TomR Dec 27 '12 at 19:29
  • @TomR Gavin would be a better source for a reference than me, but just so you know, it's not that AIC itself is a bad measure, its the stepwise nature of the procedures themselves that are problematic. – joran Dec 27 '12 at 19:47
  • @TomR AIC is useful in terms of model *averaging* or weighting candidate models. In the manner you intended to use it, it is just a restatement of the p-value. Try Frank Harrell's [Regression Modelling Strategies](http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322) for one oft-mentioned discussion of this topic. – Gavin Simpson Dec 27 '12 at 21:41
  • Teaching a naive use of `stepAIC` is one of the weaker points in `MASS` (the book and the package/bundle). I always wonder why Brian Ripley, who otherwise complains about everything that is not waterproof, and mostly with good reasons, has let it pass in two revisions. – Dieter Menne Dec 28 '12 at 10:26