I'm training a predictive model (a logistic regression with glm), and I want to include most but not all of my data frame's variables in the model. So I have code that looks like this:

logModel <- glm(Y ~ . - X1 - X2 - X3, data = train, family = binomial)
modelPrediction <- predict(logModel, type = "response", newdata = test)

But I was getting "factor has new levels" errors for X1-X3, the very variables I had specifically excluded from my model. An SO user explained (in the question linked below) that writing Y ~ . - X1 still exposes me to errors from X1, because the formula is expanded to Y ~ (includedVars + X1) - X1, so X1 is still evaluated when the model frame is built. Apparently glm() also treats character variables as factors, so any string variable with values in test that never appeared in train can throw that error when I call predict(..., type="response"). The suggestion at that SO link is to remove the variables from the train/test datasets entirely, which works but seems clunky and less than ideal. Linked question: "Factor has new levels" error for variable I'm not using
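One way to sidestep this without deleting columns is to build the formula from an explicit vector of predictor names, so the excluded columns never enter the model frame. The following is a minimal sketch on made-up data: the toy data frames and the column names X4 and X5 are assumptions for illustration, while `setdiff()` and `reformulate()` are base R.

```r
# Toy data (hypothetical): X1 is a character column to be excluded.
# test contains a value of X1 ("c") that never occurs in train.
set.seed(1)
train <- data.frame(
  Y  = rbinom(40, 1, 0.5),
  X1 = sample(c("a", "b"), 40, replace = TRUE),
  X4 = rnorm(40),
  X5 = rnorm(40),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Y  = rbinom(10, 1, 0.5),
  X1 = "c",
  X4 = rnorm(10),
  X5 = rnorm(10),
  stringsAsFactors = FALSE
)

# Keep every column except the response and the excluded variables.
excluded   <- c("X1")
predictors <- setdiff(names(train), c("Y", excluded))

# Builds Y ~ X4 + X5; X1 never appears in the formula at all.
f <- reformulate(predictors, response = "Y")

logModel <- glm(f, data = train, family = binomial)
modelPrediction <- predict(logModel, type = "response", newdata = test)
```

Because the formula names only the wanted predictors, `predict()` has no reason to look at X1 in `test`, and the same approach scales to hundreds of columns since the name vector is computed rather than typed.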

In SAS I can select

"all variables between someVar and someOtherVar in the dataset"

with the code someVar -- someOtherVar. Does anything like this exist in R? If so, I've been unable to find it. But if Y ~ . - X exposes you to errors when running models, I've got to think there's a cleaner way.
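R formulas have no range operator, but a column range can be turned into a vector of names before the formula is built. This is only a sketch under assumptions: the data frame and its column order are invented, and `vars_between()` is a helper defined here, not an existing R function.

```r
# Hypothetical helper: names of all columns from `first` to `last`,
# in the order they appear in the data frame (similar to SAS's first -- last).
vars_between <- function(df, first, last) {
  idx <- match(c(first, last), names(df))
  if (anyNA(idx)) stop("column not found in data frame")
  names(df)[idx[1]:idx[2]]
}

# Toy data (hypothetical column names and order).
set.seed(2)
train <- data.frame(
  Y      = rbinom(30, 1, 0.5),
  id     = seq_len(30),
  myvar1 = rnorm(30),
  myvar2 = rnorm(30),
  myvar3 = rnorm(30),
  notes  = "free text",
  stringsAsFactors = FALSE
)

predictors <- vars_between(train, "myvar1", "myvar3")
f <- reformulate(predictors, response = "Y")  # Y ~ myvar1 + myvar2 + myvar3
logModel <- glm(f, data = train, family = binomial)

# Column selection with subset() also accepts a range; Y must be listed
# explicitly when it lies outside that range.
trainSub <- subset(train, select = c(Y, myvar1:myvar3))
```

Either form depends on column order, so it is worth checking `names(train)` whenever the dataset is rebuilt.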

Max Power
  • You cannot express ranges of values in R formulas like you can in SAS. You can with `subset()`. You could subset the columns this way before passing into the model. `glm(Y ~ . , data=subset(train, select=myvar1:myvar9), family=binomial)` – MrFlick Apr 29 '15 at 01:32
  • Thanks MrFlick, I think I'll go with subsetting my train/test datasets in the glm() call like you suggest, either using subset/select, or something like train[c(1,5:10)] (taken from http://www.statmethods.net/management/subset.html) – Max Power Apr 29 '15 at 01:51
  • Would explicitly constructing the formula argument be a viable option? In my experience, this is the cleanest way down the line. – Roman Luštrik Apr 29 '15 at 05:53
  • Hi Roman, that would be viable but somewhat involved when you have tens of features, or even potentially hundreds in early models when still doing feature selection. But maybe you're right that would be cleanest. – Max Power Apr 29 '15 at 07:02
