I'm training a predictive model (glm - logistic), and I want to include most but not all of my dataframe variables in the model. So I have code that looks like this:
logModel = glm(Y ~ . -X1 -X2 -X3, data=train, family=binomial)
modelPrediction <- predict(logModel, type="response", newdata=test)
But I was getting "Factor has new levels" errors for X1-X3, which I had specifically excluded from my model. An SO user explained in a comment below that using [Y ~ . -X1] exposes me to errors from X1, because the formula is expanded to Y ~ [includedVars + X1] - X1, so X1 still ends up in the model frame. And apparently, all string variables are converted to factors when that model frame is built (both in glm(...) and in predict(..., type="response")), so any string variable with values in test that never appear in train will throw that "Factor has new levels" error. The suggestion at the SO link below is to remove the variables from the train/test datasets entirely, which works but seems clunky and far from ideal.
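To make this concrete, here is a minimal sketch of the problem and of the drop-the-columns workaround. The toy data, the column name A, and the helper names (excluded, train2, test2) are made up for illustration; only Y, X1, train, test, and logModel come from my actual code.

```r
# Toy data (hypothetical): X1 is a string column that is excluded from the model
set.seed(1)
train <- data.frame(
  Y  = rbinom(20, 1, 0.5),
  A  = rnorm(20),
  X1 = rep(c("a", "b"), 10),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Y  = rbinom(5, 1, 0.5),
  A  = rnorm(5),
  X1 = c("a", "b", "c", "a", "b"),  # "c" never appears in train
  stringsAsFactors = FALSE
)

# X1 is subtracted in the formula, but it is still part of the model frame
logModel <- glm(Y ~ . - X1, data = train, family = binomial)

# Capture the error message instead of stopping the script
res <- tryCatch(
  predict(logModel, type = "response", newdata = test),
  error = function(e) conditionMessage(e)
)

# Workaround from the linked question: drop the columns before fitting
excluded <- c("X1")
train2 <- train[, !(names(train) %in% excluded), drop = FALSE]
test2  <- test[,  !(names(test)  %in% excluded), drop = FALSE]

logModel2 <- glm(Y ~ ., data = train2, family = binomial)
modelPrediction <- predict(logModel2, type = "response", newdata = test2)
```

With the columns removed, the formula Y ~ . never sees X1, so its levels are not checked at prediction time.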
"Factor has new levels" error for variable I'm not using
In SAS I can select "all variables between someVar and someOtherVar in the dataset" with the code someVar -- someOtherVar.
Does anything like this exist in R? If so, I've been unable to find it. But if [Y ~ . -X] exposes you to errors when running models, I've got to think there's a cleaner way.
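The closest I can get by hand is something like the sketch below, which looks up the positions of the two endpoint columns and builds an explicit formula from the names in between. The data frame and its column names (id, midVar, note) are hypothetical, and I'm not claiming this is the idiomatic way to do it; it just shows the behaviour I'm after.

```r
# Hypothetical data frame; column order matters for the range selection
set.seed(1)
train <- data.frame(
  Y            = rbinom(20, 1, 0.5),
  id           = 1:20,
  someVar      = rnorm(20),
  midVar       = rnorm(20),
  someOtherVar = rnorm(20),
  note         = rep(c("a", "b"), 10),
  stringsAsFactors = FALSE
)

# Positions of the two endpoint columns
from <- match("someVar", names(train))
to   <- match("someOtherVar", names(train))

# Every column name from someVar through someOtherVar, in dataset order
predictors <- names(train)[from:to]

# reformulate() builds Y ~ someVar + midVar + someOtherVar
f <- reformulate(predictors, response = "Y")

logModel <- glm(f, data = train, family = binomial)
```

Because the formula names its predictors explicitly, columns outside the range (id and note here) are never pulled into the model frame.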