Let's use an example from mtcars, with cyl as a categorical variable:
library(caret)
da <- mtcars
da$cyl <- factor(da$cyl)
# include cyl among the features
features <- c("cyl","hp","drat","wt","qsec")
# our dependent variable is mpg
We check what dummyVars does:
head(predict(dummyVars(mpg~.,data=da[,c("mpg",features)]),da))
cyl.4 cyl.6 cyl.8 hp drat wt qsec
Mazda RX4 0 1 0 110 3.90 2.620 16.46
Mazda RX4 Wag 0 1 0 110 3.90 2.875 17.02
Datsun 710 1 0 0 93 3.85 2.320 18.61
Hornet 4 Drive 0 1 0 110 3.08 3.215 19.44
Hornet Sportabout 0 0 1 175 3.15 3.440 17.02
Valiant 0 1 0 105 2.76 3.460 20.22
You can see it introduces 3 binary variables for cyl, one per level, and keeps the continuous variables as they are. The dependent variable is not included in the output of predict(...), so it has to be added back.
So for the training:
# predict() returns a matrix; data.frame() makes sure the formula interface gets a data frame
onehot_data <- data.frame(mpg=da$mpg,
                          predict(dummyVars(mpg~.,data=da[,c("mpg",features)]),da))
lm_model <- train(mpg ~.,data=onehot_data,
method = "lm",
trControl = trainControl(method = "cv", number = 10),
preProcess = c("center", "scale"),
na.action=na.exclude
)
And it throws a warning:
Warning messages:
1: In predict.lm(modelFit, newdata) :
prediction from a rank-deficient fit may be misleading
For linear models, caret fits a model with an intercept. Because every level of cyl gets its own 0/1 column, those columns sum to 1 in every row, so the intercept is an exact linear combination of your one-hot encoded variables and the fit is rank-deficient.
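The linear dependence can be checked directly. This is a minimal sketch in base R, assuming that model.matrix(~ cyl - 1) is an acceptable stand-in for the one-hot columns that dummyVars produces (the column names differ: cyl4 instead of cyl.4). The names dummies, X and X_ref are only for illustration.

```r
da <- mtcars
da$cyl <- factor(da$cyl)

# One 0/1 column per level of cyl (no intercept), i.e. a full one-hot encoding
dummies <- model.matrix(~ cyl - 1, data = da)

# Every row has exactly one 1, so the columns add up to the intercept column
all(rowSums(dummies) == 1)

# Design matrix with an explicit intercept: 4 columns, but one is redundant
X <- cbind(intercept = 1, dummies)
qr(X)$rank < ncol(X)

# Dropping one level (here cyl4, as the reference) removes the redundancy
X_ref <- X[, colnames(X) != "cyl4"]
qr(X_ref)$rank == ncol(X_ref)
```

All three checks should come back TRUE, which is the situation the warning is pointing at.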
You need to decide which level of your categorical variable will be the reference level, and remove that column from the onehot data frame, for example:
# remove cyl.4 (column 2), so 4 cylinders becomes the reference level
onehot_data = onehot_data[,-2]
lm_model <- train(mpg ~.,data=onehot_data,
method = "lm",
trControl = trainControl(method = "cv", number = 10),
preProcess = c("center", "scale"),
na.action=na.exclude
)
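As an alternative to removing the column by hand, dummyVars has a fullRank argument that drops one level per factor for you. The sketch below assumes the same da and features as above; dv and fullrank_data are illustrative names, and which level gets dropped is chosen by caret (it should be the first one, 4 cylinders here), so check the column names if you need a specific reference level.

```r
library(caret)

da <- mtcars
da$cyl <- factor(da$cyl)
features <- c("cyl", "hp", "drat", "wt", "qsec")

# fullRank = TRUE encodes a factor with k levels as k - 1 columns
dv <- dummyVars(mpg ~ ., data = da[, c("mpg", features)], fullRank = TRUE)

# Add the dependent variable back, as before
fullrank_data <- data.frame(mpg = da$mpg, predict(dv, da))
colnames(fullrank_data)

# Same training call as above, now on the reduced encoding
lm_model <- train(mpg ~ ., data = fullrank_data,
                  method = "lm",
                  trControl = trainControl(method = "cv", number = 10),
                  preProcess = c("center", "scale"))
```

With one cyl column gone, the rank-deficiency warning should no longer appear.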