I have a dataset in which I would like to one-hot encode one variable and then build a linear model (lm).

This variable is called 'zone'.

What I tried to do is:

lm_model <- train(formula(paste0("price ~", paste0(features, collapse = " + "))),
                  data = predict(dummyVars( ~ "zone", data = data_train), newdata =  data_train), 
                  method = "lm", 
                  trControl = trainControl(method = "cv", number = 10),
                  preProcess = c("center", "scale"),
                  na.action=na.exclude
)

I am not sure about the following part; could someone please guide me here:

data = predict(dummyVars( ~ "zone", data = data_train), newdata =  data_train), 
kskirpic
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Apr 13 '20 at 18:42

1 Answer

Let's use an example with cyl as a categorical variable from mtcars:

library(caret)
da <- mtcars
da$cyl <- factor(da$cyl)
# we can include cyl as features
features <- c("cyl","hp","drat","wt","qsec")
#our dependent is mpg

We check what dummyVars does:

    head(predict(dummyVars(mpg~.,data=da[,c("mpg",features)]),da))
                  cyl.4 cyl.6 cyl.8  hp drat    wt  qsec
Mazda RX4             0     1     0 110 3.90 2.620 16.46
Mazda RX4 Wag         0     1     0 110 3.90 2.875 17.02
Datsun 710            1     0     0  93 3.85 2.320 18.61
Hornet 4 Drive        0     1     0 110 3.08 3.215 19.44
Hornet Sportabout     0     0     1 175 3.15 3.440 17.02
Valiant               0     1     0 105 2.76 3.460 20.22

You can see that it introduces 3 binary variables for cyl and keeps the continuous variables as they are. The dependent variable (mpg) is not included in the output of predict(...), so it has to be added back before training.

So for the training:

onehot_data <- cbind(mpg=da$mpg,
predict(dummyVars(mpg~.,data=da[,c("mpg",features)]),da))

lm_model <- train(mpg ~.,data=onehot_data,  
                  method = "lm", 
                  trControl = trainControl(method = "cv", number = 10),
                  preProcess = c("center", "scale"),
                  na.action=na.exclude
)

And it throws you a warning:

Warning messages:
1: In predict.lm(modelFit, newdata) :
  prediction from a rank-deficient fit may be misleading

For linear models, caret fits a model with an intercept. The one-hot encoded columns of a categorical variable always sum to 1 in every row, so the intercept is a linear combination of those columns and the fit is rank-deficient.
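The rank deficiency can be checked directly in base R. This is only a minimal sketch that uses model.matrix() in place of dummyVars(); the names onehot, X and X_ref are made up for illustration:

```r
da <- mtcars
da$cyl <- factor(da$cyl)

# Full one-hot encoding of cyl: one column per level, no intercept
onehot <- model.matrix(~ cyl - 1, data = da)

# Add the intercept column that lm() would include by default
X <- cbind(intercept = 1, onehot)

ncol(X)      # number of columns in the design matrix
qr(X)$rank   # numerical rank; smaller than ncol(X) means rank-deficient

# Dropping one level (here cyl4) as the reference restores full rank
X_ref <- X[, colnames(X) != "cyl4"]
qr(X_ref)$rank == ncol(X_ref)
```

If the rank is smaller than the number of columns, lm() cannot estimate all coefficients, which is what the warning is pointing at.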

You need to decide which level of your categorical variable will be the reference level and remove that column from the one-hot data frame, for example:

# remove cyl.4 (column 2, after mpg) so that 4 cylinders becomes the reference level
onehot_data = onehot_data[,-2]
lm_model <- train(mpg ~.,data=onehot_data,  
                  method = "lm", 
                  trControl = trainControl(method = "cv", number = 10),
                  preProcess = c("center", "scale"),
                  na.action=na.exclude
)
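
As an alternative to removing the column by hand, dummyVars() has a fullRank argument that leaves out one level per factor. The following is a sketch under the same mtcars setup, not something taken from the question's data; the names dv, onehot_fr and lm_model_fr are made up for illustration, and na.action is left out because mtcars has no missing values:

```r
library(caret)

da <- mtcars
da$cyl <- factor(da$cyl)
features <- c("cyl", "hp", "drat", "wt", "qsec")

# fullRank = TRUE makes dummyVars omit one level per factor,
# so the encoded columns are no longer collinear with the intercept
dv <- dummyVars(mpg ~ ., data = da[, c("mpg", features)], fullRank = TRUE)

# predict() returns only the predictors, so mpg is added back
onehot_fr <- data.frame(mpg = da$mpg, predict(dv, newdata = da))

lm_model_fr <- train(mpg ~ ., data = onehot_fr,
                     method = "lm",
                     trControl = trainControl(method = "cv", number = 10),
                     preProcess = c("center", "scale"))
```

With this approach you do not choose the reference level explicitly; if a specific level should be the reference, reorder the factor levels first (for example with relevel()) or drop the column manually as shown above.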
StupidWolf