I have a data set on which I use the model.matrix() function to convert factor variables to dummy variables. My data has 10 columns like this, each with 3 levels (2, 3, 4), and I've been creating dummy variables for each of them separately:
    xFormData <- function(dataset) {
      # Expand each type column into dummy variables, one model.matrix() call per column
      mm0 <- model.matrix(~ factor(dataset$type), data = dataset)
      mm1 <- model.matrix(~ factor(dataset$type_last1), data = dataset)
      mm2 <- model.matrix(~ factor(dataset$type_last2), data = dataset)
      mm3 <- model.matrix(~ factor(dataset$type_last3), data = dataset)
      mm4 <- model.matrix(~ factor(dataset$type_last4), data = dataset)
      mm5 <- model.matrix(~ factor(dataset$type_last5), data = dataset)
      mm6 <- model.matrix(~ factor(dataset$type_last6), data = dataset)
      mm7 <- model.matrix(~ factor(dataset$type_last7), data = dataset)
      mm8 <- model.matrix(~ factor(dataset$type_last8), data = dataset)
      mm9 <- model.matrix(~ factor(dataset$type_last9), data = dataset)
      mm10 <- model.matrix(~ factor(dataset$type_last10), data = dataset)
      # Append all dummy columns to the original data
      dataset <- cbind(dataset, mm0, mm1, mm2, mm3, mm4, mm5, mm6, mm7, mm8, mm9, mm10)
      dataset
    }
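To make the setup concrete, here is a minimal sketch of what a single one of these calls produces. The data frame `toy` and its values are made up for illustration and are not my real data. Note that with this formula the result includes an "(Intercept)" column alongside the indicator columns.

```r
# Made-up data: one column with the three levels 2, 3, 4 (illustrative only)
toy <- data.frame(type = c(2, 3, 4, 3, 2))

# Same call shape as in xFormData(), referring to the column by name
mm <- model.matrix(~ factor(type), data = toy)

# Column names: an "(Intercept)" column plus one indicator
# for each level other than the reference level
print(colnames(mm))

# One row per observation
print(dim(mm))
```

Since every call returns its own "(Intercept)" column, binding several of these matrices together would repeat that constant column once per call.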
I'm wondering whether this is the wrong procedure: after running randomForest on the data and plotting the variable importance, the plot showed the dummy variable columns individually. So if, say, columns 61-63 are the 3 dummy variables for column 10, randomForest treats column 62 by itself as an important predictor.
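To illustrate the kind of grouping I mean, here is a rough sketch on made-up data of how dummy columns can be traced back to the variable they came from, using the "assign" attribute that model.matrix() attaches to its result. The names `toy`, `term_index`, and `groups` are hypothetical, and this only shows the bookkeeping; it does not by itself make a model treat the columns as one variable.

```r
# Made-up data with two three-level factors (illustrative only)
toy <- data.frame(
  type       = factor(c(2, 3, 4, 3, 2)),
  type_last1 = factor(c(4, 4, 2, 3, 3))
)

# One formula covering both columns, so there is a single intercept
f  <- ~ type + type_last1
mm <- model.matrix(f, data = toy)

# "assign" holds, for each column, the index of the term it came from
# (0 means the intercept)
term_index <- attr(mm, "assign")
term_names <- c("(Intercept)", attr(terms(f), "term.labels"))

# Group the dummy column names by their source variable
groups <- split(colnames(mm), term_names[term_index + 1])
print(groups)
```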
I have 2 questions:
1) Is this ok?
2) If not, how can I group the dummy variables so that the random forest knows they belong together?