I have a data set on which I use the model.matrix() function to convert factor variables to dummy variables. My data has 10 such columns, each with 3 levels (2, 3, 4), and I've been creating dummy variables for each of them separately.

xFormData <- function(dataset){
    mm0 <- model.matrix(~ factor(dataset$type) , data=dataset)
    mm1 <- model.matrix(~ factor(dataset$type_last1), data = dataset)
    mm2 <- model.matrix(~ factor(dataset$type_last2), data = dataset)
    mm3 <- model.matrix(~ factor(dataset$type_last3), data = dataset)
    mm4 <- model.matrix(~ factor(dataset$type_last4), data = dataset)
    mm5 <- model.matrix(~ factor(dataset$type_last5), data = dataset)
    mm6 <- model.matrix(~ factor(dataset$type_last6), data = dataset)
    mm7 <- model.matrix(~ factor(dataset$type_last7), data = dataset)
    mm8 <- model.matrix(~ factor(dataset$type_last8), data = dataset)
    mm9 <- model.matrix(~ factor(dataset$type_last9), data = dataset)
    mm10 <- model.matrix(~ factor(dataset$type_last10), data = dataset)

    dataset <- cbind(dataset, mm0, mm1, mm2, mm3, mm4, mm5, mm6, mm7, mm8, mm9, mm10)

    dataset
}

I'm wondering whether this is the wrong procedure: after running a randomForest on the data and plotting the variable importance, the plot showed the dummy variable columns individually. So if, say, columns 61-63 were the 3 dummy variables for column 10, the randomForest sees column 62 by itself as an important predictor.

I have 2 questions:

1) Is this ok?

2) If not, how can I group the dummy variables so that the rf knows they are together?

screechOwl
  • You do not need to create dummy variables: making sure that they are factors (rather than numbers) should suffice. – Vincent Zoonekynd Feb 12 '12 at 23:06
  • @VincentZoonekynd This is actually a follow-up to http://stackoverflow.com/questions/9145874/r-caret-rfe-variable-selection-for-factors-and-nas/9147316#9147316 , where the OP found that his machine learning workflow *does not* work with factor-coded features. – John Colby Feb 13 '12 at 19:27

1 Answer

This is OK, and it is what happens behind the scenes anyway if you leave the factors as factors. Different levels of a factor are different features for most machine learning purposes. Think of a hypothetical example like test outcome ~ school: maybe going to school A is very predictive of whether you pass or fail the test, but going to school B or school C is not. Then the school A feature would be useful, but the others would not.
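
As a rough illustration of levels becoming separate features, here is a minimal base R sketch. The data frame `toy` and its values are invented for this example; only the column names `type` and `type_last1` are borrowed from the question. Once the columns are factors, a single model.matrix() call expands all of them at once.

```r
# Toy data, invented for illustration; levels 2, 3, 4 as in the question.
toy <- data.frame(
  type       = c(2, 3, 4, 2, 3, 4),
  type_last1 = c(4, 4, 2, 3, 2, 3)
)

# Convert every column to a factor so model.matrix() treats it as categorical.
toy[] <- lapply(toy, factor)

# One call expands all factors. With the default treatment contrasts, the
# first level of each factor is the baseline and gets no column of its own.
mm <- model.matrix(~ ., data = toy)
colnames(mm)
```

Each non-baseline level shows up as its own 0/1 column, which is why a variable importance plot built on such a matrix ranks the levels individually.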

This is covered in one of the caret vignette documents: http://cran.r-project.org/web/packages/caret/vignettes/caretMisc.pdf

Also, the cars data set included with caret should be a useful example. It contains 2 factors - "manufacturer" and "car type" - that have been dummy-coded into a series of numeric features for machine learning purposes.

data(cars, package='caret')
head(cars)
John Colby
  • Thanks. As a follow-up: I think that if you go about it this way you can't use n-1 levels but must explicitly code each level, as described in this question: http://stackoverflow.com/questions/4560459/all-levels-of-a-factor-in-a-model-matrix-in-r – screechOwl Feb 15 '12 at 19:44
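
One way to code every level explicitly, sketched in base R under the assumption that all columns of the data frame are factors. The data frame `toy` and its values are invented for illustration. The idea is to pass full (identity) contrast matrices through the `contrasts.arg` argument of model.matrix().

```r
# Toy data, invented for illustration; both columns are already factors.
toy <- data.frame(
  type       = factor(c(2, 3, 4, 2, 3, 4)),
  type_last1 = factor(c(4, 4, 2, 3, 2, 3))
)

# contrasts(f, contrasts = FALSE) returns an identity matrix over the levels
# of f, so every level gets its own indicator column (no dropped baseline).
full_contrasts <- lapply(toy, contrasts, contrasts = FALSE)

mm_full <- model.matrix(~ ., data = toy, contrasts.arg = full_contrasts)
colnames(mm_full)
```

The resulting matrix is deliberately redundant (the indicator columns of each factor sum to the intercept), which matters for linear models but is usually harmless for tree-based methods.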