A similar question was asked before, but the link in its answer points to a random forest example, which doesn't seem to apply to my case.

Here is an example of what I'm trying to do:

library(caret)

# Tuning grid: every combination of depth, number of trees and shrinkage
gbmGrid <-  expand.grid(interaction.depth = c(5, 9),
                    n.trees = (1:3)*200,
                    shrinkage = c(0.05, 0.1))

fitControl <- trainControl(
                       method = "cv",
                       number = 3,
                       classProbs = TRUE)

gbmFit <- train(strong~.-Id-PlayerName, data = train[1:10000,],
             method = "gbm",
             trControl = fitControl,
             verbose = TRUE,
             tuneGrid = gbmGrid)
gbmFit

Everything goes fine, I get the best parameters. Now if I do the prediction:

predictStrong = predict(gbmFit, newdata=train[11000:50000,])

I get a binary vector of predictions, which is good:

[1] 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 ...

However when I try to get probabilities, I get an error:

predictStrong = predict(gbmFit, newdata=train[11000:50000,], type="prob")

Error in `[.data.frame`(out, , obsLevels, drop = FALSE) : 
undefined columns selected

What seems to be the problem?

Additional info:

traceback()
5: stop("undefined columns selected")
4: `[.data.frame`(out, , obsLevels, drop = FALSE)
3: out[, obsLevels, drop = FALSE]
2: predict.train(gbmFit, newdata = train[11000:50000, ], type = "prob")
1: predict(gbmFit, newdata = train[11000:50000, ], type = "prob")

Versions:

R version 3.1.0 (2014-04-10) -- "Spring Dance"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

caret version: 6.0-29

EDIT: I've seen this topic as well, and I don't get an error about variable names. A couple of my variable names contain underscores, which I assume is valid, since make.names returns the same names as the originals.

colnames(train) == make.names(colnames(train))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
enedene
  • Where is the `train` data set coming from? This code is not run-able without it. – MrFlick Jun 06 '14 at 18:09
  • You should show the results of `str(train$strong)`. I suspect that you are doing regression (not classification) since the predicted values appear to be numbers. For classification, make `strong` a factor with levels that are not "0" and "1". – topepo Jun 07 '14 at 02:04
  • MrFlick, unfortunately I'm not allowed to share the data. @topepo I can't check until Sunday. I'm quite sure that it's a factor with levels 0 and 1, but I will get back to you when I'm able to check. Of course I want to do classification, not regression. Thank you. – enedene Jun 07 '14 at 05:06
  • If it is a factor, you should have seen a warning when you fit the model that the factor level values might cause errors (since they are not valid variable names). – topepo Jun 08 '14 at 16:14
  • @topepo the problem was, as you said, that the levels were "0" and "1". After changing the levels to "strong" and "weak", I obtained the probabilities. Thank you. Please provide a formal answer so I can give you credit and close the question. – enedene Jun 08 '14 at 21:28

3 Answers

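When class probabilities are requested, train puts them into a data frame with a column for each class. If the factor levels are not valid variable names, they are automatically changed (e.g. "0" becomes "X0"), which is why selecting the columns by the original level names can fail with "undefined columns selected". train issues a warning in this case that goes something like "At least one of the class levels are not valid R variables names. This may cause errors if class probabilities are generated." For classification, make the outcome a factor whose levels are valid R names rather than "0" and "1".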
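
A minimal base-R sketch of that renaming, assuming a two-class outcome with levels "0" and "1" as in the question. The data frame probs is a hypothetical stand-in for the probability table, not caret's internal code; it only mirrors the out[, obsLevels, drop = FALSE] call visible in the traceback.

```r
# Levels as in the question; "0" and "1" are not syntactically valid R names.
obsLevels <- c("0", "1")

# make.names() turns them into valid names by prefixing "X".
fixed <- make.names(obsLevels)
print(fixed)  # "X0" "X1"

# Hypothetical stand-in for a class-probability table whose columns
# carry the converted names.
probs <- data.frame(X0 = c(0.8, 0.3), X1 = c(0.2, 0.7))

# Selecting columns by the original level names fails, because no
# column is called "0" or "1". The error message is captured as text.
res <- tryCatch(probs[, obsLevels, drop = FALSE],
                error = function(e) conditionMessage(e))
print(res)
```

In this sketch the column selection raises an error instead of returning a data frame, which matches the failure reported in the question.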

topepo

As topepo explained above, the function is getting confused by the variable names being generated.

If you run:

make.names(levels(traintestClass_subset))

and the result differs from how you have labelled the classes in your outcome variable, then this issue will occur. Just make sure the names generated by the code above match the class names you have given your factor, and it should work.
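
That check can be wrapped in a small function. This is only a sketch: levels_are_valid_names is a hypothetical helper name, and the two factors are toy data rather than the outcome from the question.

```r
# Hypothetical helper: TRUE when every factor level is already a
# syntactically valid R name, i.e. make.names() leaves it unchanged.
levels_are_valid_names <- function(f) {
  stopifnot(is.factor(f))
  identical(make.names(levels(f)), levels(f))
}

bad  <- factor(c(0, 1, 0, 1))                 # levels "0" and "1"
good <- factor(c("weak", "strong", "weak"))   # levels "strong" and "weak"

print(levels_are_valid_names(bad))   # FALSE
print(levels_are_valid_names(good))  # TRUE
```

Running such a check on the outcome column before fitting should flag the problem early, without waiting for the prediction step to fail.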

bibzzzz

This is the key:

I get a binary vector of predictions, which is good:

[1] 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 1 ...

Your factor labels could be interpreted as numeric. I don't know why, but if you change 0 to 'a' and 1 to 'b', for instance, it will work without errors.
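
One way to do that relabelling in base R, as a sketch. The vector strong is toy data standing in for the outcome column, and the choice of which label maps to 0 and which to 1 is an assumption made for illustration.

```r
# Toy outcome coded 0/1, standing in for train$strong from the question.
strong <- c(0, 1, 0, 0, 1)

# Recode to a factor with descriptive labels; the mapping 0 -> "weak",
# 1 -> "strong" is assumed here, not taken from the original data.
strong_f <- factor(strong, levels = c(0, 1), labels = c("weak", "strong"))

print(levels(strong_f))  # "weak" "strong"
print(table(strong_f))
```

The recoding has to happen before the model is fitted, so that the fitted object stores the new level names; the asker reported in the comments that probabilities were returned after such a change.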

Vitaly