
I am confused about how to predict on new data after tuning a model with caret. At first it sounds easy: just tune a model and use model$finalModel. But there are some problems.

5.5.1 Pre-Processing Options

These processing steps would be applied during any predictions generated using predict.train, extractPrediction or extractProbs (see details later in this document). The pre-processing would not be applied to predictions that directly use the object$finalModel object. link

As you can read here (it's the creator of the package):

In other words, any pre-processing that you ask for is done to the training set prior to running randomForest. It also applied the same pre-processing to any data that you predict on (using predict(RFFit, testSet)). If you use the finalModel object, you are using predict.randomForest instead of predict.train and none of the pre-processing is done before prediction. link
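Here is a small self-contained illustration of the difference he describes (my own sketch, not code from the links), using the Sonar data and knn just to keep it short:

library(caret)
library(mlbench)
data(Sonar)

set.seed(7)
knn_fit <- train(Class ~ ., data = Sonar, method = "knn",
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "cv", number = 3))

# predict.train: Sonar is centered/scaled with the stored training
# statistics before the underlying predict.knn3 call
p1 <- predict(knn_fit, newdata = Sonar)

# predict on $finalModel: the raw columns go straight into predict.knn3,
# so these predictions need not agree with p1
p2 <- predict(knn_fit$finalModel, newdata = as.matrix(Sonar[, 1:60]),
              type = "class")

table(p1, p2)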

And next

First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why.

There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because their models require numerical representations of the data. The exceptions to this are tree- and rule-based models (that can split on categorical predictors), naive Bayes, and a few others. link
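To see what "convert factor predictors to dummy variables" means in practice, this is roughly what the formula method does under the hood (my own toy example, not code from the links):

df <- data.frame(y    = c(1.2, 3.4, 2.1, 4.8),
                 size = factor(c("small", "large", "medium", "large")))

# model.matrix is what the formula interface applies to the predictors;
# "size" becomes 0/1 columns (sizemedium, sizesmall), so a model fitted
# this way no longer expects a column literally named "size"
model.matrix(y ~ ., data = df)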

In the following example, under "2. Create A Standalone Model" (no pre-processing), you can see how he solves this: he builds a completely new "standalone model" that is then used to predict. It is easy to add the pre-processing here again (see the sketch after the code).

....
# create standalone model using all training data
set.seed(7)
finalModel <- randomForest(Class~., training, mtry=2, ntree=2000)
# make predictions on "new data" using the final model
final_predictions <- predict(finalModel, validation[,1:60])
confusionMatrix(final_predictions, validation$Class)
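For example, adding the pre-processing back around that standalone model could look roughly like this (my own variant, assuming the same kind of 80/20 split of Sonar as in the linked example):

library(caret)
library(mlbench)
library(randomForest)

data(Sonar)
set.seed(7)
in_train   <- createDataPartition(Sonar$Class, p = 0.80, list = FALSE)
training   <- Sonar[in_train, ]
validation <- Sonar[-in_train, ]

# estimate centering/scaling on the training predictors only ...
pp <- preProcess(training[, 1:60], method = c("center", "scale"))

# ... and apply the same transformation to both sets
training_pp   <- predict(pp, training[, 1:60])
validation_pp <- predict(pp, validation[, 1:60])

set.seed(7)
finalModel <- randomForest(x = training_pp, y = training$Class,
                           mtry = 2, ntree = 2000)

final_predictions <- predict(finalModel, validation_pp)
confusionMatrix(final_predictions, validation$Class)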

BUT I don't want to change my code every time or manually extract and insert the parameters from object$finalModel; that is why I am using caret in the first place.

I want to use the object$finalModel, or at least extract its parameter grid, BUT for multiple models (e.g. gbm, knn, ...).

::: linux-gnu R version 3.4.1 (2017-06-30) ::: Single Candle :::

r_get_best <- function(.mth="gbm") {

    library(caret)
    library(mlbench)
    library(randomForest)

    data(Sonar)
    idx <- createDataPartition(Sonar$Class,p=0.80,list=F)
    trn <- Sonar[idx,]
    tst <- Sonar[-idx,]

    # settings: the pre-processing method goes into preProcess= in train();
    # trainControl()'s preProcOptions expects a list of extra options, not
    # the method names, so it is not needed here
    prp <- c("center","scale")
    tcl <- trainControl(method="repeatedcv",number=3,repeats=3,verboseIter=FALSE)

    # train model(.mth)
    set.seed(7)
    fit <- train(Class~.,data=trn,method=.mth,metric="Accuracy",preProcess=prp,trControl=tcl)
    print(fit)
    print("")
    print("----- final model -----")
    print(fit$finalModel)

    # I can get the result, but this way I have to construct a new grid + re-train a new final model
    frs <- fit$results[which.max(fit$results[,fit$metric]),]
    print("")
    print("----- final model extracted -----")
    print(frs)

    # use fit$finalModel OR fit$bestTune ???
    fml <- NULL # ??????????

    # make test prediction on UNSEEN tst data using the final model
    # prd <- predict(fml,tst[,-ncol(tst)])

    # check the quality
    # cfm <- confusionMatrix(prd,tst$Class)
    # print("")
    # print(cfm)

    return(fml)

}

> r_get_best("gbm")

I can't use predict(model$finalModel). I also can't use predict.train() because I am using the formula interface + pre-processing.

So the question is: how can I use the object$finalModel, or extract its parameter setup for multiple methods ("gbm", "knn", "rf", ...), to predict on new (unseen) data, even when I am using the formula interface + pre-processing + maybe cross-validation before?

Or is the best way to use the non-formula interface?
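For comparison, this is roughly what the same call would look like through the non-formula (x/y) interface (my own sketch; with x/y there is no model.matrix step, so factor predictors would be passed to the model as-is):

library(caret)
library(mlbench)
data(Sonar)

set.seed(7)
idx <- createDataPartition(Sonar$Class, p = 0.80, list = FALSE)
trn <- Sonar[idx, ]
tst <- Sonar[-idx, ]

# x/y interface: predictors and outcome are passed separately
fit_xy <- train(x = trn[, -ncol(trn)], y = trn$Class,
                method = "gbm", metric = "Accuracy",
                preProcess = c("center", "scale"),
                trControl = trainControl(method = "repeatedcv",
                                         number = 3, repeats = 3),
                verbose = FALSE)

# predict.train still applies the stored centering/scaling before predicting
prd_xy <- predict(fit_xy, newdata = tst[, -ncol(tst)])
confusionMatrix(prd_xy, tst$Class)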

Shame on me, one part of the question is solved. The train output reports: "The final values used for the model were n.trees = 150, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10."

> Tuned Parameters can be accessed through model$bestTune

 n.trees interaction.depth shrinkage n.minobsinnode
     150                 2       0.1             10
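A sketch of reusing those values rather than typing them by hand (assuming "fit", "trn" and "tst" are the objects built inside r_get_best() above; trainControl(method = "none") skips resampling and just refits the single parameter combination in fit$bestTune on all of trn):

fit$bestTune$n.trees            # 150
fit$bestTune$interaction.depth  # 2

final_fit <- train(Class ~ ., data = trn, method = "gbm",
                   preProcess = c("center", "scale"),
                   tuneGrid  = fit$bestTune,
                   trControl = trainControl(method = "none"),
                   verbose = FALSE)

prd <- predict(final_fit, newdata = tst[, -ncol(tst)])
confusionMatrix(prd, tst$Class)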

But is this the way to go, create a new model and train it on all the data again?
