I trained a decision tree model using the train function from the caret library:

gr = expand.grid(trials = c(1, 10, 20), model = c("tree", "rules"), winnow = c(TRUE, FALSE))
dt = train(y ~ ., data = train, method = "C5.0", trControl = trainControl(method = 'cv', number = 10), tuneGrid = gr)

Now I would like to plot the decision tree for the final model, but this command doesn't work:

plot(dt$finalModel)

Error in data.frame(eval(parse(text = paste(obj$call)[xspot])), eval(parse(text = paste(obj$call)[yspot])),  : 
  arguments imply differing number of rows: 4160, 208, 0

Someone already asked about it here: topic

The suggested solution was to use bestTune from the fitted train object to define the corresponding C5.0 model manually, and then plot that C5.0 model as usual:

c5model = C5.0(x = x, y = y, trials = dt$bestTune$trials, rules = dt$bestTune$model == "rules", control = C5.0Control(winnow = dt$bestTune$winnow))
plot(c5model)

I tried this. It does make it possible to plot the C5.0 model, but the predicted probabilities from the train object and from the manually recreated C5.0 model don't match.

So my question is: is it possible to extract the final C5.0 model from a caret::train object and plot its decision tree?

Helios

1 Answer

The predicted probabilities should be the same, see below:

library(MASS)
library(caret)
library(C50)
library(partykit)

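# Pima.tr / Pima.te: the last column ("type") is the class label
traindata = Pima.tr
testdata = Pima.te

# Small tuning grid: trees only, 1 or 2 boosting trials, winnowing on/off
gr = expand.grid(trials = c(1, 2),
                 model = c("tree"), winnow = c(TRUE, FALSE))

# Fit through the x/y interface of train()
dt = train(x = traindata[, -ncol(traindata)], y = traindata[, ncol(traindata)],
           method = "C5.0", trControl = trainControl(method = 'cv', number = 3), tuneGrid = gr)

# Refit C5.0 directly with the tuning parameters that caret selected
c5model = C5.0.default(x = traindata[, -ncol(traindata)], y = traindata[, ncol(traindata)],
                       trials = dt$bestTune$trials, rules = dt$bestTune$model == "rules",
                       control = C5.0Control(winnow = dt$bestTune$winnow))

# Compare the class probabilities of both models on the test set
all.equal(predict(c5model, testdata[, -ncol(testdata)], type = "prob"),
          predict(dt$finalModel, testdata[, -ncol(testdata)], type = "prob"))
[1] TRUE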

So I would suggest double-checking whether your predictions really differ.

The error you see when plotting the final model from caret comes from what is stored under $call, which is unusual. We can replace it with a call that works for plotting:

plot(c5model)

finalMod = dt$finalModel
finalMod$call = c5model$call
plot(finalMod)

Or you can rebuild the call from the results of your training, but as you can see it gets a bit complicated with the expression (or at least I am not very good with it):

# Build the call with the tuned values filled in; X and Y stay unevaluated
newcall = substitute(C5.0.default(x = X, y = Y, trials = ntrials, rules = RULES, control = C5.0Control(winnow = WINNOW)),
                     list(
                       X = quote(traindata[, -ncol(traindata)]),
                       Y = quote(traindata[, ncol(traindata)]),
                       RULES = dt$bestTune$model == "rules",
                       ntrials = dt$bestTune$trials,
                       WINNOW = dt$bestTune$winnow)
)

finalMod = dt$finalModel
finalMod$call = newcall
plot(finalMod)
StupidWolf
  • Thank you for your help, it works! I think that I found the cause of mismatch of probabilities. When I define train function like this: train(data[,-10], data[,10]) probabilities match with probabilities from c5 model. But when I use this syntax: train(y ~., data=data) probabilities don't match. In your example both syntaxes work fine, I don't know why. – Helios Apr 07 '20 at 04:57
  • I found information about 2 methods of defining variables: "When using the formula method, factors and other classes are preserved (i.e. dummy variables are not automatically created). This particular model handles non-numeric data of some types (such as character, factor and ordered data)." (https://topepo.github.io/C5.0/reference/C5.0.html). – Helios Apr 07 '20 at 07:50
  • I see.. yeah i see the factor being written again in the call. Thanks for sharing the link. Hope everything works now for you? – StupidWolf Apr 07 '20 at 08:25
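
The mismatch discussed in the comments above can be illustrated with base R alone. This is only a sketch under an assumption: that train(y ~ ., data = ...) hands the model a dummy-coded matrix, much like model.matrix() produces, while train(x, y) passes the predictors through unchanged. The toy data frame and its column names are made up for illustration.

```r
# Toy data frame with one factor predictor (hypothetical, for illustration only)
toy <- data.frame(
  age   = c(23, 35, 41, 52),
  group = factor(c("a", "b", "c", "a")),
  y     = factor(c("no", "yes", "yes", "no"))
)

# x/y style: predictors are passed as they are,
# so `group` stays a single factor column
x_raw <- toy[, c("age", "group")]
str(x_raw)

# Formula style: model.matrix() expands `group` into 0/1 dummy columns
# (assumption: this mirrors what the formula interface of train() does)
x_dummy <- model.matrix(y ~ ., data = toy)[, -1]  # drop the intercept column
colnames(x_dummy)
# [1] "age"    "groupb" "groupc"
```

If that is indeed the cause, a model fitted on x_dummy-style predictors and one fitted on x_raw-style predictors see different variables and can grow different trees, so fitting both models through the same interface (as in the answer's x/y example) should make the predicted probabilities comparable.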