I need to optimize the accuracy of the C4.5 algorithm on my churn dataset using RWeka's implementation (J48()). To do so, I am using the train() function from the caret package to find the optimal parameter settings (for M and C). I tried to validate the result by running J48() manually with the parameters that train() selected. Surprisingly, the manual run gave a much better result.

That raises the following questions:

  • Which parameters might be different when manually executing J48()?
  • How can I get the train() function to provide a similar or better result than with manual parameter setting?
  • Or am I totally missing something here?

I'm running the following code:

library("RWeka", lib.loc="~/R/win-library/3.3")
library("caret", lib.loc="~/R/win-library/3.3")
library("gmodels", lib.loc="~/R/win-library/3.3")

set.seed(7331)

Determine best C4.5 model with J48 by using train() from package caret:

ctrl <- trainControl(method="LGOCV", p=0.8, seeds=NA)
grid <- expand.grid(.M=25*(1:15), .C=c(0.1,0.05,0.025,0.01,0.0075,0.005))

Training the model using the full dataset "response_nochar":

rtrain <- train(churn~.,data=response_nochar,method="J48",na.action=na.pass,trControl=ctrl,tuneGrid=grid)

This returns rtrain$finalModel with a prediction accuracy of 0.6055 (a tree of size 3 with 2 leaves):

# Accuracy was used to select the optimal model using  the largest value.
# The final values used for the model were C = 0.005 and M = 25.

There were approx. 50 combinations with exactly 0.6055 accuracy, ranging from the values of the final model up to (M=325, C=0.1), with one exception in between.

Trying out the parameter values manually with J48:

# splitting into training and test datasets, deriving from full dataset "response_nochar"
# similar/equal to the above splitting with LGOCV and p=0.8?
response_sample <- sample(10000, 8000)
response_train <- response_nochar[response_sample,]
response_test <- response_nochar[-response_sample,]
# setting parameters
jctrl <- Weka_control(M=25,C=0.005)

Calculating the model:

c45 <- J48(churn~.,data=response_train,na.action=na.pass,control=jctrl)

Predict by using the test dataset:

pred_c45 <- predict(c45, newdata=response_test, na.action=na.pass)

The model predicts with an accuracy of 0.655 (a tree of size 25 with 13 leaves).

CrossTable(response_test$churn, pred_c45, prop.chisq= FALSE, prop.c= FALSE, prop.r= FALSE, dnn= c('actual churn','predicted churn'))
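
For reference, this is a minimal sketch of how an accuracy figure like the one above can be derived from actual and predicted labels. The vectors `actual` and `predicted` are hypothetical toy data standing in for response_test$churn and pred_c45; they are not values from my dataset.

```r
# Hypothetical toy labels (assumption: stand-ins for response_test$churn and pred_c45)
actual    <- factor(c("yes", "no", "yes", "no", "yes"), levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no",  "no", "yes"), levels = c("no", "yes"))

# Confusion matrix: rows are actual classes, columns are predicted classes
conf_mat <- table(actual, predicted)

# Accuracy is the share of observations on the diagonal (correct predictions)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

The same ratio can be written as mean(predicted == actual) when both factors share the same levels.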

PS: The dataset I use contains 10000 records and the target variable's distribution is 50:50.

m3ph