Decision tree Analysis issue

Question

I'm working on a problem in R. I want to fit a classification tree to a data set, but the result seems wrong: I've already solved the same problem using Weka and got different results.

The data set is in a CSV file, as follows:

age,menopause,tumor.size,inv.nodes,node.caps,deg.malig,breast,breast.quad,irradiat,class
40-49,premeno,15-19,0-2,yes,3,right,left_up,no,recurrence-events
50-59,ge40,15-19,0-2,no,1,right,central,no,no-recurrence-events
50-59,ge40,35-39,0-2,no,2,left,left_low,no,recurrence-events
40-49,premeno,35-39,0-2,yes,3,right,left_low,yes,no-recurrence-events
40-49,premeno,30-34,3-5,yes,2,left,right_up,no,recurrence-events

and this is the script:

# Read the CSV file; ctree() needs factor columns, so convert strings to factors
cancer <- read.csv("cancer.csv", stringsAsFactors = TRUE)
# Data exploration
summary(cancer)
str(cancer)
# Split into training (~70%) and test (~30%) sets
set.seed(1234)
ind <- sample(2, nrow(cancer), replace = TRUE, prob = c(0.7, 0.3))
trainData <- cancer[ind == 1, ]
testData <- cancer[ind == 2, ]
# Build the model
library(party)
cancermodel <- class ~ age + menopause + tumor.size + inv.nodes + node.caps + deg.malig + breast + breast.quad + irradiat
cancertree <- ctree(cancermodel, data = trainData)
# Confusion matrix on the training set
table(predict(cancertree), trainData$class)
# Draw the tree
plot(cancertree, type = "simple")
# Confusion matrix on the test set
testPred <- predict(cancertree, newdata = testData)
table(testPred, testData$class)

Because it is the same algorithm that I applied in both cases (decision tree) - Zak, Oct 25 '16 at 00:30
And your randomly sampled training and test data, should they be the same? - rawr, Oct 25 '16 at 00:31
Not really, but they should normally give similar results. The samples are homogeneous ... - Zak, Oct 25 '16 at 01:56
What size is `cancer`? And why should the results of a random assignment of class-status result in a similar model? - IRTFM, Oct 25 '16 at 03:24

score 2 · Answer 1 · answered Oct 25 '16 at 08:34

Decision trees have many different implementations in R (tree, rpart, party) and in Weka (J48, LMT, DecisionStump), and different algorithms are likely to produce different trees on the same data set: some choose splits by maximizing information gain or the Gini index, others use hypothesis tests based on chi-square statistics.

Even a single algorithm will produce different decision trees with different input parameters (pruned or unpruned, minimum number of data points a node needs before it can be split, etc.).
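
As a rough illustration of the parameter effect, here is a minimal sketch using rpart on iris. The particular values (minsplit = 2, cp = 0) are my own choice for the demonstration, not settings taken from the question or from Weka:

```r
library(rpart)

# Tree grown with rpart's default control settings.
default_tree <- rpart(Species ~ ., data = iris)

# Same algorithm and data, but with stopping rules relaxed
# (illustrative values): allow splits on nodes with as few as
# 2 observations and do not require any complexity improvement.
deep_tree <- rpart(Species ~ ., data = iris,
                   control = rpart.control(minsplit = 2, cp = 0))

# Each row of $frame is one node, so the row count is the tree size.
n_default <- nrow(default_tree$frame)
n_deep <- nrow(deep_tree$frame)
cat("nodes (default):", n_default, "\n")
cat("nodes (relaxed):", n_deep, "\n")
```

The relaxed tree ends up with more nodes than the default one, even though nothing changed except the control parameters. Weka's J48 has its own defaults for pruning and minimum leaf size, so matching them by hand is part of the work.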

Also, as pointed out by @RomRom, a decision tree is not a very robust model, in the sense that a slight change in the training data may produce a different tree altogether.
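
You can check this sensitivity yourself with something like the sketch below. The helper fit_on_sample() is a hypothetical name I made up for the example, and iris stands in for your data; whether the two trees actually differ depends on the samples drawn:

```r
library(rpart)

# Hypothetical helper: fit a tree on a random 70% subset of iris
# drawn with the given seed.
fit_on_sample <- function(seed) {
  set.seed(seed)
  idx <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
  rpart(Species ~ ., data = iris[idx, ])
}

tree_a <- fit_on_sample(1)
tree_b <- fit_on_sample(2)

# Print both trees and compare the split variables and thresholds.
print(tree_a)
print(tree_b)
```

If the split variables or thresholds change between the two printouts, that is the instability at work, and it would show up in the same way between an R sample and a Weka sample.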

With all this in mind, it is difficult to reproduce the same decision tree in R and Weka. Even where it is possible, you have to tune the model parameters very carefully, which may require a lot of experimentation.

The following example fits a few R decision tree models and an RWeka decision tree model to the iris data set. As the plotted trees show, different models generate different trees from the same training data.

library(RWeka)
m1 <- J48(Species ~ ., data = iris)
if(require("partykit", quietly = TRUE)) plot(m1)

library(rpart)
m2 <- rpart(Species ~ ., data = iris)
library(rpart.plot)
prp(m2)

library(party)
m3 <- ctree(Species ~ ., data = iris)
plot(m3)

score 1 · Answer 2 · answered Oct 25 '16 at 07:59

You selected a random sample in your R code: ind <- sample(2, nrow(cancer), replace=TRUE, prob=c(0.7, 0.3))

How did you replicate and use the same random sample in Weka? Trees are very non-robust models and can change considerably with different data files.
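
One way to make both tools see the same rows is to do the split once in R and write the two parts to CSV files that Weka can load. This is only a sketch: iris stands in for your cancer data frame, and the file names and the use of tempdir() are placeholders I chose for the example:

```r
# Do the split once, with a fixed seed, as in the question.
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train_data <- iris[ind == 1, ]
test_data <- iris[ind == 2, ]

# Write both parts to disk (placeholder paths) so that Weka can
# train and test on exactly the same rows.
out_dir <- tempdir()
train_path <- file.path(out_dir, "train.csv")
test_path <- file.path(out_dir, "test.csv")
write.csv(train_data, train_path, row.names = FALSE)
write.csv(test_data, test_path, row.names = FALSE)
```

In Weka you would then load the training file to build the tree and supply the test file as the test set, instead of letting Weka draw its own percentage split. The trees may still differ because the algorithms differ, but at least the data no longer does.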
