I have a dataset with 277 observations and a binary response variable: 0 indicates no disease and 1 indicates disease. I know that 180 of the observations have no disease and 97 have the disease. I built a model and constructed a classification tree to see how well it predicts who has the disease and who doesn't. I used the rpart function to construct the tree and ran summary on it.

    library(rpart)
    mytree <- rpart(y ~ x1 + x2 + x3 + x4, method = "class")
    summary(mytree)

My question is: how do I know what percentage of the data is classified correctly at each tip? Suppose my output is as follows:

    Node number 1: 277 observations,    complexity param=0.134
      predicted class=0  expected loss=0.35  P(node) =1
      class counts:   180    97
      probabilities: 0.650 0.350
      left son=2 (156 obs) right son=3 (121 obs)
      Primary splits:
      x1     < 1.73 to the left,  improve=17.80, (0 missing)
      x3     < 1.44 to the left,  improve=17.80, (0 missing)
      x2     < 1.35 to the left,  improve=16.40, (0 missing)
      x4     < 3.5  to the left,  improve= 1.36, (0 missing)
      Surrogate splits:
      x2     < 1.35 to the left,  agree=0.751, adj=0.430, (0 split)
      x3     < 1.44 to the left,  agree=0.653, adj=0.207, (0 split)
      x4     < 3.5  to the right, agree=0.578, adj=0.033, (0 split)

    Node number 2: 156 observations,    complexity param=0.0258
      predicted class=0  expected loss=0.192  P(node) =0.563
      class counts:   126    30
      probabilities: 0.808 0.192
      left son=4 (133 obs) right son=5 (23 obs)
      Primary splits:
      x3     < 1.6  to the left,  improve=4.410, (0 missing)
      x2     < 1.83 to the left,  improve=3.990, (0 missing)
      x1     < 1.27 to the left,  improve=1.410, (0 missing)
      x4     < 4.5  to the left,  improve=0.999, (0 missing)

    Node number 4: 133 observations
      predicted class=0  expected loss=0.143  P(node) =0.48
      class counts:   114    19
      probabilities: 0.857 0.143

Note that node number 4 splits into two tips. One of the tips has 114 observations (and this is a terminal tip). It classified 114 of the 133 observations as 0. Now, how can I tell how many of the 114 are CORRECTLY classified as 0? Any insight will be greatly appreciated.

Adrian
  • The answers at http://stackoverflow.com/questions/11831794/testing-rules-generated-by-rpart-package will partially help. – Andrie Jul 14 '14 at 08:26
  • @Adrian can you post your tree plot? It will be easier to explain. – Aashu Jul 14 '14 at 10:09

1 Answer

One thing you can do is write it on the plot:

    # Allow drawing outside the plot region so the labels are not clipped
    par(xpd = NA)
    plot(mytree)
    text(mytree, use.n = TRUE)

If use.n is set to TRUE, text() writes the number of observations of each class at each labelled node (by default, only the leaves are labelled). See ?text.rpart for more help.

For example:

    library(rpart)
    iris.tree <- rpart(Species ~ ., data = iris)
    par(xpd = NA)
    plot(iris.tree)
    text(iris.tree, use.n = TRUE)

outputs

[Figure: rpart classification tree for the iris data, with per-class counts at each leaf]

So, essentially you have 5 virginica misclassified as versicolor, and 1 versicolor misclassified as virginica.

You can also output this as a confusion matrix (rows are predicted classes, columns are true classes), although you lose the tree structure:

    > table(predict(iris.tree, type = "class"), iris$Species)

                 setosa versicolor virginica
      setosa         50          0         0
      versicolor      0         49         5
      virginica       0          1        45
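If you want percentages rather than counts, the confusion matrix can be reduced to an overall accuracy and a per-prediction accuracy. The following is only a sketch on the same iris tree; the names `conf`, `accuracy` and `correct.by.prediction` are made up for illustration, and the figures describe the training data, not new data.

```r
library(rpart)

iris.tree <- rpart(Species ~ ., data = iris)

# Rows are predicted classes, columns are true classes
conf <- table(predicted = predict(iris.tree, type = "class"),
              actual    = iris$Species)

# Overall fraction of training observations classified correctly
accuracy <- sum(diag(conf)) / sum(conf)

# For each predicted class, the fraction of those predictions that are correct
correct.by.prediction <- diag(conf) / rowSums(conf)

print(accuracy)
print(round(correct.by.prediction, 3))
```

With the table above this gives 144/150 = 0.96 overall, and 50/50, 49/54 and 45/46 for the three predicted classes.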

If you want to do it programmatically you can use this (following from the iris example):

    leaves <- which(iris.tree$frame$var == "<leaf>")
    iris.tree$frame$yval2[leaves,]

                                             nodeprob
    [1,] 1 50  0  0 1 0.00000000 0.00000000 0.3333333
    [2,] 2  0 49  5 0 0.90740741 0.09259259 0.3600000
    [3,] 3  0  1 45 0 0.02173913 0.97826087 0.3066667
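In that matrix the first column is the predicted class, the next three are the class counts, the next three are the class probabilities, and the last is the node probability. To get the percentage classified correctly at each leaf directly, something like the sketch below should work. It assumes the default priors and loss matrix, in which case the `dev` column of the frame is the number of observations in the node that do not belong to the predicted class; `leaf.summary` and its column names are just illustrative.

```r
library(rpart)

iris.tree <- rpart(Species ~ ., data = iris)

fr <- iris.tree$frame
leaves <- fr$var == "<leaf>"

# For a classification tree fitted with default priors and losses,
# `n` is the number of observations in the node and `dev` is the
# number of those that do not belong to the predicted class.
n.correct <- fr$n[leaves] - fr$dev[leaves]

leaf.summary <- data.frame(
  node        = as.integer(rownames(fr)[leaves]),
  predicted   = attr(iris.tree, "ylevels")[fr$yval[leaves]],
  n           = fr$n[leaves],
  n.correct   = n.correct,
  pct.correct = round(100 * n.correct / fr$n[leaves], 1)
)
print(leaf.summary)
```

The same reading applies to the summary output in the question: node 4 predicts class 0 and has class counts 114 and 19, so 114 of its 133 observations (about 85.7%) are classified correctly and 19 are not, which is the reported expected loss of 0.143.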
nico