
I've recently been working with rpart and ran into a calculation I don't understand.

When using information gain, how is the "improve" statistic (or the variable importance — they seem to be the same from my tests) calculated?

As a dummy example, I tried learning the following table:

   happy,class
   yes,p
   no,n

with the command:

   fit <- rpart(class ~ happy, data = train, parms = list(split = "information"), minsplit = 0)

It's simple, and returns the expected tree with the root and then each leaf containing one element.
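For reference, here is a fully self-contained version of the above (the `train` data frame is built inline from the table; column names are taken from the header row):

```r
library(rpart)

# Build the two-row training set from the table above
train <- data.frame(happy = c("yes", "no"), class = c("p", "n"))

# Fit with the information (entropy) splitting rule, as in the question
fit <- rpart(class ~ happy, data = train,
             parms = list(split = "information"), minsplit = 0)

# The "improve" value for the root split is reported in the splits matrix
fit$splits
```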

Where this gets confusing is that the improvement given for the split is 1.386294.

I would expect the improvement here to be 1 (going from entropy 1 to entropy 0 in the children). What am I missing?

Greg
  • Hi Greg, welcome to stackoverflow! Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that people can help you – Julian Zucker Aug 24 '17 at 16:35
  • Hi Julian, I cleaned up the original post - general insight into how the statistic is calculated is also welcome! – Greg Aug 24 '17 at 16:42
  • rpart is an implementation of CART. It uses GINI to decide node splits, not entropy. – G5W Aug 24 '17 at 17:12
  • Hi @G5W, while this is true by default, when split="information" is specified, it should use entropy. Source: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf page 23. – Greg Aug 24 '17 at 17:16
  • I see that you used that. I stand corrected. – G5W Aug 24 '17 at 18:18

1 Answer


Well, to answer this one: it's because rpart uses the natural log, not log base 2.

Thus, it seems that the improve score is the reduction in entropy scaled by the number of elements in the node.

The scaled entropy in the root node is: -ln(1/2)*(1/2)*2 + -ln(1/2)*(1/2)*2 = -2*ln(1/2) = 2*ln(2) ≈ 1.386. The entropy in both leaf nodes is 0.
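The arithmetic can be checked directly in R (note that `log()` in R is the natural log):

```r
# Entropy (in nats) of the root node, which holds one "p" and one "n"
p <- c(1/2, 1/2)
root_entropy <- -sum(p * log(p))   # ln(2), about 0.693

# Both leaves are pure, so their entropy is 0; the improvement is the
# drop in entropy scaled by the n = 2 observations in the node
improve <- 2 * (root_entropy - 0)
improve                            # 1.386294, matching rpart's report
```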

Why they use natural log, I have no idea.

Greg