
I am using the rpart package in RStudio. I'm not sure whether I need to reduce the font size, spread the branches out more, or do some kind of pruning?

c.tree1 <- rpart(certified ~ grade + forum.posts + assignment, 
                 method="class", data=M1, 
                 control=rpart.control(minsplit=1,minbucket=1, cp=0.001))
> #Check the results from the classification tree using the printcp() command
> printcp(c.tree1)

Classification tree:
rpart(formula = certified ~ grade + forum.posts + assignment, 
    data = M1, method = "class", control = rpart.control(minsplit = 1, 
        minbucket = 1, cp = 0.001))

Variables actually used in tree construction:
[1] assignment  forum.posts grade      

Root node error: 204/1000 = 0.204

n= 1000 

          CP nsplit rel error xerror     xstd
1  0.0044563      0   1.00000 1.0000 0.062466
2  0.0039216     20   0.90196 1.1373 0.065433
3  0.0036765     36   0.83333 1.2549 0.067651
4  0.0032680     40   0.81863 1.3088 0.068577
5  0.0029412     53   0.77451 1.3627 0.069448
6  0.0028011     65   0.73529 1.3627 0.069448
7  0.0024510    100   0.61765 1.4657 0.070968
8  0.0016340    198   0.37255 1.4853 0.071237
9  0.0012255    250   0.27451 1.6029 0.072720
10 0.0010000    262   0.25980 1.6324 0.073056
> #Plot your tree
> plot(c.tree1)
> text(c.tree1)

[screenshot: the plotted tree, with hundreds of overlapping labels, is unreadable]

    That's not really a question for us. You need to decide how you want to display your data. It seems like you have an awful lot of data there. What do you actually want to draw? Also, it works best when asking for help to include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we can run the code and re-create the plot ourselves. Otherwise it's hard to make specific suggestions. – MrFlick Dec 21 '16 at 20:12
    You absolutely need to do some pruning. If I'm not mistaken, it looks like you grew out the partitions so that each subject is classified into their own partition (`minbucket=1`). This is overfit data and isn't particularly useful. Look into objective methods for determining the number of splits and the size of your trees. Maybe start here with this link: [link](http://stackoverflow.com/questions/29197213/what-is-the-difference-between-rel-error-and-x-error-in-a-rpart-decision-tree) – David Dec 22 '16 at 03:30
    +1 for the comment by @David. If I understand the information you provided correctly, you have grown the tree until all nodes are pure (i.e., belong to the same response class). For 1000 observations you have used 262 splits and thus 263 terminal nodes. This is very unlikely to be useful in practice. Moreover, the cross-validation error (`xerror`) reported by the fit is smallest for the root node without any splits. Thus, it seems to be very important to either prune your `rpart` tree or possibly use a pre-pruned tree based on significance testing like `ctree` from `partykit`. – Achim Zeileis Dec 25 '16 at 16:03
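The pruning the commenters describe can be sketched as follows. This assumes `c.tree1` is the fit shown in the question (the `M1` data are not provided, so this cannot be run verbatim). Note that in the posted `printcp()` table the minimum cross-validated error (`xerror`) occurs at the root with zero splits, so cp-based pruning would collapse this particular tree entirely; the mechanics are shown anyway:

```r
library(rpart)

# Pick the cp value that minimizes cross-validated error (xerror)
# from the fitted tree's complexity-parameter table.
best.cp <- c.tree1$cptable[which.min(c.tree1$cptable[, "xerror"]), "CP"]

# Prune the overgrown tree back to that complexity parameter.
c.tree.pruned <- prune(c.tree1, cp = best.cp)

# Plot the (much smaller) pruned tree with slightly reduced text size.
plot(c.tree.pruned, uniform = TRUE, margin = 0.1)
text(c.tree.pruned, use.n = TRUE, cex = 0.8)
```

A common variant is the one-standard-error rule: choose the largest cp whose `xerror` is within `min(xerror) + xstd` of the minimum, which yields a smaller, more stable tree.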

0 Answers