
I have a purely categorical dataframe from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

I am using rpart to build a decision tree that predicts a new variable indicating whether patients return within 30 days (a new Failed category).

I am using the following parameters for my decision tree:

    tree_model <- rpart(
        Failed ~ race + gender + age + time_in_hospital + medical_specialty +
            num_lab_procedures + num_procedures + num_medications +
            number_outpatient + number_emergency + number_inpatient +
            number_diagnoses + max_glu_serum + A1Cresult + metformin +
            glimepiride + glipizide + glyburide + pioglitazone +
            rosiglitazone + insulin + change,
        method = "class",
        data = training_data,
        control = rpart.control(minsplit = 2, cp = 0.0001, maxdepth = 20, xval = 10),
        parms = list(split = "gini")
    )

Printing the results yields:

              CP nsplit rel error  xerror     xstd
    1 0.00065883      0   1.00000  1.0000 0.018518
    2 0.00057648      8   0.99424  1.0038 0.018549
    3 0.00025621     10   0.99308  1.0031 0.018543
    4 0.00020000     13   0.99231  1.0031 0.018543

I see that the relative error goes down as the decision tree branches off, but the xerror goes up. I don't understand this, as I would have thought the error would reduce the more branches there are and the more complex the tree is.

I take it that the xerror is the most important figure here, since most tree-pruning methods would cut this tree back at the root (its xerror never improves on 1.0).

Why is the xerror the value focused on when pruning the tree? And when we summarise the error of the decision tree classifier, is the error 0.99231 or 1.0031?

user1745691
  • This is a conceptual problem, not a coding question. You should find a forum where this is on-topic. Perhaps CrossValidated.com. – IRTFM Mar 22 '15 at 17:03
  • I'm voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Nov 17 '21 at 09:18

2 Answers


The xerror is the cross-validation error (rpart has built-in cross-validation). You use the three columns rel error, xerror, and xstd together to help you choose where to prune the tree.
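
For reference, this table is what printcp displays (a minimal sketch, assuming tree_model is the fitted rpart object from the question):

    # Show the complexity-parameter table: CP, nsplit, rel error, xerror, xstd
    printcp(tree_model)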

Each row represents a different height of the tree. In general, more levels in the tree mean lower classification error on the training data. However, you run the risk of overfitting: often, the cross-validation error will actually grow as the tree gains more levels (at least beyond the 'optimal' level).

A common rule of thumb (the one-standard-error rule) is to choose the lowest level where xerror < min(xerror) + xstd.

If you run plotcp on your output, it will also show you the optimal place to prune the tree.
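
For example (a minimal sketch; the cp value passed to prune is purely illustrative and would be read off the plot):

    library(rpart)

    # Plot cross-validated error against tree size; the dashed line marks
    # min(xerror) + xstd, the usual pruning threshold
    plotcp(tree_model)

    # Prune at a cp read off the plot (0.001 here is illustrative)
    pruned_model <- prune(tree_model, cp = 0.001)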

Also, see the SO thread How to compute error rate from a decision tree?

Harold Ship

I would like to add some info to @Harold Ship's answer. There are actually three ways to select the optimal cp value for pruning (the first two are sketched in code after the list):

  1. Use the first level (i.e. the least nsplit) with minimum xerror; the 'first' part only matters when multiple levels share the same minimum xerror. This is the most commonly used method.

  2. Use the first level where xerror falls within one xstd of min(xerror), i.e. xerror < min(xerror) + xstd; on the plotcp plot, this is the first level whose xerror is at or below the horizontal line. This method takes into account the variability of xerror resulting from cross-validation.

    Note: rel error should NOT be used for pruning.

  3. (A rarely used method) Use the first level where the interval xerror ± xstd overlaps min(xerror) ± xstd, i.e. the first level whose lower limit (xerror − xstd) is at or below the horizontal line.
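
Methods 1 and 2 can be written directly against the model's cptable (a minimal sketch, assuming tree_model is the fitted rpart object from the question):

    library(rpart)

    cptab <- tree_model$cptable

    # Method 1: first row achieving the minimum xerror
    # (which.min returns the first index in case of ties)
    cp_min <- cptab[which.min(cptab[, "xerror"]), "CP"]

    # Method 2 (1-SE rule): first row whose xerror is at or below
    # min(xerror) plus the xstd of the minimum-xerror row
    cutoff <- min(cptab[, "xerror"]) + cptab[which.min(cptab[, "xerror"]), "xstd"]
    cp_1se <- cptab[which(cptab[, "xerror"] <= cutoff)[1], "CP"]

    pruned_min <- prune(tree_model, cp = cp_min)
    pruned_1se <- prune(tree_model, cp = cp_1se)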

user2830451