
I have a question about decision trees with continuous variables.

I heard that when the output variable is continuous and the input variable is categorical, the split criterion is variance reduction or something similar, but I don't know how it works when the input variable is continuous:

  1. input variable : continuous / output variable : categorical

  2. input variable : continuous / output variable : continuous

For these two cases, how can we get a split criterion analogous to the Gini index or information gain?

When I use rpart in R, it works well whatever the input and output variables are, but I don't know the algorithm in detail.

BSKim
  • This is not a technical question: consider posting on Cross Validated or Data Science. – Eric Lecoutre Nov 30 '16 at 13:52
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory/methodology. – desertnaut Jul 31 '21 at 13:45

2 Answers


1) Input variable: continuous / output variable: categorical
The C4.5 algorithm handles this case.

In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
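As a rough illustration of that threshold search (a minimal sketch in Python, not C4.5 itself; the function and variable names are my own), you can sort the distinct values of the continuous attribute, take the midpoints between consecutive values as candidate thresholds, and keep the one that maximizes information gain on the categorical target:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    """Find the threshold on continuous feature x that maximizes
    information gain for categorical target y (C4.5-style binary split)."""
    base = entropy(y)
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = (values[:-1] + values[1:]) / 2
    best_gain, best_t = -np.inf, None
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        gain = base - (w_left * entropy(left) + w_right * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Toy example: a continuous feature separating two classes
x = np.array([1.4, 1.5, 1.3, 4.7, 4.5, 5.1])
y = np.array(["A", "A", "A", "B", "B", "B"])
print(best_threshold(x, y))  # threshold 3.0, gain of 1 bit
```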

2) Input variable: continuous / output variable: continuous
The CART (classification and regression trees) algorithm handles this case.

Case 2 is the regression problem. You enumerate each attribute j, and enumerate the candidate split values s of that attribute, splitting the samples into those whose value of attribute j is less than or equal to s and those whose value is above it. Each split gives two regions:

$$R_1(j, s) = \{\, x \mid x_j \le s \,\}, \qquad R_2(j, s) = \{\, x \mid x_j > s \,\}$$

Find the best attribute j and the best split value s by solving

$$\min_{j,\, s} \left[\, \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \,\right]$$

c_1 and c_2 can be solved for directly: each is the mean of the responses in its region,

$$\hat{c}_1 = \operatorname{ave}\left(y_i \mid x_i \in R_1(j, s)\right), \qquad \hat{c}_2 = \operatorname{ave}\left(y_i \mid x_i \in R_2(j, s)\right)$$

Then, when doing regression, the fitted tree predicts

$$f(x) = \sum_{m=1}^{M} \hat{c}_m \, I(x \in R_m)$$

where

$$\hat{c}_m = \operatorname{ave}\left(y_i \mid x_i \in R_m\right)$$
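A minimal Python sketch of that exhaustive search over (j, s) (illustrative code of my own, not CART's actual implementation): for every feature and every candidate split point, compute the squared error around the two region means and keep the pair that minimizes it.

```python
import numpy as np

def best_regression_split(X, y):
    """Exhaustively search for the (feature j, threshold s) pair that
    minimizes the total squared error of the two resulting regions,
    where each region is predicted by the mean of its responses."""
    n, d = X.shape
    best = (None, None, np.inf)  # (j, s, sum of squared errors)
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for s in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= s
            y1, y2 = y[mask], y[~mask]
            c1, c2 = y1.mean(), y2.mean()  # region means c_1 and c_2
            sse = ((y1 - c1) ** 2).sum() + ((y2 - c2) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Toy example: one informative feature, one noise feature
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 6.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.3])
print(best_regression_split(X, y))  # splits on feature 0 near s = 6.5
```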

Vito

I can explain the concept at a very high level.

The main goal of the algorithm is to find the attribute to use for the first split. We can use various impurity metrics to evaluate which attribute is most significant; these include information gain, entropy, and gain ratio. But if the decision variable is continuous, we usually use another metric, 'standard deviation reduction'. Whatever metric you use, depending on your algorithm (e.g. ID3, C4.5), you end up finding an attribute that will be used for splitting.

When you have a continuous attribute, things get a little tricky. You need to find a threshold value for that attribute that gives you the highest impurity reduction (information gain, gain ratio, whatever metric you chose). Then you compare attributes by the score of their best threshold and choose the winning attribute accordingly, right?

Now, if the attribute is continuous and the decision variable is also continuous, then you can simply combine the above two concepts and generate a regression tree.

That means, since the decision variable is continuous, you use a metric like variance reduction and choose the attribute whose best threshold gives you the highest value of that metric across all attributes, as sketched below.
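To make the variance-reduction metric concrete (a small illustrative sketch with made-up names, not tied to any particular software or library): for a candidate threshold, compare the variance of the target before the split with the size-weighted variance of the two resulting groups.

```python
import numpy as np

def variance_reduction(x, y, threshold):
    """Variance reduction achieved by splitting continuous feature x
    at `threshold`, for a continuous target y."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted_child_var = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted_child_var

# Only midpoints between consecutive sorted values need to be checked;
# the best of those is the best threshold overall.
x = np.array([2.0, 3.0, 4.0, 20.0, 21.0])
y = np.array([1.0, 1.1, 0.9, 7.0, 7.2])
candidates = (np.unique(x)[:-1] + np.unique(x)[1:]) / 2
best = max(candidates, key=lambda t: variance_reduction(x, y, t))
print(best, variance_reduction(x, y, best))  # threshold 12.0
```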

You can visualize such a regression tree using decision tree machine learning software such as SpiceLogic Decision Tree Software. Say you have a data table like this:

[screenshot of a sample data table]

The software will generate the Regression tree like this:

[screenshot of the generated regression tree]

Emran Hussain
  • *"You need to find a threshold value for an attribute that will give you the highest impurity"* How would you do that though? There is an infinite number of thresholds in the case of continuous variables, and the metrics are not differentiable with respects to the threshold. – Mehdi Charife Apr 13 '23 at 04:25