
I have a question about decision trees with continuous variables.

I heard that when the output variable is continuous and the input variable is categorical, the split criterion is variance reduction or something similar, but I don't know how it works when the input variable is continuous:

  1. input variable : continuous / output variable : categorical

  2. input variable : continuous / output variable : continuous

For these two cases, how can we get a split criterion analogous to the Gini index or information gain?

When I use rpart in R, it works well whatever the input and output variables are, but I don't know the algorithm in detail.

BSKim
  • This is not a technical question: consider posting on Cross Validated or Data Science. – Eric Lecoutre Nov 30 '16 at 13:52
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory/methodology. – desertnaut Jul 31 '21 at 13:45

2 Answers


1) Input variable: continuous / output variable: categorical
The C4.5 algorithm handles this case.

In order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
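As a rough illustration of that threshold search (a minimal sketch in Python, not C4.5 itself; the function and variable names are my own), you can sort the distinct values of the continuous attribute, take the midpoints between consecutive values as candidate thresholds, and keep the one that maximizes information gain on the categorical target:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    """Find the threshold on continuous feature x that maximizes
    information gain for categorical target y (C4.5-style binary split)."""
    base = entropy(y)
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = (values[:-1] + values[1:]) / 2
    best_gain, best_t = -np.inf, None
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        gain = base - (w_left * entropy(left) + w_right * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Toy example: a continuous feature separating two classes
x = np.array([1.4, 1.5, 1.3, 4.7, 4.5, 5.1])
y = np.array(["A", "A", "A", "B", "B", "B"])
print(best_threshold(x, y))  # threshold 3.0, gain of 1 bit
```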

2) Input variable: continuous / output variable: continuous
The CART (classification and regression trees) algorithm handles this case.

Case 2 is the regression problem. You enumerate each attribute j, and enumerate the candidate split values s of that attribute, splitting the samples into those whose value of attribute j is less than or equal to s and those whose value is above it. Each split gives two regions:

$$R_1(j, s) = \{\, x \mid x_j \le s \,\}, \qquad R_2(j, s) = \{\, x \mid x_j > s \,\}$$

Find the best attribute j and the best split value s by solving

$$\min_{j,\, s} \left[\, \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \,\right]$$

c_1 and c_2 can be solved for directly: each is the mean of the responses in its region,

$$\hat{c}_1 = \operatorname{ave}\left(y_i \mid x_i \in R_1(j, s)\right), \qquad \hat{c}_2 = \operatorname{ave}\left(y_i \mid x_i \in R_2(j, s)\right)$$

Then, when doing regression, the fitted tree predicts

$$f(x) = \sum_{m=1}^{M} \hat{c}_m \, I(x \in R_m)$$

where

$$\hat{c}_m = \operatorname{ave}\left(y_i \mid x_i \in R_m\right)$$
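A minimal Python sketch of that exhaustive search over (j, s) (illustrative code of my own, not CART's actual implementation): for every feature and every candidate split point, compute the squared error around the two region means and keep the pair that minimizes it.

```python
import numpy as np

def best_regression_split(X, y):
    """Exhaustively search for the (feature j, threshold s) pair that
    minimizes the total squared error of the two resulting regions,
    where each region is predicted by the mean of its responses."""
    n, d = X.shape
    best = (None, None, np.inf)  # (j, s, sum of squared errors)
    for j in range(d):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values
        for s in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= s
            y1, y2 = y[mask], y[~mask]
            c1, c2 = y1.mean(), y2.mean()  # region means c_1 and c_2
            sse = ((y1 - c1) ** 2).sum() + ((y2 - c2) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best

# Toy example: one informative feature, one noise feature
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 6.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.3])
print(best_regression_split(X, y))  # splits on feature 0 near s = 6.5
```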

Vito

I can explain the concept at a very high level.

The main goal of the algorithm is to find the attribute to use for the first split. We can use various impurity metrics to evaluate which attribute is most significant; these include information gain, entropy, and gain ratio. But if the decision variable is continuous, we usually use another metric, 'standard deviation reduction'. Whatever metric you use, depending on your algorithm (e.g. ID3, C4.5), you end up finding an attribute that will be used for splitting.

When you have a continuous attribute, things get a little tricky. You need to find a threshold value for that attribute that gives you the highest impurity reduction (information gain, gain ratio, whatever metric you chose). Then you compare attributes by the score of their best threshold and choose the winning attribute accordingly, right?

Now, if the attribute is continuous and the decision variable is also continuous, then you can simply combine the above two concepts and generate a regression tree.

That means, since the decision variable is continuous, you use a metric like variance reduction and choose the attribute whose best threshold gives you the highest value of that metric across all attributes, as sketched below.
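To make the variance-reduction metric concrete (a small illustrative sketch with made-up names, not tied to any particular software or library): for a candidate threshold, compare the variance of the target before the split with the size-weighted variance of the two resulting groups.

```python
import numpy as np

def variance_reduction(x, y, threshold):
    """Variance reduction achieved by splitting continuous feature x
    at `threshold`, for a continuous target y."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted_child_var = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted_child_var

# Only midpoints between consecutive sorted values need to be checked;
# the best of those is the best threshold overall.
x = np.array([2.0, 3.0, 4.0, 20.0, 21.0])
y = np.array([1.0, 1.1, 0.9, 7.0, 7.2])
candidates = (np.unique(x)[:-1] + np.unique(x)[1:]) / 2
best = max(candidates, key=lambda t: variance_reduction(x, y, t))
print(best, variance_reduction(x, y, best))  # threshold 12.0
```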

You can visualize such a regression tree using decision tree machine learning software such as SpiceLogic Decision Tree Software. Say you have a data table like this:

[screenshot of a sample data table]

The software will generate the Regression tree like this:

[screenshot of the generated regression tree]

Emran Hussain
  • *"You need to find a threshold value for an attribute that will give you the highest impurity"* How would you do that though? There is an infinite number of thresholds in the case of continuous variables, and the metrics are not differentiable with respects to the threshold. – Mehdi Charife Apr 13 '23 at 04:25