I have a general question that's probably not fit for Stack Overflow. Apologies in advance.

All the online articles display graphs like the one below. I understand Gini is used in entropy. How are the values after <= in the first line of each node generated?

The first decision node says petal length (cm) <= 2.45. I understand its literal meaning, but I don't understand how it's derived. A petal length less than or equal to 2.45 seems like an arbitrary value, and it doesn't make sense when the decision node on the following false path is petal length less than or equal to 1.75.

Extra credit: a good explanation of samples, value and class

[Image: decision tree diagram from the article linked below]

Thanks!

Source: https://medium.com/geekculture/criterion-used-in-constructing-decision-tree-c89b7339600f

merv
luckyging3r
  • I think you are looking for https://ai.stackexchange.com/questions/tagged/machine-learning – RobertoT May 26 '22 at 15:54
  • However, as a quick answer: a decision tree algorithm is a mathematical algorithm that is fitted to discover "values" that act as boundaries to divide the samples of your data among the target classes. In this case, the tree is saying that if the petal length is at most 2.45 cm it is setosa; otherwise, if the petal width is greater than 1.75 cm it is virginica, and if not, it is versicolor – RobertoT May 26 '22 at 15:56
  • The right branch doesn't look at petal length, it looks at petal *width* – 0x263A May 26 '22 at 15:58
  • You may want to look at https://stackoverflow.com/questions/40889344/decision-tree-using-continuous-variable – ThSorn May 27 '22 at 18:39

1 Answer

Decision trees have two types of nodes: leaf nodes and branch nodes. Branch nodes contain the splitting condition; leaf nodes give you the result of your classification or regression.

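The statement "Gini is used in entropy" is incorrect. Gini impurity and entropy are two alternative metrics; either one can be used to measure how much a split on a given condition reduces impurity (the information gain). For each feature, the decision tree evaluates candidate thresholds and keeps the split that results in the highest gain, which is where a value such as 2.45 comes from.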
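To make the threshold search concrete, here is a minimal sketch of how a CART-style tree might pick a split on a single feature using Gini impurity. The function names `gini` and `best_threshold` and the toy data are made up for illustration (this is not the real iris data, and not any library's actual implementation). It assumes candidate thresholds are the midpoints between adjacent sorted feature values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct feature
    values; return (threshold, weighted child impurity) of the best one."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]    # value <= threshold
        right = [label for _, label in pairs[i:]]   # value > threshold
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy petal lengths (hypothetical values chosen for illustration).
petal_length = [1.4, 1.3, 1.7, 1.9, 3.0, 4.5, 4.7, 5.1, 5.9, 6.0]
species = ["setosa"] * 4 + ["versicolor"] * 3 + ["virginica"] * 3

threshold, impurity = best_threshold(petal_length, species)
print(threshold, impurity)
```

On this toy data the best split falls halfway between 1.9 and 3.0, i.e. at 2.45, because that is the cut that isolates one class completely. A full tree repeats this search over every feature at every node, which is why a child node can end up splitting on a different feature (petal width) with its own threshold.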

class is the label predicted for the data points in that node, i.e. the most common label among them (in this case there are 3 labels for 3 different flower species).

samples indicates how many data points ended up in that node.

value is a vector which represents how the data points are distributed in terms of their classes. For instance, [0, 49, 5] means there are 0 data points with label 0, 49 data points with label 1, and 5 data points with label 2.
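As a small worked example, the other fields shown in a node can be recomputed from value alone. This sketch assumes value holds raw per-class counts, and the class name order (setosa, versicolor, virginica) is an assumption; `describe_node` is a hypothetical helper, not a library function.

```python
def describe_node(value, class_names):
    """Derive samples, class and gini from a node's per-class counts."""
    samples = sum(value)  # total data points that reached the node
    majority = max(range(len(value)), key=lambda k: value[k])  # largest count
    gini = 1.0 - sum((count / samples) ** 2 for count in value)
    return {
        "samples": samples,
        "class": class_names[majority],
        "gini": round(gini, 3),
    }


# Assumed label order: 0 = setosa, 1 = versicolor, 2 = virginica.
node = describe_node([0, 49, 5], ["setosa", "versicolor", "virginica"])
print(node)  # {'samples': 54, 'class': 'versicolor', 'gini': 0.168}
```

So samples is the sum of value, class is the label with the largest count, and gini measures how mixed the counts are (0 would mean the node contains a single class).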

Ach113