I know how cross entropy/mutual information works as a loss function for classification decisions, but I wonder why 0/1 loss is not a good choice.
-
Related thread (but from a neural network perspective): [Cost function training target versus accuracy desired goal](https://stackoverflow.com/questions/47891197/cost-function-training-target-versus-accuracy-desired-goal/47910243#47910243) – desertnaut May 15 '18 at 11:15
1 Answer
In the context of general machine learning, the primary reason 0-1 loss is seldom used is that it is not a convex loss function and is not differentiable at 0 (everywhere else its gradient is zero, so it provides no useful gradient signal). It turns out to be NP-hard to exactly optimize a problem with respect to 0-1 loss. Here is a source discussing some methods for direct optimization of the 0-1 loss.
Cross entropy can be understood as a relaxation of 0-1 loss in a way that represents the same general idea (attributing "success" to a candidate classification based on the degree to which it predicts the correct label for that example), but which is convex.
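To make that contrast concrete, here is a minimal Python sketch (the function names and the 0.5 decision threshold are just illustrative choices) comparing the two losses as a function of the probability p the model assigns to the true class:

```python
import numpy as np

def zero_one_loss(p, threshold=0.5):
    # 1 if the thresholded prediction of the true class is wrong, else 0.
    # Piecewise constant: no gradient signal except the jump at the threshold.
    return 0.0 if p >= threshold else 1.0

def cross_entropy_loss(p, eps=1e-12):
    # Negative log-likelihood of the true class: smooth and convex in p,
    # and it still rewards moving p from 0.6 to 0.9, which 0-1 loss ignores.
    return -np.log(max(p, eps))

for p in [0.05, 0.3, 0.49, 0.51, 0.7, 0.95]:
    print(f"p={p:.2f}  0-1 loss={zero_one_loss(p):.0f}  "
          f"cross entropy={cross_entropy_loss(p):.3f}")
```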
In the specific context of decision trees, which you mention in the title, there are at least two important considerations related to this.
In vanilla decision tree training, the criterion used for modifying the parameters of the model (the decision splits) is some measure of classification purity like information gain or Gini impurity, both of which represent something different from standard cross entropy in the setup of a classification problem. You actually can use 0-1 loss as the splitting criterion here, which is also known as using the misclassification rate. Here are some PDF lecture notes where on slide 19 they show a nice plot of the smooth curves for information gain and Gini impurity contrasted with the sharp point of non-differentiability for the misclassification rate.
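As a rough illustration (not from the linked notes, and the helper names are my own), the three node criteria can be computed from the class proportions at a node like this:

```python
import numpy as np

def proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    # Basis of information gain: gain = parent entropy - weighted child entropy.
    p = proportions(labels)
    return -np.sum(p * np.log2(p))

def gini_impurity(labels):
    p = proportions(labels)
    return 1.0 - np.sum(p ** 2)

def misclassification_rate(labels):
    # The 0-1 loss of predicting the majority class at this node.
    return 1.0 - proportions(labels).max()

node = np.array([0, 0, 0, 1, 1, 0, 1, 0])  # toy labels reaching one node
print(entropy(node), gini_impurity(node), misclassification_rate(node))
```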
In gradient boosted trees you once again need a differentiable loss function. This is mostly discussed in the context of regression trees using mean squared error; for classification it usually refers to either deviance loss or the "exponential" (AdaBoost) loss, but in principle you could use cross entropy in some customized way.
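For a sense of what those two per-example classification losses look like, here is a small sketch using the margin convention with labels y in {-1, +1} and raw model score f(x); note that texts differ by a factor of 2 on the margin used in the deviance:

```python
import numpy as np

def deviance_loss(y, f):
    # Binomial deviance / log loss on the margin y*f; smooth everywhere.
    return np.log1p(np.exp(-y * f))

def exponential_loss(y, f):
    # AdaBoost's exponential loss; penalizes confident mistakes much harder.
    return np.exp(-y * f)

y = 1                        # true label
f = np.linspace(-3, 3, 7)    # raw (pre-threshold) scores from the model
print(np.round(deviance_loss(y, f), 3))
print(np.round(exponential_loss(y, f), 3))
```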
For problems that benefit drastically from a convex or at least differentiable loss function, such as training a neural network based classifier, the benefits of using a relaxation like cross entropy are usually huge, and there is often not much practical value in fully optimizing the 0-1 loss.
For a plain decision tree, where you can use the 0-1 loss to calculate an accuracy metric at each proposed split, you are not dealing with the same NP-hard optimization problem; rather, you are just using 0-1 loss as the splitting criterion and still just searching over the f-by-d possible splits of f features, each with d observed values.
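A minimal sketch of that search, assuming numeric features and using the misclassification rate (0-1 loss) directly as the criterion (the function names are mine, not from any particular library):

```python
import numpy as np

def misclassification_rate(y):
    # 0-1 loss of predicting the majority class for the labels y.
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def best_split_by_01_loss(X, y):
    # Exhaustively try every (feature, threshold) pair -- the f-by-d candidate
    # splits -- and keep the one with the lowest weighted child misclassification.
    n, f = X.shape
    best = (None, None, np.inf)
    for j in range(f):                    # each of the f features
        for t in np.unique(X[:, j]):      # each of its d observed values
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * misclassification_rate(left)
                     + len(right) * misclassification_rate(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(best_split_by_01_loss(X, y))  # (0, 2.0, 0.0): feature 0 at 2.0 is perfect
```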
I'm sure you can make some hand-wavy arguments that information gain or gini impurity allow for more nuanced interpretations about the informativeness of a given feature split, or perhaps with more credibility you could argue that purely optimizing raw classification accuracy at each split could lead to bad overfitting, especially with greedy methods.
But in the end, there's no hard and fast reason why you could not use the 0-1 loss as the splitting criterion if you had some reason to believe it was a valuable way to go in a given modeling problem you are working on.

-
If 0-1 loss is not differentiable at 0, can we replace all 0-1 loss by cross entropy? – Bratt Swan May 13 '18 at 01:17
-
@BrattSwan I added a fair bit of detail specifically relating this to trees, because from the title it seems like that might be what you're focused on. It's important to understand the difference between a typical loss function and the criterion used for choosing a split in a decision tree. Beyond that distinction, there's no reason you couldn't use 0-1 loss as the criterion for making a split. – ely May 13 '18 at 01:30
-
Regarding your comment above: for problems that require directly optimizing the loss function and that benefit a lot from (or even require) a differentiable loss function, then yes, you would almost always choose a relaxation like cross entropy. For a problem where you just need to compute a score or criterion, like what's used for splitting a tree, there could occasionally be reasons to prefer the 0-1 loss, but in most cases some other criterion, like Gini impurity or information gain, is used. – ely May 13 '18 at 01:32