115

I know that there are a lot of explanations of what cross-entropy is, but I'm still confused.

Is it only a method to describe the loss function? Can we then use the gradient descent algorithm to find the minimum of that loss function?

stackoverflowuser2010
theateist
    Not a good fit for SO. Here's a similar question on the datascience sister site: http://datascience.stackexchange.com/questions/9302/the-cross-entropy-error-function-in-neural-networks – Metropolis Feb 01 '17 at 21:59
  • for a simple, non-mathematical explanation, refer to https://towardsdatascience.com/cross-entropy-classification-losses-no-math-few-stories-lots-of-intuition-d56f8c7f06b0 – Allohvk Mar 24 '21 at 16:17

3 Answers

277

Cross-entropy is commonly used to quantify the difference between two probability distributions. In the context of machine learning, it is a measure of error for categorical multi-class classification problems. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

For example, suppose for a specific training instance, the true label is B (out of the possible labels A, B, and C). The one-hot distribution for this training instance is therefore:

Pr(Class A)  Pr(Class B)  Pr(Class C)
        0.0          1.0          0.0

You can interpret the above true distribution to mean that the training instance has 0% probability of being class A, 100% probability of being class B, and 0% probability of being class C.

Now, suppose your machine learning algorithm predicts the following probability distribution:

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.228        0.619        0.153

How close is the predicted distribution to the true distribution? That is what the cross-entropy loss determines. Use this formula:

H(p, q) = -Σx p(x) log(q(x))

where p(x) is the true probability distribution (one-hot) and q(x) is the predicted probability distribution. The sum runs over the three classes A, B, and C. In this case the loss is 0.479:

H = - (0.0*ln(0.228) + 1.0*ln(0.619) + 0.0*ln(0.153)) = 0.479

Logarithm base

Note that it does not matter what logarithm base you use as long as you consistently use the same one. As it happens, the Python Numpy log() function computes the natural log (log base e).

Python code

Here is the above example expressed in Python using Numpy:

import numpy as np

p = np.array([0, 1, 0])             # True probability (one-hot)
q = np.array([0.228, 0.619, 0.153]) # Predicted probability

cross_entropy_loss = -np.sum(p * np.log(q))
print(cross_entropy_loss)
# 0.47965000629754095

So that is how "wrong" or "far away" your prediction is from the true distribution. A machine learning optimizer will attempt to minimize the loss (i.e. it will try to reduce the loss from 0.479 to 0.0).

Loss units

We see in the above example that the loss is 0.4797. Because we are using the natural log (log base e), the units are in nats, so we say that the loss is 0.4797 nats. If we instead used log base 2, the units would be in bits. See this page for further explanation.
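As a quick check of the unit conversion, here is a minimal sketch reusing the same p and q as above; the only change is swapping np.log for np.log2:

import numpy as np

p = np.array([0, 1, 0])             # True probability (one-hot)
q = np.array([0.228, 0.619, 0.153]) # Predicted probability

loss_nats = -np.sum(p * np.log(q))   # natural log -> nats
loss_bits = -np.sum(p * np.log2(q))  # log base 2  -> bits
print(loss_nats)  # 0.47965... nats
print(loss_bits)  # ~0.692 bits (= loss_nats / ln(2))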

More examples

To gain more intuition on what these loss values reflect, let's look at some extreme examples.

Again, let's suppose the true (one-hot) distribution is:

Pr(Class A)  Pr(Class B)  Pr(Class C)
        0.0          1.0          0.0

Now suppose your machine learning algorithm did a really great job and predicted class B with very high probability:

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.001        0.998        0.001

When we compute the cross entropy loss, we can see that the loss is tiny, only 0.002:

p = np.array([0, 1, 0])
q = np.array([0.001, 0.998, 0.001])
print(-np.sum(p * np.log(q)))
# 0.0020020026706730793

At the other extreme, suppose your ML algorithm did a terrible job and predicted class C with high probability instead. The resulting loss of 6.91 will reflect the larger error.

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.001        0.001        0.998

p = np.array([0, 1, 0])
q = np.array([0.001, 0.001, 0.998])
print(-np.sum(p * np.log(q)))
# 6.907755278982137

Now, what happens in the middle of these two extremes? Suppose your ML algorithm can't make up its mind and predicts the three classes with nearly equal probability.

Pr(Class A)  Pr(Class B)  Pr(Class C)
      0.333        0.333        0.334

The resulting loss is 1.10.

p = np.array([0, 1, 0])
q = np.array([0.333, 0.333, 0.334])
print(-np.sum(p * np.log(q)))
# 1.0996127890016931

Fitting into gradient descent

Cross entropy is one of many possible loss functions (another popular one is SVM hinge loss). These loss functions are typically written as J(theta) and can be used within gradient descent, which is an iterative algorithm to move the parameters (or coefficients) towards the optimum values. In the equation below, you would replace J(theta) with H(p, q). But note that you need to compute the derivative of H(p, q) with respect to the parameters first.

θj := θj - α · ∂J(θ)/∂θj    (gradient descent update for each parameter θj, with learning rate α)
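For concreteness, here is a minimal sketch of a few gradient-descent steps that minimize the cross-entropy loss of a softmax (multinomial logistic regression) model. The weight matrix W, the feature vector x, and the learning rate lr are made-up illustrative values; the gradient (q - p) xᵀ is the standard derivative of cross-entropy composed with softmax.

import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)           # one training example with 4 features (made up)
p = np.array([0.0, 1.0, 0.0])    # true one-hot distribution (class B)
W = rng.normal(size=(3, 4))      # parameters: 3 classes x 4 features
lr = 0.1                         # learning rate

for step in range(5):
    q = softmax(W @ x)               # predicted distribution
    loss = -np.sum(p * np.log(q))    # H(p, q)
    grad = np.outer(q - p, x)        # dH/dW for softmax + cross-entropy
    W -= lr * grad                   # gradient descent update
    print(f"step {step}: loss = {loss:.4f}")

The printed loss shrinks step by step; a full implementation would average the loss and gradient over a batch of training examples rather than a single one.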

So to answer your original questions directly:

Is it only a method to describe the loss function?

Correct, cross-entropy describes the loss between two probability distributions. It is one of many possible loss functions.

Can we then use, for example, the gradient descent algorithm to find the minimum?

Yes, the cross-entropy loss function can be used as part of gradient descent.

Further reading: one of my other answers related to TensorFlow.

stackoverflowuser2010
  • so, cross-entropy describes the loss by sum of probabilities for each example X. – theateist Feb 01 '17 at 22:34
  • so, can we instead of describing the error as cross-entropy, describe the error as an angle between two vectors (cosine similarity/ angular distance) and try to minimize the angle? – theateist Feb 01 '17 at 22:55
  • @theateist: Cosine (dis)similarity is usually not used. – stackoverflowuser2010 Feb 02 '17 at 01:31
  • 1
    apparently it's not the best solution, but I just wanted to know, in theory, if we could use `cosine (dis)similarity` to describe the error through the angle and then try to minimize the angle. – theateist Feb 02 '17 at 17:22
  • @stackoverflowuser2010 We don't use the cosine similarity because its error surface is not smooth, right? If we reach a cross entropy of zero, does that also mean that the cosine similarity is 1? – nadre Jul 28 '17 at 18:36
  • @nadre: I don't know the answer to the first question. For the second question: cross-entropy is generally used to compare distributions (and if two distributions are the same, you'd get a cross-entropy of 0.0). Cosine similarity is generally used to compare vectors, or features expressed as vectors (and if two vectors are the same, you'd get a cosine of 1.0). – stackoverflowuser2010 Jul 28 '17 at 19:10
  • @stackoverflowuser2010 but informally speaking a distribution is just a vector that sums up to one, so if we compare two "distribution vectors" that have a cross-entropy of 0 (they are identical), that should also mean that their cosine similarity is 1 or am I missing a point? – nadre Jul 28 '17 at 19:14
  • @nadre: Yes, you are correct. But generally, you don't see cosine similarity being used in that manner in the literature. – stackoverflowuser2010 Jul 28 '17 at 21:00
  • Just to check my understanding, in the equation for H(p,q), what are p(x) and q(x)? I think that p(x) (which would actually be P(A) or P(B) etc once x is filled in) is the actual probability of the class for a training example, and would thus always be 0 or 1. Whereas q(x) is the predicted probability for the class. So then in the problem as stated we would basically be summing over 3 values for each training example, and if we had 1000 training examples we'd repeat that equation 1000 times for one iteration of gradient descent. Do I have that all right? – Stephen Oct 20 '17 at 22:14
  • 2
    @Stephen: If you look at the example I gave, `p(x)` would be the list of ground-truth probabilities for each of the classes, which would be `[0.0, 1.0, 0.0]`. Likewise, `q(x)` is the list of predicted probabilities for each of the classes, `[0.228, 0.619, 0.153]`. `H(p, q)` is then `- (0 * log(2.28) + 1.0 * log(0.619) + 0 * log(0.153))`, which comes out to be 0.479. Note that it's common to use Python's `np.log()` function, which is actually the natural log; it doesn't matter. – stackoverflowuser2010 Oct 20 '17 at 23:02
  • @stackoverflowuser2010 Maybe I am having a slow evening but should the cross entropy not be ~0.691? `H(p, q)` should be `- ((0 * log2(0.228)) + (1.0 * log2(0.619)) + (0 * log2(0.153)))`, which would give 0.691 right? I do notice that you have 2.28 in your first log. – oreid Nov 28 '17 at 04:45
  • @Beardo: My calculation is with natural log, while yours is with the base-2 log. In practice it doesn't matter as long as you use the same function consistently throughout. – stackoverflowuser2010 Nov 28 '17 at 23:38
  • @stackoverflowuser2010 Ah gotcha, thanks for clarifying. – oreid Nov 29 '17 at 03:24
  • So cross-entropy can only work under one-hot distribution? – 吴环宇 Jan 26 '18 at 11:59
  • @吴环宇: Yes. One-hot distribution just means that all the class probabilities are 0.0 except for one which is 1.0. Just replace those numbers with other probabilities that sum up to 1.0. – stackoverflowuser2010 Jan 26 '18 at 20:27
  • @stackoverflowuser2010 Let's say I got a true distribution of 0.2, 0.3, 0.5, then I somehow figure out the true distribution through my model and some excellent algorithm. I calculate its cross-entropy and find it non-zero, but this contradicts the understanding that cross-entropy is the 'distance' between the predicted distribution and the true distribution? – 吴环宇 Jan 27 '18 at 01:09
  • @吴环宇: If the loss is 0.0, then you have a perfect match. If you have a result > 0.0, then they are different. If that is the case, then that means your distribution differs from the true distribution. – stackoverflowuser2010 Jan 27 '18 at 02:25
  • It may sound silly, but why do we do the calculations for the other classes (in one-hot encoding), when we know that we're only interested in the probability of the true class at the moment? `ln(0.619)` gives the same result. – HAr Feb 13 '18 at 11:49
  • 1
    @HAr: For one-hot encoding of the true label, there is only one non-zero class that we care about. However, cross-entropy can compare any two probability distributions; it is not necessary that one of them has one-hot probabilities. – stackoverflowuser2010 Feb 13 '18 at 20:30
  • Is it necessary to use the log operation in the formula? Without the log, the formula is simply the definition of the inner product (with a minus sign) of the two probability distributions: the one-hot encoded ground-truth label and the predicted softmax distribution. A greater inner product means a better match of the two distributions, and with a minus sign we get a small loss; on the other hand, a smaller inner product means a worse match, and with a minus sign we get a larger loss. So it is still a valid way of loss computation, right? The only difference is that we now get a negative value as the loss. – Francis May 24 '18 at 12:57
  • @Francis, I believe what you are describing is almost exactly the hinge loss (which is actually 1 - inner_product). So, yes, it could be one loss function, as you describe it, but the one in the original question is one particular loss, the cross-entropy, and entropy, as in information theory and physics, always involves a logarithm. – Jblasco Aug 08 '18 at 16:09
  • best answer on cross entropy everrrrrrrrrrrrr – user1906450 Feb 28 '19 at 13:05
  • Do we really need to do the summation with one hot encoded vector? Can't we just use: `-log(prediction that corresponds to label 1)`, since logs of other predictions will be multiplied by zeros. – Akavall Dec 10 '19 at 06:35
  • @Akavall: If you are using cross-entropy loss for tasks where `p(x)` is one-hot, then yes, you can compute only `log q(x)` for the class that corresponds to the true label. However, cross-entropy can be used to compare any two probability distributions even if neither one is one-hot. – stackoverflowuser2010 Jan 03 '20 at 16:50
5

In short, cross-entropy (CE) is a measure of how far your predicted value is from the true label.

The "cross" here refers to computing the entropy across two distributions: the predicted probabilities and the true labels (like 0, 1).

The term entropy itself refers to randomness, so a large value means your prediction is far off from the real labels.

So the weights are adjusted to reduce CE, which reduces the difference between the predictions and the true labels and thus improves accuracy.

Harsh Malra
3

Adding to the above posts, the simplest form of cross-entropy loss is known as binary cross-entropy (used as the loss function for binary classification, e.g., with logistic regression), whereas the generalized version is categorical cross-entropy (used as the loss function for multi-class classification problems, e.g., with neural networks).

The idea remains the same:

  1. when the model-computed (softmax) class probability becomes close to 1 for the target label of a training instance (represented, e.g., with one-hot encoding), the corresponding CCE loss decreases to zero

  2. otherwise it increases as the predicted probability corresponding to the target class becomes smaller.

The following figure demonstrates the concept (notice from the figure that BCE becomes low when both y and p are high or both are low simultaneously, i.e., they agree); a small numerical sketch follows the figure:

[Figure: binary cross-entropy loss as a function of the true label y and the predicted probability p]
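As a small numerical check of that behaviour (the y and p values below are made-up illustrations, and `bce` is just a helper defined here, not a library function):

import numpy as np

def bce(y, p):
    # Binary cross-entropy for a single example: -(y*log(p) + (1-y)*log(1-p))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(bce(1, 0.99))  # ~0.01 : y high, p high -> agreement, small loss
print(bce(0, 0.01))  # ~0.01 : y low,  p low  -> agreement, small loss
print(bce(1, 0.01))  # ~4.61 : disagreement -> large loss
print(bce(0, 0.99))  # ~4.61 : disagreement -> large loss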

Cross-entropy is closely related to relative entropy, or KL divergence, which computes the distance between two probability distributions. For example, for two discrete pmfs, the relation between them is shown in the following figure:

[Figure: H(p, q) = H(p) + D_KL(p ‖ q), i.e., cross-entropy equals the entropy of p plus the KL divergence from p to q]
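A quick numerical check of that relation (the two pmfs below are made up for illustration):

import numpy as np

p = np.array([0.2, 0.3, 0.5])   # a made-up "true" pmf
q = np.array([0.1, 0.6, 0.3])   # a made-up "predicted" pmf

cross_entropy = -np.sum(p * np.log(q))
entropy       = -np.sum(p * np.log(p))
kl_divergence =  np.sum(p * np.log(p / q))

print(cross_entropy)              # ~1.2158
print(entropy + kl_divergence)    # same value: H(p, q) = H(p) + KL(p || q)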

Sandipan Dey