
I recently came across tf.nn.sparse_softmax_cross_entropy_with_logits and I cannot figure out what the difference is compared to tf.nn.softmax_cross_entropy_with_logits.

Is the only difference that training vectors y have to be one-hot encoded when using sparse_softmax_cross_entropy_with_logits?

Reading the API, I was unable to find any other difference compared to softmax_cross_entropy_with_logits. But why do we need the extra function then?

Shouldn't softmax_cross_entropy_with_logits produce the same results as sparse_softmax_cross_entropy_with_logits, if it is supplied with one-hot encoded training data/vectors?

daniel451
  • I'm interested in seeing a comparison of their performance if both can be used (e.g. with exclusive image labels); I'd expect the sparse version to be more efficient, at least memory-wise. – Yibo Yang Jun 07 '17 at 20:17
  • See also [this question](https://stackoverflow.com/q/47034888/712995), which discusses *all cross-entropy functions* in tensorflow (turns out there are lots of them). – Maxim Nov 11 '17 at 15:26

3 Answers


Having two different functions is a convenience, as they produce the same result.

The difference is simple:

  • For sparse_softmax_cross_entropy_with_logits, labels must have the shape [batch_size] and the dtype int32 or int64. Each label is an int in range [0, num_classes-1].
  • For softmax_cross_entropy_with_logits, labels must have the shape [batch_size, num_classes] and dtype float32 or float64.

The labels used in softmax_cross_entropy_with_logits are the one-hot version of the labels used in sparse_softmax_cross_entropy_with_logits.

Another tiny difference is that with sparse_softmax_cross_entropy_with_logits, you can give -1 as a label to get a loss of 0 for that label.
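
For example, here is a minimal sketch of the two label formats side by side (assuming the TF 1.x API used in the answers below, with a made-up batch of 2 examples and 3 classes):

import tensorflow as tf

# Logits for a batch of 2 examples and 3 classes: shape [batch_size, num_classes].
logits = tf.constant([[2.0, 0.5, 1.0],
                      [0.1, 3.0, 0.2]])

# Sparse labels: shape [batch_size], one integer class index per example.
sparse_labels = tf.constant([0, 1], dtype=tf.int64)

# Dense labels: shape [batch_size, num_classes], the one-hot rows of the same classes.
dense_labels = tf.one_hot(sparse_labels, depth=3)

sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sparse_labels, logits=logits)
dense_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=dense_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run([sparse_loss, dense_loss]))  # both loss vectors are identical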

Olivier Moindrot
  • Is the -1 correct? As the documentation reads: "Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this op is run on CPU, and return NaN for corresponding loss and gradient rows on GPU." – Reddspark Aug 13 '17 at 05:32
  • [0, num_classes) = [0, num_classes-1] – Karthik C Mar 10 '19 at 18:47
  • Is this statement correct? "Labels used in softmax_cross_entropy_with_logits are the one hot version of labels used in sparse_softmax_cross_entropy_with_logits." Is it backwards? Isn't the sparse loss function the one with int of 0, so isn't the sparse one the one-hot version? – brianlen Sep 22 '20 at 02:15

I would just like to add two things to the accepted answer that you can also find in the TF documentation.

First:

tf.nn.softmax_cross_entropy_with_logits

NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.

Second:

tf.nn.sparse_softmax_cross_entropy_with_logits

NOTE: For this operation, the probability of a given label is considered exclusive. That is, soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits (each minibatch entry).
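
As a small illustration of that restriction (a sketch assuming the TF 1.x API and made-up numbers): soft targets such as [0.3, 0.7] are only expressible with softmax_cross_entropy_with_logits, while sparse_softmax_cross_entropy_with_logits takes exactly one class index per row.

import tensorflow as tf

logits = tf.constant([[1.0, 2.0]])  # one example, two classes

# Soft targets: any valid probability distribution per row is accepted here.
soft_labels = tf.constant([[0.3, 0.7]])
soft_loss = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)

# The sparse op cannot express "30% class 0, 70% class 1";
# it only takes a single integer index per row.
hard_labels = tf.constant([1])
hard_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=hard_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run([soft_loss, hard_loss]))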

Drag0
  • What should we use if the classes are not mutually exclusive. I mean if we're combining multiple categorical labels? – Hayro Feb 23 '17 at 03:03
  • I also read this. So it means we apply the class probability on the cross entropy rather than taking it as a onehot vector. – Shamane Siriwardhana Mar 20 '17 at 07:26
  • @Hayro - Do you mean you are unable to do one hot encoding? I think you would have to look at a different model. [This](http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression) mentioned something like " it would be more appropriate to build 4 binary logistic regression classifiers" To first make sure you can separate the classes. – ashley May 19 '17 at 14:50

Both functions compute the same result; sparse_softmax_cross_entropy_with_logits simply computes the cross entropy directly on the sparse (integer) labels instead of converting them to a one-hot encoding first.

You can verify this by running the following program:

import tensorflow as tf
from random import randint

dims = 8
pos  = randint(0, dims - 1)

# Random logits of length `dims` and a one-hot label vector with a 1 at index `pos`.
logits = tf.random_uniform([dims], maxval=3, dtype=tf.float32)
labels = tf.one_hot(pos, dims)

# The dense version takes the one-hot vector; the sparse version takes the integer index.
res1 = tf.nn.softmax_cross_entropy_with_logits(       logits=logits, labels=labels)
res2 = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=tf.constant(pos))

with tf.Session() as sess:
    a, b = sess.run([res1, res2])
    print(a, b)
    print(a == b)

Here I create a random logits vector of length dims and generate one-hot encoded labels (where the element at index pos is 1 and all others are 0).

After that I calculate the softmax and sparse softmax cross-entropies and compare their outputs. Try rerunning it a few times to make sure that it always produces the same output.
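
If you are on TF 2.x (an assumption; the snippet above targets the TF 1.x graph/session API), roughly the same check can be run eagerly:

import tensorflow as tf
from random import randint

dims = 8
pos = randint(0, dims - 1)

logits = tf.random.uniform([dims], maxval=3, dtype=tf.float32)
labels = tf.one_hot(pos, dims)

# Eager execution: no Session needed, the ops return concrete tensors directly.
res1 = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
res2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=pos, logits=logits)

print(res1.numpy(), res2.numpy())  # the two values should match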

Salvador Dali