
I recently came across tf.nn.sparse_softmax_cross_entropy_with_logits and I cannot figure out what the difference is compared to tf.nn.softmax_cross_entropy_with_logits.

Is the only difference that training vectors y have to be one-hot encoded when using sparse_softmax_cross_entropy_with_logits?

Reading the API, I was unable to find any other difference compared to softmax_cross_entropy_with_logits. But why do we need the extra function then?

Shouldn't softmax_cross_entropy_with_logits produce the same results as sparse_softmax_cross_entropy_with_logits, if it is supplied with one-hot encoded training data/vectors?

daniel451
  • I'm interested in seeing a comparison of their performance if both can be used (e.g. with exclusive image labels); I'd expect the sparse version to be more efficient, at least memory-wise. – Yibo Yang Jun 07 '17 at 20:17
  • See also [this question](https://stackoverflow.com/q/47034888/712995), which discusses *all cross-entropy functions* in tensorflow (turns out there are lots of them). – Maxim Nov 11 '17 at 15:26

3 Answers


Having two different functions is a convenience, as they produce the same result.

The difference is simple:

  • For sparse_softmax_cross_entropy_with_logits, labels must have the shape [batch_size] and the dtype int32 or int64. Each label is an int in range [0, num_classes-1].
  • For softmax_cross_entropy_with_logits, labels must have the shape [batch_size, num_classes] and dtype float32 or float64.

The labels used in softmax_cross_entropy_with_logits are the one-hot version of the labels used in sparse_softmax_cross_entropy_with_logits.

Another tiny difference is that with sparse_softmax_cross_entropy_with_logits, you can give -1 as a label to get a loss of 0 for that label.
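
For example, here is a minimal sketch of the two label formats side by side (assuming the TF 1.x API used in the answers below, with a made-up batch of 2 examples and 3 classes):

import tensorflow as tf

# Logits for a batch of 2 examples and 3 classes: shape [batch_size, num_classes].
logits = tf.constant([[2.0, 0.5, 1.0],
                      [0.1, 3.0, 0.2]])

# Sparse labels: shape [batch_size], one integer class index per example.
sparse_labels = tf.constant([0, 1], dtype=tf.int64)

# Dense labels: shape [batch_size, num_classes], the one-hot rows of the same classes.
dense_labels = tf.one_hot(sparse_labels, depth=3)

sparse_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=sparse_labels, logits=logits)
dense_loss = tf.nn.softmax_cross_entropy_with_logits(
    labels=dense_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run([sparse_loss, dense_loss]))  # both loss vectors are identical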

Olivier Moindrot
  • Is the -1 correct? As the documentation reads: "Each entry in labels must be an index in [0, num_classes). Other values will raise an exception when this op is run on CPU, and return NaN for corresponding loss and gradient rows on GPU." – Reddspark Aug 13 '17 at 05:32
  • [0, num_classes) = [0, num_classes-1] – Karthik C Mar 10 '19 at 18:47
  • Is this statement correct? "Labels used in softmax_cross_entropy_with_logits are the one hot version of labels used in sparse_softmax_cross_entropy_with_logits." Is it backwards? Isn't the sparse loss function the one with int of 0, so isn't the sparse one the one-hot version? – brianlen Sep 22 '20 at 02:15

I would just like to add two things to the accepted answer that you can also find in the TF documentation.

First:

tf.nn.softmax_cross_entropy_with_logits

NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.

Second:

tf.nn.sparse_softmax_cross_entropy_with_logits

NOTE: For this operation, the probability of a given label is considered exclusive. That is, soft classes are not allowed, and the labels vector must provide a single specific index for the true class for each row of logits (each minibatch entry).
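
As a small illustration of that restriction (a sketch assuming the TF 1.x API and made-up numbers): soft targets such as [0.3, 0.7] are only expressible with softmax_cross_entropy_with_logits, while sparse_softmax_cross_entropy_with_logits takes exactly one class index per row.

import tensorflow as tf

logits = tf.constant([[1.0, 2.0]])  # one example, two classes

# Soft targets: any valid probability distribution per row is accepted here.
soft_labels = tf.constant([[0.3, 0.7]])
soft_loss = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)

# The sparse op cannot express "30% class 0, 70% class 1";
# it only takes a single integer index per row.
hard_labels = tf.constant([1])
hard_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=hard_labels, logits=logits)

with tf.Session() as sess:
    print(sess.run([soft_loss, hard_loss]))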

Drag0
  • What should we use if the classes are not mutually exclusive. I mean if we're combining multiple categorical labels? – Hayro Feb 23 '17 at 03:03
  • I also read this. So it means we apply the class probability on the cross entropy rather than taking it as a onehot vector. – Shamane Siriwardhana Mar 20 '17 at 07:26
  • @Hayro - Do you mean you are unable to do one hot encoding? I think you would have to look at a different model. [This](http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression) mentioned something like " it would be more appropriate to build 4 binary logistic regression classifiers" To first make sure you can separate the classes. – ashley May 19 '17 at 14:50

Both functions compute the same result; sparse_softmax_cross_entropy_with_logits simply computes the cross entropy directly on the sparse (integer) labels instead of converting them to a one-hot encoding first.

You can verify this by running the following program:

import tensorflow as tf
from random import randint

dims = 8
pos  = randint(0, dims - 1)

# Random logits of length `dims` and a one-hot label vector with a 1 at index `pos`.
logits = tf.random_uniform([dims], maxval=3, dtype=tf.float32)
labels = tf.one_hot(pos, dims)

# The dense version takes the one-hot vector; the sparse version takes the integer index.
res1 = tf.nn.softmax_cross_entropy_with_logits(       logits=logits, labels=labels)
res2 = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=tf.constant(pos))

with tf.Session() as sess:
    a, b = sess.run([res1, res2])
    print(a, b)
    print(a == b)

Here I create a random logits vector of length dims and generate one-hot encoded labels (where the element at index pos is 1 and all others are 0).

After that I calculate the softmax and sparse softmax cross-entropies and compare their outputs. Try rerunning it a few times to make sure that it always produces the same output.
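
If you are on TF 2.x (an assumption; the snippet above targets the TF 1.x graph/session API), roughly the same check can be run eagerly:

import tensorflow as tf
from random import randint

dims = 8
pos = randint(0, dims - 1)

logits = tf.random.uniform([dims], maxval=3, dtype=tf.float32)
labels = tf.one_hot(pos, dims)

# Eager execution: no Session needed, the ops return concrete tensors directly.
res1 = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
res2 = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=pos, logits=logits)

print(res1.numpy(), res2.numpy())  # the two values should match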

Salvador Dali