
Context

Suppose we have some 1D data (e.g. time series), where all series have fixed length l:

        # [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11] index
example = [ 0,  1,  1,  0, 23, 22, 20, 14,  9,  2,  0,  0] # l = 12

and we want to perform semantic segmentation, with n classes:

          # [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]    index            
labeled = [
            [ 0,  1,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0], # class 1
            [ 0,  0,  0,  0,  1,  1,  1,  1,  0,  0,  0,  0], # class 2
            [ 0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  0,  0], # class 3
           #[                     ...                      ],
            [ 1,  1,  1,  0,  0,  0,  0,  0,  1,  1,  1,  1], # class n
 ]

then the output for a single example has shape [n, l] (i.e. the data_format is 'channels_first', not 'channels_last') and the batched output has shape [b, n, l], where b is the number of examples in the batch.

These classes are independent (a position may belong to several classes at once, as in the example above), so it is my understanding that sigmoid cross entropy, rather than softmax cross entropy, is the applicable loss here.
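
To make the independence point concrete, here is a small NumPy-only sketch with made-up numbers (not part of the actual pipeline): a per-element sigmoid lets several classes be active at the same position, whereas a softmax over the class axis forces the classes to compete.

import numpy as np

logits = np.array([[ 2.0, -1.0,  0.5],   # class 1 scores at 3 positions
                   [-0.5,  1.5, -2.0]])  # class 2 scores at the same positions

# sigmoid: every (class, position) entry gets its own independent probability
sigmoid_probs = 1.0 / (1.0 + np.exp(-logits))

# softmax over the class axis: the probabilities at each position must sum to 1,
# so multi-label targets (overlapping classes) cannot be represented
exp_shifted = np.exp(logits - logits.max(axis=0, keepdims=True))
softmax_probs = exp_shifted / exp_shifted.sum(axis=0, keepdims=True)

print(sigmoid_probs.sum(axis=0))  # generally != 1
print(softmax_probs.sum(axis=0))  # always 1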


Question

I have a few small related questions regarding the expected input format for, and use of, tf.nn.sigmoid_cross_entropy_with_logits:

  1. since the network outputs a tensor with the same shape as the batched labels, should I train the network under the assumption that it outputs logits, or take the Keras approach (see Keras's binary_crossentropy) and assume it outputs probabilities?

  2. given the 1D segmentation problem, should I call tf.nn.sigmoid_cross_entropy_with_logits with tensors laid out as:

    • data_format='channels_first' (as shown above), or
    • data_format='channels_last' (example.T)

    if I want the labels to be assigned independently per channel?

  3. should the loss operation passed to the optimizer be:

    • tf.nn.sigmoid_cross_entropy_with_logits(labels, logits),
    • tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels, logits)), or
    • tf.losses.sigmoid_cross_entropy?

Code

This Colab highlights my confusion and seems to show that the data_format does in fact matter, but the documentation does not explicitly state which layout is expected.

Dummy data

import random

import numpy as np
import tensorflow as tf  # TF 1.x

c = 5  # number of channels (label classes)
p = 10 # number of positions ('pixels')


# data_format = 'channels_first', shape = [classes, pixels]
# 'logits' for 2 examples
pred_1 = np.array([[random.random() for v in range(p)] for n in range(c)]).astype(float)
pred_2 = np.array([[random.random() for v in range(p)] for n in range(c)]).astype(float)

# 'ground truth' for the above 2 examples
targ_1 = np.array([[0 if random.random() < 0.8 else 1 for v in range(p)] for n in range(c)]).astype(float)
targ_2 = np.array([[0 if random.random() < 0.8 else 1 for v in range(p)] for n in range(c)]).astype(float)

# batched form of the above examples
preds = np.array([pred_1, pred_2])
targs = np.array([targ_1, targ_2])


# data_format = 'channels_last', shape = [pixels, classes]
t_pred_1 = pred_1.T
t_pred_2 = pred_2.T
t_targ_1 = targ_1.T
t_targ_2 = targ_2.T

t_preds = np.array([t_pred_1, t_pred_2])
t_targs = np.array([t_targ_1, t_targ_2])
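
For reference, a quick shape check of the two layouts above (nothing new, just making the conventions explicit):

print(preds.shape, targs.shape)      # (2, 5, 10) -> [batch, classes, pixels], 'channels_first'
print(t_preds.shape, t_targs.shape)  # (2, 10, 5) -> [batch, pixels, classes], 'channels_last'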

Losses

tf.nn

# calculate individual losses for 'channels_first'
loss_1 = tf.nn.sigmoid_cross_entropy_with_logits(labels=targ_1, logits=pred_1)
loss_2 = tf.nn.sigmoid_cross_entropy_with_logits(labels=targ_2, logits=pred_2)
# calculate batch loss for 'channels_first'
b_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targs, logits=preds)

# calculate individual losses for 'channels_last'
t_loss_1 = tf.nn.sigmoid_cross_entropy_with_logits(labels=t_targ_1, logits=t_pred_1)
t_loss_2 = tf.nn.sigmoid_cross_entropy_with_logits(labels=t_targ_2, logits=t_pred_2)
# calculate batch loss for 'channels_last'
t_b_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=t_targs, logits=t_preds)
# get actual tensors
with tf.Session() as sess:
  # loss for 'channels_first'
  l1   = sess.run(loss_1)
  l2   = sess.run(loss_2)
  # batch loss for 'channels_first'
  bl   = sess.run(b_loss)

  # loss for 'channels_last'
  t_l1 = sess.run(t_loss_1)
  t_l2 = sess.run(t_loss_2)

  # batch loss for 'channels_last'
  t_bl = sess.run(t_b_loss)

tf.reduce_mean(tf.nn)

# calculate individual losses for 'channels_first'
rm_loss_1 = tf.reduce_mean(loss_1)
rm_loss_2 = tf.reduce_mean(loss_2)
# calculate batch loss for 'channels_first'
rm_b_loss = tf.reduce_mean(b_loss)

# calculate individual losses for 'channels_last'
rm_t_loss_1 = tf.reduce_mean(t_loss_1)
rm_t_loss_2 = tf.reduce_mean(t_loss_2)
# calculate batch loss for 'channels_last'
rm_t_b_loss = tf.reduce_mean(t_b_loss)
# get actual tensors
with tf.Session() as sess:
  # loss for 'channels_first'
  rm_l1   = sess.run(rm_loss_1)
  rm_l2   = sess.run(rm_loss_2)
  # batch loss for 'channels_first'
  rm_bl   = sess.run(rm_b_loss)

  # loss for 'channels_last'
  rm_t_l1 = sess.run(rm_t_loss_1)
  rm_t_l2 = sess.run(rm_t_loss_2)

  # batch loss for 'channels_last'
  rm_t_bl = sess.run(rm_t_b_loss)

tf.losses

# calculate individual losses for 'channels_first'
tf_loss_1 = tf.losses.sigmoid_cross_entropy(multi_class_labels=targ_1, logits=pred_1)
tf_loss_2 = tf.losses.sigmoid_cross_entropy(multi_class_labels=targ_2, logits=pred_2)
# calculate batch loss for 'channels_first'
tf_b_loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=targs, logits=preds)

# calculate individual losses for 'channels_last'
tf_t_loss_1 = tf.losses.sigmoid_cross_entropy(multi_class_labels=t_targ_1, logits=t_pred_1)
tf_t_loss_2 = tf.losses.sigmoid_cross_entropy(multi_class_labels=t_targ_2, logits=t_pred_2)
# calculate batch loss for 'channels_last'
tf_t_b_loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=t_targs, logits=t_preds)
# get actual tensors
with tf.Session() as sess:
  # loss for 'channels_first'
  tf_l1   = sess.run(tf_loss_1)
  tf_l2   = sess.run(tf_loss_2)
  # batch loss for 'channels_first'
  tf_bl   = sess.run(tf_b_loss)

  # loss for 'channels_last'
  tf_t_l1 = sess.run(tf_t_loss_1)
  tf_t_l2 = sess.run(tf_t_loss_2)

  # batch loss for 'channels_last'
  tf_t_bl = sess.run(tf_t_b_loss)

Test equivalency

data_format equivalency

# loss _should_(?) be the same for 'channels_first' and 'channels_last' data_format
# test example_1
e1 = (l1 == t_l1.T).all()
# test example 2
e2 = (l2 == t_l2.T).all()

# loss calculated for each example and then batched together should be the same 
# as the loss calculated on the batched examples
ea = (np.array([l1, l2]) == bl).all()
t_ea = (np.array([t_l1, t_l2]) == t_bl).all()

# loss calculated on the batched examples for 'channels_first' should be the same
# as loss calculated on the batched examples for 'channels_last'
eb = (bl == np.transpose(t_bl, (0, 2, 1))).all()


e1, e2, ea, t_ea, eb
# (True, False, False, False, True) <- changes every time, so True is happenstance

Equivalency between tf.reduce_mean and tf.losses

l_e1 = tf_l1 == rm_l1
l_e2 = tf_l2 == rm_l2
l_eb = tf_bl == rm_bl

l_t_e1 = tf_t_l1 == rm_t_l1
l_t_e2 = tf_t_l2 == rm_t_l2
l_t_eb = tf_t_bl == rm_t_bl

l_e1, l_e2, l_eb, l_t_e1, l_t_e2, l_t_eb
# (False, False, False, False, False, False)
SumNeuron
  • I think [this answer](https://stackoverflow.com/a/47238223/2099607) might help you. – today Dec 04 '18 at 15:17
  • @today I read that answer prior, but it still isn’t quite clear to me as the dimension of independence isn’t explicitly demonstrated, and my results in the Colab differ from what that answer suggests – SumNeuron Dec 04 '18 at 15:19

1 Answer


Both tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(...)) and tf.losses.sigmoid_cross_entropy(...) (with default arguments) compute the same thing. The problem is in your tests, where you use == to compare floating-point numbers. Instead, use np.isclose to check whether two floating-point values are (approximately) equal:

# loss _should_(?) be the same for 'channels_first' and 'channels_last' data_format
# test example_1
e1 = np.isclose(l1, t_l1.T).all()
# test example 2
e2 = np.isclose(l2, t_l2.T).all()

# loss calculated for each example and then batched together should be the same 
# as the loss calculated on the batched examples
ea = np.isclose(np.array([l1, l2]), bl).all()
t_ea = np.isclose(np.array([t_l1, t_l2]), t_bl).all()

# loss calculated on the batched examples for 'channels_first' should be the same
# as loss calculated on the batched examples for 'channels_last'
eb = np.isclose(bl, np.transpose(t_bl, (0, 2, 1))).all()


e1, e2, ea, t_ea, eb
# (True, True, True, True, True)

And:

l_e1 = np.isclose(tf_l1, rm_l1)
l_e2 = np.isclose(tf_l2, rm_l2)
l_eb = np.isclose(tf_bl, rm_bl)

l_t_e1 = np.isclose(tf_t_l1, rm_t_l1)
l_t_e2 = np.isclose(tf_t_l2, rm_t_l2)
l_t_eb = np.isclose(tf_t_bl, rm_t_bl)

l_e1, l_e2, l_eb, l_t_e1, l_t_e2, l_t_eb
# (True, True, True, True, True, True)
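
To see why the layout cannot matter at all, here is a small sketch that recomputes the loss per element in plain NumPy, reusing targ_1/pred_1 and the evaluated l1/t_l1 from the question. It uses the numerically stable formula given in the tf.nn.sigmoid_cross_entropy_with_logits documentation, max(x, 0) - x * z + log(1 + exp(-|x|)), which is applied to every element independently:

import numpy as np

def elementwise_sigmoid_ce(z, x):
    # z = labels, x = logits; computed independently for each element
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

manual = elementwise_sigmoid_ce(targ_1, pred_1)

# since the op is purely elementwise, transposing the inputs just transposes the output
print(np.isclose(manual, l1).all())      # True
print(np.isclose(manual.T, t_l1).all())  # True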
today
  • ah, that makes sense... so then why would the `data_format` not matter, if each class should be independent? Or is it that each item in each class is independent? – SumNeuron Dec 04 '18 at 16:54
  • @SumNeuron The sigmoid and cross-entropy loss are computed for each individual element separately (that follows from your premise: each element may belong to multiple classes, and therefore the classes are independent). Therefore, the `data_format` does not matter here. – today Dec 04 '18 at 17:25
  • Just to clarify, should there be an activation function prior to the sigmoid cross entropy loss? Or should there be two output nodes in the graph: one which computes the loss with sigmoid cross entropy and one that returns the sigmoid of the output layer? – SumNeuron Dec 05 '18 at 13:02
  • @SumNeuron Both `tf.nn.sigmoid_cross_entropy_with_logits` and `tf.losses.sigmoid_cross_entropy` first apply the sigmoid (that's why they expect logits as input) and then compute the cross-entropy loss. Therefore you should not apply a sigmoid separately. – today Dec 05 '18 at 13:30
  • ok, now I am a tad confused (and sorry for this). I was reading maxim's answers and he states that the outputs of the network are considered to be "logits" (whereas Keras assumes probabilities by default). If that is the case, then I pass the output layer to sigmoid cross entropy to calculate the loss, but can I not add another output node to the graph which returns the probability? – SumNeuron Dec 05 '18 at 13:36
  • @SumNeuron Of course you can have another layer which applies a sigmoid to get the probabilities. However, I meant that you should not pass the output of that layer to those two functions, which apply the sigmoid themselves and compute the loss. Otherwise, the computed gradients will be wrong. – today Dec 05 '18 at 13:39
  • @SumNeuron Yeah, that's right. And you are welcome :) – today Dec 05 '18 at 14:06
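
A minimal sketch (hypothetical tensor names, TF 1.x graph style) of the arrangement settled on in these comments: the raw output layer is treated as logits, the loss op applies the sigmoid internally, and a separate tf.sigmoid node exposes probabilities for inference without ever being fed into the loss:

import tensorflow as tf

# hypothetical raw network output and labels, shape [batch, classes, positions]
net_logits = tf.placeholder(tf.float32, shape=[None, 5, 10], name="logits")
labels     = tf.placeholder(tf.float32, shape=[None, 5, 10], name="labels")

# training head: the sigmoid is applied inside the loss, so the raw logits go in
loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=net_logits)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

# inference head: a separate node that turns the same logits into probabilities
probs = tf.sigmoid(net_logits, name="probabilities")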