
Let's say I want to solve a multi-label problem using neural networks and Keras.

The outputs are typically of the form y=[0, 1, 0, 1, 0, 0], and it's easily possible to train a network using binary cross entropy and sigmoids for the outputs (e.g. see code below).

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# Add 1 hidden layer with 6 neurons and relu activation
# (input_dim must match the number of features in xtrain)
model.add(Dense(6, activation='relu', input_dim=xtrain.shape[1]))
# Here we specify that we have 6 outputs
# and we want each output to be in [0, 1]
model.add(Dense(6, activation='sigmoid'))
model.compile(optimizer='Adam', loss='binary_crossentropy')
model.fit(xtrain, ytrain, batch_size=128)

When I do the fit on the last line, what really happens implementation-wise?

  1. Is the network updated multiple times? One time after computing the error on each of the 6 outputs, propagating it back to update the weights?

  2. Does it compute the error for each of the outputs separately, and then make one overall update of the network?

Edit: question updated after Daniel Möller's answer

model.fit(xtrain, ytrain, batch_size=1)

My question is probably clearer with a batch_size of 1.

At each iteration, we pick 1 example from the training set and feed it forward. Then, we compute the error made on each output. In this case, the questions are the following:

For the weights that are not shared across outputs (the weights from the hidden layer to the outputs), are they updated based on the model's error computed as the sum of the errors over ALL outputs, or only based on the error of one specific output?

Are the model weights updated once, based on the sum of the errors, or is the model updated multiple times, based on the individual errors made on each output?
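To make the two alternatives concrete, here is a minimal numpy sketch of the per-output errors I am referring to (the prediction values are made up purely for illustration):

import numpy as np

y_true = np.array([0., 1., 0., 1., 0., 0.])          # one training example, 6 labels
y_pred = np.array([0.2, 0.7, 0.1, 0.6, 0.3, 0.2])    # hypothetical sigmoid outputs

# per-output binary cross entropy terms
per_output = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
total_error = per_output.sum()

# Is there one weight update driven by total_error,
# or six updates, one per entry of per_output?
print(per_output, total_error)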


2 Answers


For all practical purposes, it should be seen as one huge matrix operation.

The network is updated once after each batch is processed. So, it is neither 1 nor 2.

It's option 3: it computes the error for the entire batch at once, as a matrix operation, and then makes one overall update to all the weight matrices. There will still be multiple updates, though, since you will have multiple batches of size 128.

Y is usually of the form:

[
    [1,0,0,1,0,0],
    [1,0,0,1,0,0],
    [0,0,0,1,1,0],
    [1,0,1,1,0,0]
]

A batch of outputs.
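As a rough illustration of how such a batch of outputs collapses into a single scalar loss before any weights are touched (a numpy sketch: the mean-over-labels-then-batch reduction mirrors what Keras does for binary_crossentropy, while the prediction values are random stand-ins):

import numpy as np

y_true = np.array([[1, 0, 0, 1, 0, 0],
                   [1, 0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 1, 0],
                   [1, 0, 1, 1, 0, 0]], dtype=float)

# stand-in for the sigmoid outputs of the network, clipped for numerical safety
y_pred = np.clip(np.random.rand(4, 6), 1e-7, 1 - 1e-7)

# element-wise binary cross entropy, shape (batch, labels)
bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# average over labels, then over the batch: one scalar, one update per batch
loss = bce.mean(axis=-1).mean()
print(loss)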


Whatever it does internally, whether loops or anything else necessary to perform the matrix calculations, is invisible and inaccessible to us.

Daniel Möller

I'd like to add to Daniel's answer that binary_crossentropy corresponds to the tf.nn.sigmoid_cross_entropy_with_logits op in tensorflow (see this question for details). The per-label losses it produces are computed element-wise with a single numerically stable formula and are then reduced to one scalar, so the individual labels never drive separate weight updates.

Here's the source code:

def binary_crossentropy(target, output, from_logits=False):
  """Binary crossentropy between an output tensor and a target tensor.

  Arguments:
      target: A tensor with the same shape as `output`.
      output: A tensor.
      from_logits: Whether `output` is expected to be a logits tensor.
          By default, we consider that `output`
          encodes a probability distribution.

  Returns:
      A tensor.
  """
  # Note: nn.sigmoid_cross_entropy_with_logits
  # expects logits, Keras expects probabilities.
  if not from_logits:
    # transform back to logits
    epsilon_ = _to_tensor(epsilon(), output.dtype.base_dtype)
    output = clip_ops.clip_by_value(output, epsilon_, 1 - epsilon_)
    output = math_ops.log(output / (1 - output))
  return nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)

Keras then averages this over the labels and over the batch, so all gradient updates are based on a single reduced loss value. Theano's T.nnet.binary_crossentropy function and the CNTK backend work the same way.
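To see that single value appear on the Keras side, here is a small sketch (the tensors are made-up examples): K.binary_crossentropy gives the per-label values, the loss wrapper averages them over the last axis, and the training loop then averages over the batch, leaving one scalar for the optimizer to differentiate.

from keras import backend as K

y_true = K.constant([[0., 1., 0., 1., 0., 0.],
                     [1., 0., 0., 0., 1., 0.]])
y_pred = K.constant([[0.1, 0.8, 0.2, 0.7, 0.1, 0.3],
                     [0.6, 0.2, 0.1, 0.2, 0.9, 0.1]])

per_label = K.binary_crossentropy(y_true, y_pred)   # shape (batch, 6)
per_sample = K.mean(per_label, axis=-1)             # average over the 6 labels
scalar_loss = K.mean(per_sample)                    # average over the batch
print(K.eval(scalar_loss))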

Maxim