
I'm training a model with TensorFlow 2.0. The images in my training set have different resolutions. The model I've built can handle variable resolutions (convolutional layers followed by global average pooling). My training set is very small, and I want to use the whole training set in a single batch.

Since my images have different resolutions, I can't use model.fit(). So I'm planning to pass each sample through the network individually, accumulate the errors/gradients, and then apply one optimizer step. I'm able to compute the loss values, but I don't know how to accumulate the losses/gradients. How can I accumulate the losses/gradients and then apply a single optimizer step?

Code:

for i in range(num_epochs):
    print(f'Epoch: {i + 1}')
    total_loss = 0
    for j in tqdm(range(num_samples)):
        sample = samples[j]
        with tf.GradientTape() as tape:
            prediction = self.model(sample)
            loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
        gradients = tape.gradient(loss_value, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        total_loss += loss_value

    epoch_loss = total_loss / num_samples
    print(f'Epoch loss: {epoch_loss}')
Nagabhushan S N

2 Answers


If I understand correctly from this statement:

How can I accumulate the losses/gradients and then apply a single optimizer step?

@Nagabhushan is trying to accumulate gradients and then apply the optimization step on the (mean) accumulated gradient. The answer provided by @TensorflowSupport does not answer it. In order to perform the optimization only once and accumulate the gradients from several tapes, you can do the following:

for i in range(num_epochs):
    print(f'Epoch: {i + 1}')
    total_loss = 0

    # get trainable variables
    train_vars = self.model.trainable_variables
    # Create empty gradient list (not a tf.Variable list)
    accum_gradient = [tf.zeros_like(this_var) for this_var in train_vars]

    for j in tqdm(range(num_samples)):
        sample = samples[j]
        with tf.GradientTape() as tape:
            prediction = self.model(sample)
            loss_value = self.loss_function(y_true=labels[j], y_pred=prediction)
        total_loss += loss_value

        # get gradients of this tape
        gradients = tape.gradient(loss_value, train_vars)
        # Accumulate the gradients
        accum_gradient = [(accum_grad + grad) for accum_grad, grad in zip(accum_gradient, gradients)]


    # Now, after executing all the tapes you need, apply the optimization step
    # (but first take the average of the gradients)
    accum_gradient = [this_grad / num_samples for this_grad in accum_gradient]
    # apply optimization step
    self.optimizer.apply_gradients(zip(accum_gradient, train_vars))

    epoch_loss = total_loss / num_samples
    print(f'Epoch loss: {epoch_loss}')

Using tf.Variable() should be avoided inside the training loop, since it will produce errors when the code is executed as a graph. If you create a tf.Variable() inside your training function and then decorate it with "@tf.function" or apply "tf.function(my_train_fcn)" to obtain a graph function (i.e. for improved performance), the execution will raise an error. This happens because tracing the tf.Variable creation results in different behaviour than in eager execution (re-utilization vs. creation, respectively). You can find more info on this in the TensorFlow help page.
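
A minimal sketch of that failure mode (the function name and shapes are purely illustrative, and the exact error message varies by TF version):

import tensorflow as tf

@tf.function
def bad_step(x):
    # A new tf.Variable is created every time this function is traced.
    v = tf.Variable(tf.zeros_like(x))
    return v.assign_add(x)

bad_step(tf.ones(3))   # first trace: variable creation is allowed here
bad_step(tf.ones(4))   # new input shape forces a retrace -> ValueError

Creating the accumulators once, outside the traced function (as in the code above), avoids this.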

Ramiro R.C.
  • In `[accum_gradient.append(tf.zeros_like(this_var)) for this_var in train_vars]`, did you add the enclosing square brackets by mistake? – Nagabhushan S N Jul 02 '20 at 10:31
  • Can you elaborate on why `tf.Variable()` should be avoided inside the training loop? - 1. why did it even come up? 2. Some references would be good. – Nagabhushan S N Jul 02 '20 at 10:31
  • Hey! Thank you for the approach, I've found it very useful. One issue I noted is, when I set up my code as shown above, I hit memory issues halfway through the range of num_samples on the **second** epoch. This seems odd to me since accum_gradient is recreated after each epoch. Would you agree, or do you expect the memory usage to increase like this? Not sure if this could be related, but the only difference I can see between my code and the above is that I have `tf.config.run_functions_eagerly(True)` – A_Murphy Jun 22 '22 at 13:52
  • 2
    Hi A_Murphy, The memory usage should not increase, I have used this approach for training in large models for many steps and did not found any memory leak. However I have never used it for training in eager mode, there might be some issue with variables being re-created and not being deleted. Eager mode enables the variable creation inside the loop, in graph mode this cannot happen. – Ramiro R.C. Jun 22 '22 at 14:35
  • Hey Ramiro, thanks for that! I'll try avoid eager mode and see if that helps – A_Murphy Jun 22 '22 at 15:20
  • I'm wondering: if the gradient accumulation code (inside the epochs loop) were run in a @tf.function, would the memory grow with num_samples (when unrolling the graph over the `for j in tqdm(range(num_samples)):` loop)? – Visionscaper Sep 18 '22 at 17:06

In line with the Stack Overflow answer and the explanation provided on the TensorFlow website, below is code for accumulating gradients in TensorFlow 2.0:

def train(epochs):
  for epoch in range(epochs):
    for (batch, (images, labels)) in enumerate(dataset):
      with tf.GradientTape() as tape:
        logits = mnist_model(images, training=True)
        tvs = mnist_model.trainable_variables
        accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
        zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]
        loss_value = loss_object(labels, logits)

      loss_history.append(loss_value.numpy().mean())
      grads = tape.gradient(loss_value, tvs)
      accum_ops = [accum_vars[i].assign_add(grad) for i, grad in enumerate(grads)]

    optimizer.apply_gradients(zip(grads, mnist_model.trainable_variables))
    print('Epoch {} finished'.format(epoch))

# Call the above function
train(epochs=3)

Complete code can be found in this Github Gist.

  • Shouldn't `accum_vars` be passed to the `apply_gradients()` function? Like `optimizer.apply_gradients(zip(accum_vars, mnist_model.trainable_variables))`. As far as I understand, `accum_vars[i].assign_add(grad)` adds `grad` to `accum_vars[i]`. So at the end, `accum_vars` has the accumulated gradients and `grads` only has the last batch's gradients. – Nagabhushan S N Feb 12 '20 at 08:08
  • @NagabhushanSN, I think they are training the model as normal, but are only accumulating for the purposes of model analysis. If you wanted to accumulate the gradients for mini batches, you are correct. You would need to move the accum_vars outside of the last for loop. Although, I'm not sure if you would need to average the gradients together before applying the gradients.. – targetXING Jun 03 '20 at 03:01
  • @FreedomToWin thanks. But even then the code appears wrong. They're not applying the gradients from each batch; the gradients are applied only after the completion of the batch for loop, which means only the gradients from the last batch are applied. And what kind of analysis do you mean? Can you refer me to any articles that analyse gradients from each batch in each epoch? Just curious to see. Thanks! – Nagabhushan S N Jun 03 '20 at 13:20
  • I've heard of using the average gradient of the prediction w.r.t. the inputs as feature importance. You are right about the code being wrong. Why not just accumulate the losses instead for training? – targetXING Jun 04 '20 at 18:28
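
For reference, a hedged sketch of what the commenters suggest the corrected version would look like (`mnist_model`, `dataset`, `loss_object`, and `optimizer` are carried over from the answer above; the zeroing and averaging details are an assumption about the intended behaviour, not the answerer's code):

def train_accumulated(epochs):
  tvs = mnist_model.trainable_variables
  # Accumulators created once, outside the loop (see the first answer).
  accum_vars = [tf.Variable(tf.zeros_like(tv), trainable=False) for tv in tvs]

  for epoch in range(epochs):
    # Zero the accumulators at the start of each accumulation window.
    for av in accum_vars:
      av.assign(tf.zeros_like(av))

    num_batches = 0
    for (batch, (images, labels)) in enumerate(dataset):
      with tf.GradientTape() as tape:
        logits = mnist_model(images, training=True)
        loss_value = loss_object(labels, logits)
      grads = tape.gradient(loss_value, tvs)
      # Accumulate into the variables instead of keeping only the last grads.
      for av, grad in zip(accum_vars, grads):
        av.assign_add(grad)
      num_batches += 1

    # Apply the (mean) accumulated gradients, as the comments suggest.
    optimizer.apply_gradients(zip([av / num_batches for av in accum_vars], tvs))
    print('Epoch {} finished'.format(epoch))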