I want to use TensorFlow's custom training loop for my model but, due to memory constraints, I can only pass a small number of samples (mini-batches) through in one go. How do I train on these mini-batches and sensibly aggregate the gradients over the whole batch on one machine (GPU/CPU)? See the example below with code from here - note this example doesn't hit memory issues at this batch size, but it illustrates what I'm trying to do:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

#simple MNIST model
inputs = keras.Input(shape=(784,), name="digits")
x1 = layers.Dense(64, activation="relu")(inputs)
x2 = layers.Dense(64, activation="relu")(x1)
outputs = layers.Dense(10, name="predictions")(x2)
model = keras.Model(inputs=inputs, outputs=outputs)

# Instantiate an optimizer.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Instantiate a metric to track training accuracy (used in train_step below).
train_acc_metric = keras.metrics.SparseCategoricalAccuracy()

# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = np.reshape(x_train, (-1, 784))
x_test = np.reshape(x_test, (-1, 784))

# Reserve 10,000 samples for validation.
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]

# Prepare the training dataset.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

# Prepare the validation dataset.
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(batch_size)

If the full 64-sample batch fit in memory in one go, we could simply use:

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

import time

epochs = 10
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    start_time = time.time()

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        loss_value = train_step(x_batch_train, y_batch_train)

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %d samples" % ((step + 1) * batch_size))

However, how do I update train_step so that it takes four mini-batch runs of size 16 (for example), making up the full batch size of 64, to handle my more memory-intensive data, and then aggregates the gradients to update the model?

I tried just writing a loop inside the `with tf.GradientTape() as tape:` block and stacking the loss results, but I don't think this is the correct approach.
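For illustration, my attempt looked roughly like this (splitting the batch into four is just an example); since all the forward passes are recorded on the same tape, I suspect it doesn't actually reduce peak memory:

@tf.function
def train_step_attempt(x, y, num_splits=4):
    # All forward passes happen under one tape, so their activations are
    # all kept around until tape.gradient is called.
    with tf.GradientTape() as tape:
        losses = []
        for x_mb, y_mb in zip(tf.split(x, num_splits), tf.split(y, num_splits)):
            logits = model(x_mb, training=True)
            losses.append(loss_fn(y_mb, logits))
        loss_value = tf.reduce_mean(tf.stack(losses))
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss_value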

I also thought about using tf.distribute.Strategy, but my understanding is that this is only for training across multiple machines or GPUs, so I don't see how I could use it here.

To summarise, what I want to do is agnostic to the dataset and model architecture. I guess I am looking for a Gradient AllReduce approach which, instead of splitting the mini-batches across different machines, just runs them iteratively. So it would need to:

  1. Compute the gradient using a mini-batch.
  2. Compute the mean of the gradients from all mini-batches, using an AllReduce collective-style approach.
  3. Update the model with the averaged gradient.

I assume this approach of applying the mean of the gradients would be far less memory intensive than applying all the gradients as discussed here. A rough sketch of the kind of loop I have in mind is below.
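To make the three steps above concrete, here is a minimal sketch of the structure I'm thinking of (the helper names accumulate_gradients and train_step_accumulated are my own, not a TensorFlow API, and I've left out how the dataset gets grouped into sets of four mini-batches):

num_accum_steps = 4  # four mini-batches of 16 make up an effective batch of 64

@tf.function
def accumulate_gradients(x, y):
    # Step 1: forward and backward pass on a single mini-batch only.
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    train_acc_metric.update_state(y, logits)
    return loss_value, grads

def train_step_accumulated(mini_batches):
    # mini_batches: an iterable of num_accum_steps (x, y) pairs of size 16.
    accum_grads = [tf.zeros_like(w) for w in model.trainable_weights]
    total_loss = 0.0
    for x_mb, y_mb in mini_batches:
        loss_value, grads = accumulate_gradients(x_mb, y_mb)
        accum_grads = [a + g for a, g in zip(accum_grads, grads)]
        total_loss += loss_value
    # Step 2: average the summed gradients (the "AllReduce mean", done
    # sequentially on one device rather than across workers).
    mean_grads = [g / num_accum_steps for g in accum_grads]
    # Step 3: a single optimizer update for the whole 64-sample batch.
    optimizer.apply_gradients(zip(mean_grads, model.trainable_weights))
    return total_loss / num_accum_steps

In the training loop above this would mean batching train_dataset to 16 and calling train_step_accumulated once for every four mini-batches. Is something along these lines the right way to do it, or is there a built-in mechanism I'm missing?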

  • What is your question exactly? Does training with `mnist` give you a memory overflow? – I'mahdi Jun 21 '22 at 07:53
  • Or do you want an example of using `with tf.GradientTape() as tape`? – I'mahdi Jun 21 '22 at 07:54
  • The model here is just to explain the issue; it doesn't give a memory overflow, but my actual model and dataset do. The question is how to deal with memory overflow within `tf.GradientTape()`. – A_Murphy Jun 21 '22 at 08:07
  • Exactly, I and other users reading your question can't see a problem. Please ask your exact question: explain the shape of your images and send your image preprocessing and the model you are trying to train, then maybe we can help you. – I'mahdi Jun 21 '22 at 08:38
  • And if you want to train with `tf.GradientTape()`, explain your images and dataset and ask how you can run `tf.GradientTape()` on your datasets. – I'mahdi Jun 21 '22 at 08:40
  • Hi, I have updated the question to make it more clear. What I want to do is agnostic to the dataset and model architecture, so there is no point in me including the data you ask for. All I am looking for is a method of running `optimizer.apply_gradients` on all steps run through `model(x, training=True)` in an aggregated manner, so that `optimizer.apply_gradients` is applied to the full batch size. This is necessary as, with my dataset, I can't run the full batch size through `model(x, training=True)` in one go. – A_Murphy Jun 21 '22 at 09:41
  • Do you want to set batch_size to 16 instead of 64? – I'mahdi Jun 21 '22 at 10:16
  • I used 16 as an example, so yes, I would like to use a batch size of 16 but update gradients based on the whole 64 (i.e. give the same result as using a batch size of 64). – A_Murphy Jun 21 '22 at 10:25
  • I'm sure if you write `train_dataset = train_dataset.batch(16)`, your dataset goes to batch_size = 16. I will send you an example as an answer. – I'mahdi Jun 21 '22 at 10:28
  • That is not what I'm asking; that will give a different result since you are updating gradients after seeing 16 samples rather than 64. I want to update gradients after the full 64 batch size, but due to memory issues I want to just pass 16 samples at a time. – A_Murphy Jun 21 '22 at 10:36
  • Perhaps you can do gradient accumulation. This link [how to accumulate gradients in tensorflow 2](https://stackoverflow.com/questions/59893850/how-to-accumulate-gradients-in-tensorflow-2-0) might be useful – elbe Jun 21 '22 at 13:24
  • Hi elbe, yes, this is more what I'm looking for, but it runs into the same issue when you apply the gradients with your chosen optimiser. I guess what I actually want is a Gradient AllReduce approach that can be applied on a single GPU/CPU, running each mini-batch in a loop. I will update my question to make more sense along these lines. – A_Murphy Jun 21 '22 at 16:17
  • Whatever the optimization is, you have to pass each batch to the model and compute the gradient. Accumulation of the gradients allows you to simulate the optimization of a larger batch: you put one smaller batch plus the accumulator in memory (GPU). You could do exactly the same thing and sum the gradients of all the batches, i.e. compute all the gradients with an accumulation strategy and then do a single pass of optimization. However, this goes against the philosophy of optimization in deep learning and may lead you to a bad local minimum. Why do you want to compute the gradient over all the data? – elbe Jun 22 '22 at 12:22
  • I suppose saying I want to compute the gradient across the whole batch was a poor example. In truth I want to do it for a larger mini-batch than is possible given available memory. My problem with the approach of passing through fewer samples and then optimising on the average (as explained in your link) is that the `apply_gradients` call seems to be as memory intensive as before. – A_Murphy Jun 22 '22 at 13:32
