
Please add at least a brief comment with your thoughts so that I can improve my question. Thank you. :-)


I'm trying to train a tf.keras model with gradient accumulation (GA). But I don't want to use it in a custom training loop; instead, I want to customize the .fit() method by overriding train_step. Is it possible? How can this be accomplished? The reason is that if we want the benefit of Keras built-in functionality like fit and callbacks, we don't want to write a custom training loop; but at the same time, if we need to override train_step for some reason (like GA or anything else), we can customize the fit method and still get the leverage of those built-in functions.

And also, I know the pros of using GA, but what are the major cons of using it? Why does it come as an optional feature of the framework rather than a default one?

# Overriding train_step: my attempt.
# It's not correctly implemented yet
# and needs fixing.
import tensorflow as tf
from tensorflow import keras

class CustomTrainStep(keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = n_gradients
        self.gradient_accumulation = [
            tf.zeros_like(this_var) for this_var in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        batch_size = tf.cast(tf.shape(x)[0], tf.float32)  
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(
                y, y_pred, regularization_losses=self.losses
            )
            
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        accum_gradient = [
            (acum_grad+grad) for acum_grad, grad in \
            zip(self.gradient_accumulation, gradients)
        ]
        accum_gradient = [
            this_grad/batch_size for this_grad in accum_gradient
        ]
        
        # apply accumulated gradients
        self.optimizer.apply_gradients(
            zip(accum_gradient, self.trainable_variables)
        )
        # TODO: reset self.gradient_accumulation 
        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

Please run and check it with the following toy setup.

# Model 
size = 32

input = keras.Input(shape=(size, size, 3))
base_model = keras.applications.DenseNet121(
    weights=None,
    include_top=False,
    input_tensor=input
)
base_maps = keras.layers.GlobalAveragePooling2D()(base_model.output)
base_maps = keras.layers.Dense(
    units=10, activation='softmax', 
    name='primary'
)(base_maps)

custom_model = CustomTrainStep(
    n_gradients=10, inputs=[input], outputs=[base_maps]
)
# bind all
custom_model.compile(
    loss = keras.losses.CategoricalCrossentropy(),
    metrics = ['accuracy'],
    optimizer = keras.optimizers.Adam()
)
# data 
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.expand_dims(x_train, -1)
x_train = tf.repeat(x_train, 3, axis=-1)
x_train = tf.divide(x_train, 255)
x_train = tf.image.resize(x_train, [size,size]) # if we want to resize 
y_train = tf.one_hot(y_train , depth=10) 

# customized fit 
custom_model.fit(x_train, y_train, batch_size=64, epochs=3, verbose = 1)

Update

I've found that others have also tried to achieve this and ended up with the same issue. One of them got a workaround, here, but it's too messy, and I think there should be a better approach.

Update 2

The accepted answer (by Mr. For Example) is fine and works well with a single-device strategy. Now, I'd like to start a 2nd bounty to extend it to support multi-GPU, TPU, and mixed-precision training. There are some complications; see the details. A minimal sketch of what I mean by multi-GPU support is shown below.
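For reference, this sketch assumes the CustomTrainStep class from the accepted answer below and only shows the standard strategy-scope setup; one known complication is that apply_gradients inside tf.cond can raise a merge_call error under a distribution strategy (see the comments below).

import tensorflow as tf

# Hypothetical sketch: assumes CustomTrainStep (from the accepted answer)
# is already defined. Variables and the optimizer must be created inside
# the strategy scope so they are mirrored across devices.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inp = tf.keras.Input(shape=(28, 28))
    feat = tf.keras.layers.Flatten()(inp)
    out = tf.keras.layers.Dense(10, activation='softmax')(feat)
    custom_model = CustomTrainStep(n_gradients=10, inputs=[inp], outputs=[out])
    custom_model.compile(
        loss=tf.keras.losses.CategoricalCrossentropy(),
        optimizer=tf.keras.optimizers.Adam(),
        metrics=['accuracy'],
    )
# fit() then shards each batch across the available GPUs as usual:
# custom_model.fit(x_train, y_train, batch_size=64, epochs=3)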


2 Answers


Yes, it is possible to customize the .fit() method by overriding train_step without a custom training loop. The following simple example shows how to train a simple MNIST classifier with gradient accumulation:

import tensorflow as tf
 
class CustomTrainStep(tf.keras.Model):
    def __init__(self, n_gradients, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        self.n_acum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        self.gradient_accumulation = [
            tf.Variable(tf.zeros_like(v, dtype=tf.float32), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        self.n_acum_step.assign_add(1)

        x, y = data
        # Gradient Tape
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        # Calculate batch gradients
        gradients = tape.gradient(loss, self.trainable_variables)
        # Accumulate batch gradients
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])
 
        # If n_acum_step has reached n_gradients, apply the accumulated gradients to update the variables; otherwise do nothing
        tf.cond(tf.equal(self.n_acum_step, self.n_gradients), self.apply_accu_gradients, lambda: None)

        # update metrics
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accu_gradients(self):
        # apply accumulated gradients
        self.optimizer.apply_gradients(zip(self.gradient_accumulation, self.trainable_variables))

        # reset
        self.n_acum_step.assign(0)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign(tf.zeros_like(self.trainable_variables[i], dtype=tf.float32))

# Model 
input = tf.keras.Input(shape=(28, 28))
base_maps = tf.keras.layers.Flatten(input_shape=(28, 28))(input)
base_maps = tf.keras.layers.Dense(128, activation='relu')(base_maps)
base_maps = tf.keras.layers.Dense(units=10, activation='softmax', name='primary')(base_maps) 
custom_model = CustomTrainStep(n_gradients=10, inputs=[input], outputs=[base_maps])

# bind all
custom_model.compile(
    loss = tf.keras.losses.CategoricalCrossentropy(),
    metrics = ['accuracy'],
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3) )

# data 
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = tf.divide(x_train, 255)
y_train = tf.one_hot(y_train , depth=10) 

# customized fit 
custom_model.fit(x_train, y_train, batch_size=6, epochs=3, verbose = 1)

Outputs:

Epoch 1/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.5053 - accuracy: 0.8584
Epoch 2/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.1389 - accuracy: 0.9600
Epoch 3/3
10000/10000 [==============================] - 13s 1ms/step - loss: 0.0898 - accuracy: 0.9748
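To make those numbers concrete (a quick back-of-the-envelope, not part of the original run): MNIST has 60,000 training samples, so batch_size=6 gives the 10,000 steps per epoch shown in the progress bar, while n_gradients=10 means the optimizer only applies an update every 10th step.

samples, batch_size, n_gradients = 60_000, 6, 10
steps_per_epoch = samples // batch_size             # 10000, matches the log above
updates_per_epoch = steps_per_epoch // n_gradients  # 1000 optimizer updates per epoch
effective_batch = batch_size * n_gradients          # 60 samples per weight update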

Pros:

Gradient accumulation is a mechanism that splits the batch of samples used for training a neural network into several mini-batches which are run sequentially.


GA calculates the loss and gradients after each mini-batch, but instead of updating the model parameters it waits and accumulates the gradients over consecutive batches. This lets it overcome memory constraints, i.e., train the model with less memory than a large batch size would require.

Example: if you run gradient accumulation with 5 steps and a batch size of 4 images, it serves almost the same purpose as running with a batch size of 20 images; the sketch below checks this numerically.
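A small toy check (my own sketch with a least-squares loss, not from the original post): averaging the mean gradients of equal-sized mini-batches reproduces the mean gradient of the full batch.

import tensorflow as tf

w = tf.Variable([1.0, -2.0])
x = tf.random.normal([20, 2])
y = tf.random.normal([20])

def grad(xb, yb):
    # mean-reduced loss, as Keras losses are by default
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((tf.linalg.matvec(xb, w) - yb) ** 2)
    return tape.gradient(loss, w)

full_grad = grad(x, y)                  # one batch of 20
accum_grad = tf.add_n(
    [grad(x[i:i + 4], y[i:i + 4]) for i in range(0, 20, 4)]
) / 5.0                                 # 5 accumulated mini-batches of 4
print(tf.reduce_max(tf.abs(full_grad - accum_grad)))  # ~0 up to float error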

We can also parallelize training when using GA, i.e., aggregate gradients from multiple machines.

Things to consider:

This technique works so well that it is widely used. There are a few things to consider before using it, but I don't think they should be called cons; after all, all GA does is turn 4 + 4 into 2 + 2 + 2 + 2.

If your machine has sufficient memory for a batch size that is already large enough, there is no need to use it: it is well known that too large a batch size will lead to poor generalization, and it will certainly run slower if you use GA to achieve the same batch size that your machine's memory can already handle.

Reference:

What is Gradient Accumulation in Deep Learning?

  • Thanks, buddy <^..^>. Can you elaborate on the **GA** process? How does `tf.cond` work here, based on `n_gradients`, at steps 0, 1, 2, 3? How does `apply_accu_gradients` operate here? If I use `n_gradients = 10` and `batch_size = 64`, I guess `train_step` will execute 10 times, each time with 64 training pairs. And using such a large size, `global batch = batch_size * n_gradients`, GPU utilization should be much higher than with `n_gradients = 1`; but on my local machine (2070), it consumes almost the same. – Innat Mar 08 '21 at 08:11
  • And [this](https://stackoverflow.com/a/62683800/9215780) answer raises a concern about using `tf.Variable`; what do you think about that? – Innat Mar 08 '21 at 08:11
  • I'll add some more detail about GA in the answer. `tf.cond(condition, true_fn, false_fn)` is like an `if` condition in graph mode; what it does in my code above is: if `n_acum_step` reaches `n_gradients`, we apply the accumulated gradients to update the variables, otherwise we do nothing. GPU utilization is not affected by `n_gradients` but by the `batch_size` at each step; that is the whole point of why we use GA, so we can overcome memory constraints for an arbitrarily large global batch size. – Mr. For Example Mar 08 '21 at 09:00
  • And the statement that `tf.Variable()` should be avoided inside the training loop is correct; that's why I define the variables at the beginning of the class and only assign values to those variables inside the training loop. – Mr. For Example Mar 08 '21 at 09:02
  • @M.Innat talk to me if you think I've misunderstood some of your question :) – Mr. For Example Mar 08 '21 at 09:13
  • Sorry, went to take lunch :p, please let me see your updated answer. – Innat Mar 08 '21 at 09:28
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/229645/discussion-between-m-innat-and-mr-for-example). – Innat Mar 08 '21 at 09:45
  • I wrote some comments in the discussion chat; I think you missed that. Rewriting here: I've found that using GA also increases the model's non-trainable parameter count significantly. Have you noticed it? – Innat Mar 10 '21 at 06:09
  • Even when I set `n_gradients = 1`, which technically means no GA, just the usual case. – Innat Mar 10 '21 at 07:06
  • May I ask how you came up with that observation? From the process of GA, the non-trainable parameters that were added are the three variables we define at the beginning of the class, i.e., `n_gradients`, `n_acum_step`, `gradient_accumulation`. – Mr. For Example Mar 10 '21 at 08:33
  • I just checked `model.summary()`. Please correct me. – Innat Mar 10 '21 at 08:40
  • Yeah, just like what I said above: the trainable params are 101,770, and the non-trainable params, 101,771, are the number of variables in `gradient_accumulation` (which equals the number of trainable params) plus `n_acum_step`, which is one :) Don't worry about it, the memory will not be doubled, because with GA or not we have to calculate and store the gradients for each trainable param; it's only that with my code above we store the gradients inside the model's non-trainable params, so the number looks a little scary. – Mr. For Example Mar 10 '21 at 08:51
  • Ask me in chat if you have more thoughts; I'll be there until you consider this question solved. – Mr. For Example Mar 10 '21 at 09:08
  • Yes, I also observed that it didn't affect the memory issue. Just wondering. – Innat Mar 10 '21 at 09:16
  • Ok, all my tests are complete. Thanks again. Enjoy your reward, brother. I will take a break and play PUBG for some time :p – Innat Mar 10 '21 at 09:40
  • Cheers, have a good day, hope we meet again :) – Mr. For Example Mar 10 '21 at 09:46
  • If I do this I am getting "_RuntimeError: `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients),.._". (see also https://stackoverflow.com/questions/68121006/gradient-accumulation-assign-add-not-working) – Stefan Falk Jun 25 '21 at 05:58
  • When implementing this on my model I notice that the GPU memory consumption drops from 90%+ to about 15%. This behaviour seems really strange to me, why would this change in memory use occur? – Bidski Jul 29 '21 at 07:46
  • Turns out that the way my model is created means that `self.trainable_variables` is empty until after the first call to `self(x, training=True)`, so the creation of `self.gradient_accumulation` has to be deferred until after that – Bidski Jul 29 '21 at 08:30
  • This is a very useful answer, but as an extension, how do you save and load these models that were made with CustomTrainStep? I can't save these models as hdf5 (ValueError: Unable to create dataset (name already exists)) nor can I load them if I do end up saving them successfully as `tf` (TypeError: __init__() missing 2 required positional arguments: 'inputs' and 'outputs') – Nick Camarda Apr 09 '23 at 23:28

Thanks to @Mr. For Example for his convenient answer.

Usually, I have also observed that using gradient accumulation doesn't speed up training, since we are doing n_gradients forward passes and computing all the gradients before each update. But it does speed up the convergence of our model. And I found that using the mixed_precision technique can be really helpful here. Details here.

policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)

Here is a complete gist.
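To give a flavor of what the accumulating train_step has to do under this policy (a minimal sketch of my own using the public LossScaleOptimizer API, assuming TF 2.4+, not the exact code from the gist): the loss is scaled before differentiation and the gradients are unscaled before they are accumulated or applied.

import tensorflow as tf

tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Toy model and data, purely illustrative
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())

x = tf.random.normal([8, 4])
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    pred = model(x, training=True)                   # float16 activations
    loss = tf.reduce_mean(tf.square(tf.cast(pred, tf.float32) - y))
    scaled_loss = opt.get_scaled_loss(loss)          # scale to avoid float16 underflow
scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
grads = opt.get_unscaled_gradients(scaled_grads)     # unscale before accumulating/applying
opt.apply_gradients(zip(grads, model.trainable_variables))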
