
I am using an NVIDIA RTX 2060 (with Turing Tensor Cores) for deep learning model training. According to online forums, enabling mixed precision training helps Turing-architecture cards train faster than without it. However, when I enabled mixed precision training, the per-step time increased instead of decreasing. I can't fathom why this is happening, and I'd really appreciate any suggestions. I've spent a lot of money on this GPU, and it's no use if I can't get it to train models faster.

Code:

import tensorflow as tf
from tensorflow import keras
def create_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32,32,3)),
        keras.layers.Dense(3000, activation='relu'),
        keras.layers.Dense(1000, activation='relu'),
        keras.layers.Dense(10, activation='sigmoid')
    ])

    
    model.compile(optimizer='SGD',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    
    return model

tf.keras.mixed_precision.set_global_policy('mixed_float16')

%%timeit -n1 -r1  # time required to execute this cell once

model = create_model()
model.fit(X_train_scaled, y_train_categorical, epochs=50)

Things You Must Know:

I have installed CUDA and cuDNN successfully and TensorFlow can detect my GPU.
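A minimal check to confirm detection (just a verification snippet, not part of the benchmark):

import tensorflow as tf

# Should list at least one physical GPU device if CUDA/cuDNN are set up correctly
print(tf.config.list_physical_devices('GPU'))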

I have installed tensorflow-gpu.

I am training my model on the CIFAR-10 dataset with an NVIDIA RTX 2060 GPU.

Jupyter Notebook I've been using to benchmark: Link

Hissaan Ali

2 Answers


Since you're using the CIFAR-10 dataset with a categorical_crossentropy loss, your last-layer activation should be softmax instead of sigmoid. Also make sure that layer outputs float32:

keras.layers.Dense(10, activation='softmax', dtype=tf.float32)

You should set the mixed-precision global policy at the very beginning, right after importing tf (see the sketch after the tips below). Here are some tips for using mixed precision on GPUs, from the official docs:

Increasing your batch size

If it doesn't affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing the batch size typically increases training throughput, i.e., the number of training elements your model can process per second.

Ensuring GPU Tensor Cores are used

Modern NVIDIA GPUs use a special hardware unit called Tensor Cores that can multiply float16 matrices very quickly. However, Tensor Cores require certain dimensions of tensors to be a multiple of 8.

In the examples below, an argument is bold if and only if it needs to be a multiple of 8 for Tensor Cores to be used.

- tf.keras.layers.Dense(**units=64**)
- tf.keras.layers.Conv2D(**filters=48**, kernel_size=7, strides=3)
- tf.keras.layers.LSTM(**units=64**)
- tf.keras.Model.fit(epochs=2, **batch_size=128**) 
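Putting these tips together, a minimal sketch of how this could look for your model (the batch_size of 64 here is only an illustration and should be tuned; note that your 3000 and 1000 unit counts are already multiples of 8):

import tensorflow as tf
from tensorflow import keras

# Set the policy right after importing TensorFlow, before any layers are created
tf.keras.mixed_precision.set_global_policy('mixed_float16')

def create_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32, 32, 3)),
        keras.layers.Dense(3000, activation='relu'),   # 3000 is a multiple of 8
        keras.layers.Dense(1000, activation='relu'),   # 1000 is a multiple of 8
        # Keep the output layer in float32 for numeric stability
        keras.layers.Dense(10, activation='softmax', dtype=tf.float32),
    ])
    model.compile(optimizer='SGD',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = create_model()
# Batch size kept a multiple of 8; doubled from the Keras default of 32
model.fit(X_train_scaled, y_train_categorical, epochs=50, batch_size=64)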

If you follow this procedure properly, you should see the benefit of mixed precision. Here is a good read from NVIDIA.

Innat
  • Do you mean that using mixed precision alone won't increase the speed, but rather that it frees up around half the memory, so we can double the batch size and get more computations done in the same amount of time? – Hissaan Ali Feb 25 '21 at 14:43
  • I tried your solution, but the performance is still the same. Increasing the batch size decreases accuracy, so it's not going to help. My problem is still there; I can't make it run faster with mixed precision. – Hissaan Ali Feb 25 '21 at 14:52
  • Can you give some reproducible code fully (not the shallow one)? – Innat Feb 25 '21 at 16:03
  • I've added the link to original notebook in my question – Hissaan Ali Feb 25 '21 at 18:30
  • A link to a different notebook is not what I asked for. If you're following that notebook, please re-run the program following the official instructions I mentioned in my answer and update your question with the new result. There are plenty of issues in that notebook. – Innat Feb 28 '21 at 08:45
  • I've already re-run the code with your instructions. Without altering the batch size nothing improves. If we double the batch size with multiples of 32, speed increases but that also leads to higher loss. – Hissaan Ali Feb 28 '21 at 12:53
  • If possible, please update with (1) the model definition, (2) the hyperparameters, and (3) the training logs, and if possible provide reproducible code. Otherwise, it's hard to break down. – Innat Feb 28 '21 at 13:00

According to the official TensorFlow guide, to use mixed precision properly the sigmoid activation at the end of the model should be float32. Because we set the mixed_float16 policy, the activation's compute_dtype is float16 by default. Thus, we have to override the policy for this layer to float32.

def create_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32, 32, 3)),
        keras.layers.Dense(3000, activation='relu'),
        keras.layers.Dense(1000, activation='relu'),
        # keras.layers.Dense(10, activation='sigmoid'), # NOTE: Replaced this line by two lines below
        keras.layers.Dense(10,),
        keras.layers.Activation('sigmoid', dtype='float32'),
    ])

    model.compile(optimizer='SGD',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model

Putting everything together, we have the complete source code for training on the CIFAR-10 dataset with mixed precision:

import tensorflow as tf
from tensorflow import keras


def create_model():
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(32, 32, 3)),
        keras.layers.Dense(3000, activation='relu'),
        keras.layers.Dense(1000, activation='relu'),
        # keras.layers.Dense(10, activation='sigmoid'),
        keras.layers.Dense(10,),
        keras.layers.Activation('sigmoid', dtype='float32'),
    ])

    model.compile(optimizer='SGD',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model


tf.keras.mixed_precision.set_global_policy('mixed_float16')

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

# There are 10 image classes
classes = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]

# Scale pixel values to [0, 1]
X_train_scaled = X_train / 255
X_test_scaled = X_test / 255

# One-hot encode the labels
y_train_categorical = keras.utils.to_categorical(y_train, num_classes=10, dtype='float')
y_test_categorical = keras.utils.to_categorical(y_test, num_classes=10, dtype='float')

with tf.device('/GPU:0'):
    model = create_model()
    model.fit(X_train_scaled, y_train_categorical, epochs=50)

model.evaluate(X_test_scaled, y_test_categorical)
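
As a quick sanity check (assuming the code above has run), you can confirm the policy is active and that only the final activation runs in float32:

# Most layers should compute in float16 while keeping float32 variables;
# the final Activation layer should compute in float32
print(tf.keras.mixed_precision.global_policy())
for layer in model.layers:
    print(layer.name, layer.compute_dtype, layer.dtype)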

With my NVIDIA RTX 2080 GPU, I compared the performance with mixed precision (called P1) and without it (called P2) and found that:

  1. Training time: Because the time per step is rounded and shows the same value for P1 and P2 (~6 ms), I compared the overall training time for 50 epochs; P1 is significantly faster than P2 (464 s vs 501 s).
  2. Testing time: P1 is still faster than P2 (3ms/step vs 4ms/step)
  3. Test performance (acc): P1 is better than P2 (~55.61% vs ~50.77%)
Thang Pham