
I prepare the dataset and save it as an HDF5 file. I have a custom data generator that subclasses Sequence from Keras and generates batches from the HDF5 file.

Now, when I call model.fit_generator with the train generator, the model uses the GPU and trains fast for the first 2 epochs (GPU memory is full and GPU volatile usage fluctuates nicely around 50%). However, after the 3rd epoch, GPU volatile usage drops to 0% and each epoch takes 20x as long.

What's going on here?

Youi Rabi

2 Answers


Can you try configuring the GPU as described in this guide: https://www.tensorflow.org/guide/gpu

Here is how I have done it in my program:

from platform import python_version

import keras
import tensorflow as tf

print("Running Jupyter Notebook using python version: {}".format(python_version()))
print("Running tensorflow version: {}".format(tf.__version__))
print("Running tensorflow.keras version: {}".format(tf.keras.__version__))
print("Running keras version: {}".format(keras.__version__))

gpus = tf.config.experimental.list_physical_devices('GPU')
print("Num GPUs Available: ", len(gpus))
if gpus:
  # Restrict TensorFlow to only allocate 2GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

Here is the output of the above code:

Running Jupyter Notebook using python version: 3.7.7
Running tensorflow version: 2.1.0
Running tensorflow.keras version: 2.2.4-tf
Running keras version: 2.3.1
Num GPUs Available:  1
1 Physical GPUs, 1 Logical GPUs

Your values may differ; memory_limit=2048 is the amount of memory, in MB, allocated to the GPU device.

To confirm the allocation, use nvidia-smi; with this config, Keras won't increase its memory usage beyond the limit. You said it becomes very slow after 2 epochs: does the kernel die, or does the system hang or restart? Without this config, the issue I have faced is that the system just hangs. If you are running Ubuntu 18.04 LTS, open the System Monitor tool before executing all cells in the notebook (it shows visually how many cores are in use; a periodic, constant increase means something is wrong), and once you start, check the Resources tab in System Monitor.
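If watching nvidia-smi by hand is inconvenient, the same numbers can be read from Python. The following is only a minimal sketch: the helper names `parse_gpu_stats` and `read_gpu_stats` are made up for illustration, and it assumes nvidia-smi is on the PATH and reports plain integers for every queried field (some GPUs report `[N/A]`, which this sketch does not handle).

```python
import shutil
import subprocess

# Fields requested from nvidia-smi: memory in MiB, utilization in percent.
QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (field.strip() for field in line.split(","))
        stats.append(
            {"used_mib": int(used), "total_mib": int(total), "util_pct": int(util)}
        )
    return stats


def read_gpu_stats():
    """Return one dict per GPU, or an empty list if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` at the end of every epoch (for example from a Keras callback's `on_epoch_end`) and printing the result should show at which epoch the utilization drops to 0%.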

Do:

  • A fresh run
  • Or Restart & Run All

Suspected Issue: How to prevent tensorflow from allocating the totality of a GPU memory?
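An alternative to a fixed memory_limit, also described in the TensorFlow GPU guide linked above, is to let TensorFlow grow its GPU allocation on demand. This is a rough sketch rather than a tested fix for the slowdown: the helper name `enable_memory_growth` is made up, it assumes TensorFlow 2.x, and as far as I know it should be used instead of the virtual device configuration, not together with it on the same GPU.

```python
def enable_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand.

    Returns the number of GPUs that were configured.
    """
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x is installed

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before any GPU has been initialized,
        # otherwise TensorFlow raises a RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Call it once at the top of the notebook, before building the model, and then check with nvidia-smi whether the reported memory usage starts small and grows as training proceeds.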

silentsudo

Same error here!

When you install tensorflow-gpu along with the NVIDIA toolkit, it provides a limited amount of GPU memory (2GB in my case). Due to a memory leak, it eventually releases the GPU and falls back to using the CPU.

If you want to avoid this, use Google Colab, which provides about 36.7GB of GPU memory.

Welcome_back
    Switching to Colab is not an efficient solution; you should still be able to utilize local memory if available. What if Colab goes full premium tomorrow? – silentsudo May 28 '20 at 12:20