
I am trying to train my model using tensorflow.keras, but after some number of iterations it fails with an out-of-memory (OOM) error. TensorFlow 2.0 has marked many things as deprecated, and I can't tell how I am supposed to diagnose the problem.

The network is a series of Conv1D layers and a few self-attention layers that convert one sequence into another. The sequences are variable length, but there is no correlation between sequence length and when it fails; e.g. it may process a 6-minute sequence fine but fail on a 4-minute one.

with tensorflow.device('/device:gpu:0'):
    m2t = BuildGenerator()  # builds and returns the model
    m2t.compile(optimizer='adam', loss='mse')
    for epoch in range(1):
        for inout in InputGenerator(params):  # yields (inputs, targets) pairs
            m2t.train_on_batch(inout[0], inout[1])

Things I have tried:

  1. Removing the self-attention layers. It still fails.
  2. Removing all but a small number of layers. It still fails.
  3. Padding all sequences to a constant length. It still fails.
  4. Using m2t.predict(inout[0]) instead of train_on_batch. It still fails, just after a longer time.
  5. Using tensorflow.summary.trace_export. It records something, but the trace doesn't load in Chrome the way the page HERE suggests it should (a sketch of the kind of trace setup I mean is below).
  6. I looked at THIS answer, but with the changes in TF-2.0, I'm not sure how to do the equivalent properly.
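The kind of trace setup I mean for item 5, roughly (a sketch; the log directory name is a placeholder):

import tensorflow

writer = tensorflow.summary.create_file_writer('logs/trace')
tensorflow.summary.trace_on(graph=True, profiler=True)

m2t.train_on_batch(inout[0], inout[1])  # one batch while tracing is active

with writer.as_default():
    tensorflow.summary.trace_export(name='train_trace', step=0,
                                    profiler_outdir='logs/trace')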

There are no other calls into tensorflow or keras.

EDIT: As requested, here are sample error logs. It is a slightly different error every time.

A few of these appear, with a few successful runs in between:

W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Then it prints this, followed by a giant list of "# chunks of size ..." and "InUse ..." lines:

W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 43.26MiB (rounded to 45360128).  Current allocation summary follows.
I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):   Total Chunks: 79, Chunks in use: 79. 19.8KiB allocated for chunks. 19.8KiB in use in bin. 2.2KiB client-requested in use in bin.
...
I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 8.40GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 9109728768 memory_limit_: 9109728789 available bytes: 21 curr_region_allocation_bytes_: 17179869184
I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
    Limit:                  9109728789
    InUse:                  9024084224
    MaxInUse:               9024084224
    NumAllocs:                   38387
    MaxAllocSize:           1452673536

W tensorflow/core/common_runtime/bfc_allocator.cc:424] 


W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1,45000,12,21] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File ".\TrainGNet.py", line 380, in <module>
    m2t.train_on_batch(inout[0], inout[1])
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 973, in train_on_batch
    class_weight=class_weight, reset_metrics=reset_metrics)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 264, in train_on_batch
    output_loss_metrics=model._output_loss_metrics)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\keras\engine\training_eager.py", line 311, in train_on_batch
    output_loss_metrics=output_loss_metrics))
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\keras\engine\training_eager.py", line 268, in _process_single_batch
    grads = tape.gradient(scaled_total_loss, trainable_weights)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\eager\backprop.py", line 1014, in gradient
    unconnected_gradients=unconnected_gradients)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\eager\imperative_grad.py", line 76, in imperative_grad
    compat.as_str(unconnected_gradients.value))
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\eager\backprop.py", line 138, in _gradient_function
    return grad_fn(mock_op, *out_grads)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\ops\math_grad.py", line 251, in _MeanGrad
    return math_ops.truediv(sum_grad, math_ops.cast(factor, sum_grad.dtype)), None
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\util\dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\ops\math_ops.py", line 1066, in truediv
    return _truediv_python3(x, y, name)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\ops\math_ops.py", line 1005, in _truediv_python3
    return gen_math_ops.real_div(x, y, name=name)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\tensorflow_core\python\ops\gen_math_ops.py", line 7950, in real_div
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,45000,12,21] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RealDiv] name: truediv/

EDIT 2 and 3: Here is a minimal example. It fails after printing '11' for me. (Edit 3 reduced the size significantly.)

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Reshape, Conv1D, TimeDistributed
import numpy as np
import tensorflow

def BuildGenerator():
    i = Input(shape=(None, 2))

    # Inner model applied to each timestep: a 252-way softmax reshaped to (12, 21).
    n_input = 12 * 21
    to_n = Input(shape=(n_input,))
    s_n = Dense(n_input, activation='softmax')(to_n)
    s_n = Reshape((12, 21))(s_n)
    n_base = Model(inputs=[to_n], outputs=[s_n])

    b = Conv1D(n_input, 11, dilation_rate=1, padding='same', activation='relu', data_format='channels_last')(i)
    n = TimeDistributed(n_base)(b)

    return Model(inputs=[i], outputs=[n])

def InputGenerator():
    # Yields all-zero dummy sequences of 600,000 timesteps.
    for iter in range(1000):
        print(iter)
        i = np.zeros((1, 10*60*1000, 2))
        n = np.zeros((1, 10*60*1000, 12, 21))
        yield ([i], [n])

with tensorflow.device('/device:gpu:0'):

    m2t = BuildGenerator()

    m2t.compile(optimizer='adam', loss='mse')

    for epoch in range(1):
        for inout in InputGenerator():
            m2t.train_on_batch(inout[0], inout[1])
Tetragramm
  • You may have run out of memory for the model to run. **The error logs that TensorFlow gives are needed to tell more about what went wrong.** – Ramesh Kamath Oct 14 '19 at 07:23
  • @RameshKamath I have added a minimal example that requires no extra content but still reproduces the problem, as well as a sample error message. – Tetragramm Oct 14 '19 at 23:04

3 Answers


My simple recommendation:

  • Decrease your batch size to the minimum: start with 1 and then grow it from there.

In most cases, this helps.
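If each batch is already a single full-length sequence, as in the question, the same idea can be applied by slicing every sequence into shorter windows before calling train_on_batch. A sketch, where the window length of 60000 is an arbitrary value to tune:

window = 60000  # arbitrary chunk length; shrink it until batches fit in GPU memory
for inout in InputGenerator(params):
    x, y = inout[0][0], inout[1][0]  # arrays of shape (1, T, ...) from the question's generator
    for start in range(0, x.shape[1], window):
        m2t.train_on_batch([x[:, start:start + window]],
                           [y[:, start:start + window]])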

eugen
  • Then what do I think? I think you need to provide more details, like your whole code; otherwise one is left to wonder what could possibly have gone wrong. – eugen Oct 14 '19 at 16:17
  • Both error messages and a minimal example have been added. – Tetragramm Oct 14 '19 at 23:02

You can try using:

tf.config.gpu.set_per_process_memory_fraction(0.5)
tf.config.gpu.set_per_process_memory_growth(True)

in TF-2.0. Remember to declare these before any other operation; you can simply add them at the beginning of your code.
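In the released TF 2.0 API the closest equivalents appear to live under tf.config.experimental; a sketch, where the 4096 MB cap is just an example value:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Let TensorFlow grow GPU memory on demand instead of pre-allocating all of it.
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Alternatively, cap how much GPU memory TensorFlow may use:
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])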

Physicing
  • No success. Same problem. Also, those don't seem to be valid methods in TF-2.0. I found the memory_growth, but no memory_fraction. I expect that would just make it fail sooner. – Tetragramm Oct 14 '19 at 23:03
  • @Tetragramm The second one was replaced by the method below, which however still gives me the error. Try switching to the CPU version; I only started having these issues when I switched to GPU... tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)]) – Mr-Programs Jun 02 '20 at 04:24

This is a memory allocation problem: TensorFlow tries to allocate the entire model graph, with weights, on the GPU, but the GPU's memory is not enough for a large model and its weights. (TensorFlow allocates more memory for layers than strictly required, and CNNs also need extra memory.)

You can try using the CPU if you have enough system RAM to hold the model, or reduce the shape and size of the model. Setting a GPU memory fraction can also help.

You can also partition the GPU memory so that TensorFlow cuts the model up and trains it in cycles, one part after another.

Use `watch nvidia-smi` to track the model's NVIDIA GPU memory usage while you optimize the model.
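To try the CPU path, the device context from the question can simply be pointed at the CPU instead of the GPU; a sketch reusing the question's code:

# Same training loop as in the question, forced onto the CPU.
with tensorflow.device('/device:CPU:0'):
    m2t = BuildGenerator()
    m2t.compile(optimizer='adam', loss='mse')
    for epoch in range(1):
        for inout in InputGenerator():
            m2t.train_on_batch(inout[0], inout[1])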

Ramesh Kamath
  • This is not the problem, because it does allocate the entire model graph, and runs for several batches. Based on further experimentation, I believe this to be a memory leak in TimeDistributed. [Tensorflow Issue](https://github.com/tensorflow/tensorflow/issues/33178) – Tetragramm Oct 15 '19 at 23:29
  • When I got the `OOM` errors, the model was just starting to build. I searched for the error and found that the whole model was too big to fit in GPU or system memory; my problem was solved when I set the GPU fraction to `0.7`. That happened for `Tensorflow <= 1.10`. **If it ran for some time and then got an `OOM` error, it may be because of a memory leak bug in `Tensorflow 2.0`.** Share your computer's memory details (GPU, CPU, etc.), as `EDIT 2` worked on my computer without any problem **(for tensorflow-gpu==1.10)** and I cannot replicate the error *(without changing tensorflow-gpu, CUDA, etc.)* – Ramesh Kamath Oct 17 '19 at 08:21
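For anyone hitting the same leak: one possible way to sidestep TimeDistributed in the minimal example is to let Dense act directly on the 3-D Conv1D output, since Keras Dense is applied to the last axis of whatever rank it receives. This is only a sketch of a workaround, not a verified fix:

def BuildGeneratorNoTD():
    # Same layout as the minimal example, but without TimeDistributed.
    i = Input(shape=(None, 2))
    b = Conv1D(12 * 21, 11, padding='same', activation='relu')(i)
    n = Dense(12 * 21, activation='softmax')(b)  # applied independently at each timestep
    n = Reshape((-1, 12, 21))(n)
    return Model(inputs=[i], outputs=[n])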