
TF 2.0.0-gpu, CUDA 10.0, RTX 2070 Super

Hi. I have a problem with GPU memory allocation. The initial allocation is about 7 GB, as shown here:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6994 MB memory)

2020-01-11 22:19:22.983048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-11 22:19:23.786225: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 2.78G (2989634304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-11 22:19:24.159338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

Limit:        7333884724
InUse:        5888382720
MaxInUse:     6255411968
NumAllocs:    1264
MaxAllocSize: 2372141056

But I can only use about 5900 MB of memory, and allocating the rest always fails.

I guessed that to use the whole GPU memory on the RTX 2070 Super I should use two data types (float16 and float32), so I enabled a mixed precision policy with this code:

import tensorflow as tf

opt = tf.keras.optimizers.Adam(1e-4)
# Wrap the optimizer so the graph rewrite runs eligible ops in float16
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

Still, the allocation always fails.

  • Please formulate your question better. One way to do so is by starting with enough details about the code(s) you have tried and then pasting your full error trace in a nice format. Please see the Stack Overflow guide on how to format code (https://meta.stackoverflow.com/questions/251361/how-do-i-format-my-code-blocks) and (https://stackoverflow.com/editing-help). – Amit Jan 11 '20 at 13:41
  • [This](https://stackoverflow.com/questions/58575279/does-model-fit-upload-the-whole-training-dataset-to-the-gpu/58575326#58575326) may help – OverLordGoldDragon Jan 12 '20 at 01:25

1 Answer


TensorFlow memory management can be frustrating.

Main takeaway: whenever you see OOM, there really is not enough memory and you have to either reduce your model size or your batch size. TF throws OOM when it fails to allocate enough memory for the next step, regardless of how much memory has already been allocated.
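For example, with a Keras model the batch size is usually the easiest thing to lower first. A minimal, self-contained sketch (the tiny model and random data here are just placeholders, not taken from the question):

import numpy as np
import tensorflow as tf

# Tiny stand-in model and data, only to show where batch_size is set.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse')

x_train = np.random.rand(1024, 32).astype('float32')
y_train = np.random.rand(1024, 1).astype('float32')

# A smaller batch_size lowers the peak activation memory needed per step.
model.fit(x_train, y_train, batch_size=16, epochs=1)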

At startup, TF tries to allocate a reasonably large chunk of memory, equivalent to about 90-98% of the total memory available - about 5900 MB in your case. Then, when the actual data starts to take more than that, TF tries to allocate an additional sufficient amount of memory, or a bit more - the 2.78 GB here. If that does not fit, it throws OOM, as in your case: your GPU cannot fit 5.9 GB + 2.8 GB. The last chunk of 2.78 GB might actually be a little more than TF needs at that moment, but it would be used later anyway if you run multiple training steps, because the maximum required memory can fluctuate a bit between identical Session.run's.
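If you mainly want to see how much memory your model actually needs, rather than letting TF grab most of the card up front, you can switch to on-demand allocation. This is not part of the explanation above, just a sketch using TF 2.0's tf.config.experimental API; it only changes when memory gets allocated and will not make a genuinely too-large model fit:

import tensorflow as tf

# Must run before the GPU is initialized, i.e. before any model or tensor is created.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    # Allocate GPU memory on demand instead of most of it at startup,
    # so tools like nvidia-smi show what the model really uses.
    tf.config.experimental.set_memory_growth(gpu, True)

Even with growth enabled you will hit the same OOM once the model truly needs more than the card has; the reliable fixes are still a smaller model or a smaller batch.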

y.selivonchyk