I used TensorFlow to train a CNN on an NVIDIA GeForce 1060 (6 GB memory), but I got an OOM exception.
Training ran fine for the first two epochs, but the OOM exception was raised on the third epoch.
============================
2017-10-27 11:47:30.219130: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************************************************************************************xxxxxx
2017-10-27 11:47:30.265389: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[10,10,48,48,48]
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,10,48,48,48]
	 [[Node: gradients_4/global/detector_scope/maxpool_conv3d_2/MaxPool3D_grad/MaxPool3DGrad = MaxPool3DGrad[T=DT_FLOAT, TInput=DT_FLOAT, data_format="NDHWC", ksize=[1, 2, 2, 2, 1], padding="VALID", strides=[1, 2, 2, 2, 1], _device="/job:localhost/replica:0/task:0/gpu:0"](global/detector_scope/maxpool_conv3d_2/transpose, global/detector_scope/maxpool_conv3d_2/MaxPool3D, gradients_4/global/detector_scope/maxpool_conv3d_2/transpose_1_grad/transpose)]]
	 [[Node: Momentum_4/update/_540 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1540_Momentum_4/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
=============================
So I am confused: why did I get this OOM exception on the third epoch, after the first two epochs finished successfully?
Since the dataset is identical in every epoch, if the model did not fit in GPU memory I would expect the exception on the very first epoch. Yet two epochs completed without error, so why did it fail later?
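For context, one common way this delayed-OOM pattern arises (I'm not claiming this is what my code does, just illustrating the shape of the problem) is when something accumulates per epoch inside the training loop, e.g. new ops added to the TensorFlow graph or fetched tensors kept in a Python list. The process then grows a little each epoch and only crosses the memory limit after several epochs. A minimal pure-Python sketch of that accumulation, with a hypothetical `run_epoch` standing in for the training step:

```python
# Illustration only: memory use can grow across epochs even though the
# data is identical, if each epoch appends to some global structure
# (analogous to graph ops or retained tensors in TensorFlow).
accumulated = []  # stands in for the ever-growing graph / retained results

def run_epoch(epoch):
    # Each call adds new objects that are never released.
    accumulated.append([0.0] * 1000)
    return len(accumulated)

sizes = [run_epoch(e) for e in range(3)]
print(sizes)  # grows every epoch: [1, 2, 3]
```

If this were the cause, memory use would climb monotonically across epochs rather than being constant, which is something one can check with `nvidia-smi` while training.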
Any suggestions, please?