I'm using Keras to train a convolutional neural network with the fit_generator function, since the images are stored in .h5 files and don't fit in memory. Most of the time I'm not able to train the model: it either gets stuck in the middle of the first epoch or crashes with 'GPU sync failed' or 'CUDA_ERROR_LAUNCH_FAILED' (see the logs below). Training on the CPU works fine, but it is of course slower. I'm using two different machines and both show the same issues. My guess is that it is an installation/configuration problem, but I don't know how to fix it.
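For context, the generators I pass to fit_generator read batches directly from the .h5 files. This is only a simplified sketch of what they do (class and dataset names are illustrative, not my exact code):

import h5py
import numpy as np
from keras.utils import Sequence

# Simplified sketch of an HDF5-backed generator; dataset names are illustrative.
class H5BatchGenerator(Sequence):
    def __init__(self, h5_path, batch_size=32):
        self.file = h5py.File(h5_path, 'r')
        self.images = self.file['images']   # e.g. shape (N, 1, rows, cols), channels_first
        self.labels = self.file['labels']   # e.g. shape (N,)
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.labels) / float(self.batch_size)))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Only the requested batch is read from disk, so the full dataset
        # never has to fit in memory.
        return np.asarray(self.images[sl]), np.asarray(self.labels[sl])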
On both machines TensorFlow was installed as explained here: https://www.anaconda.com/blog/developer-blog/tensorflow-in-anaconda/
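To double-check that the conda-installed TensorFlow sees the GPU at all, I can run a minimal check like the following inside the same environment (plain TensorFlow 1.x calls, nothing project-specific):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Print the TF version, whether a GPU is usable, and the devices TF can see.
print(tf.VERSION)
print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])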
I have used this script https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh to collect the following information.
Here is the content of tf_env.txt.
First machine:
Keras 2.2.4.
== cat /etc/issue ===============================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux liph02.novalocal 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy 1.15.4
numpydoc 0.8.0
protobuf 3.6.1
tensorflow 1.12.0
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.12.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH /usr/local/cuda-9.2/lib64
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Fri Dec 28 16:13:39 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78 Driver Version: 410.78 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:00:06.0 Off | N/A |
| 22% 38C P0 57W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/Wolfram/Mathematica/11.3/SystemFiles/Components/MXNetLink/LibraryResources/Linux-x86-64/libcudart.so.9.1
Second machine:
Keras 2.2.4.
== cat /etc/issue ===============================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
== are we in docker =============================================
No
== compiler =====================================================
c++ (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
msgpack-numpy 0.4.3.2
numpy 1.15.3
numpydoc 0.8.0
protobuf 3.6.0
tensorflow 1.11.0
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.11.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'
== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Thu Jan 3 17:38:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:00:07.0 Off | N/A |
| 40% 65C P2 94W / 250W | 11747MiB / 12196MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16991 C python 11737MiB |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2.148
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7
Here are the two stack traces:
(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam/ /data/simulations/Paranal_prot/ --epochs 1 --batch_size 32 --workers 16 --model ClassifierV2 --patience 1
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-18 12:15:19.553286: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-18 12:15:20.043811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-18 12:15:20.047991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-18 12:15:20.048093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
Traceback (most recent call last):
File "classifier_training.py", line 122, in <module>
model = class_v2.get_model()
File "/data/ctasoft/cta-lstchain/cnn/classifiers.py", line 40, in get_model
self.model.add(Conv2D(16, kernel_size=(3, 3), input_shape=(1, self.img_rows, self.img_cols), data_format='channels_first', activation='relu'))
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/sequential.py", line 165, in add
layer(x)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/layers/convolutional.py", line 171, in call
dilation_rate=self.dilation_rate)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3641, in conv2d
x, tf_data_format = _preprocess_conv2d_input(x, data_format)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 3521, in _preprocess_conv2d_input
if not _has_nchw_support() or force_transpose:
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 292, in _has_nchw_support
gpus_available = len(_get_available_gpus()) > 0
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 278, in _get_available_gpus
_LOCAL_DEVICES = get_session().list_devices()
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 186, in get_session
_SESSION = tf.Session(config=config)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/data/ctasoft/anaconda3/envs/cta-dev/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: unspecified launch failure
(dev) -bash-4.2$ python classifier_training.py --dirs /data/simulations/Paranal_gam /data/simulations/Paranal_prot --workers 1 --epochs 10 --batch_size 16 --model ClassifierV2 --patience 9
Using TensorFlow backend.
ClassifierV2
Building training generator...
Building validation generator...
2018-12-29 19:29:11.142008: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2018-12-29 19:29:11.892617: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-29 19:29:11.896828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:00:06.0
totalMemory: 11.91GiB freeMemory: 11.75GiB
2018-12-29 19:29:11.896880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-29 19:29:12.960736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-29 19:29:12.960804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-29 19:29:12.960819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-29 19:29:12.961681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11366 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:00:06.0, compute capability: 6.1)
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 16, 98, 98) 160
_________________________________________________________________
conv2d_2 (Conv2D) (None, 16, 96, 96) 2320
_________________________________________________________________
average_pooling2d_1 (Average (None, 16, 48, 48) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 16, 48, 48) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 32, 46, 46) 4640
_________________________________________________________________
conv2d_4 (Conv2D) (None, 32, 44, 44) 9248
_________________________________________________________________
average_pooling2d_2 (Average (None, 32, 22, 22) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 32, 22, 22) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 15488) 0
_________________________________________________________________
dense_1 (Dense) (None, 128) 1982592
_________________________________________________________________
dropout_3 (Dropout) (None, 128) 0
_________________________________________________________________
dense_2 (Dense) (None, 256) 33024
_________________________________________________________________
dropout_4 (Dropout) (None, 256) 0
_________________________________________________________________
dense_3 (Dense) (None, 1) 257
=================================================================
Total params: 2,032,241
Trainable params: 2,032,241
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
4/8065 [..............................] - ETA: 1:52:06 - loss: 0.9940 - acc: 0.4531 - precision: 0.4947 - recall: 0.7188
2018-12-29 19:29:54.459471: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2018-12-29 19:29:54.459645: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted
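For what it's worth, the first stack trace shows the failure happening while Keras creates its TensorFlow session, so a stripped-down equivalent of that step would be something like this (my reading of the trace, not my actual training script):

import tensorflow as tf

# Mirrors what keras.backend.tensorflow_backend.get_session() does:
# build a config and open a session, which is where the first trace
# reports "CUDA runtime implicit initialization on GPU:0 failed".
config = tf.ConfigProto(allow_soft_placement=True)
sess = tf.Session(config=config)
print(sess.list_devices())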