I'd like to train my model with the tensorflow-gpu Docker image, which can be pulled from the official site: https://www.tensorflow.org/install/docker?hl=ja

I pulled tensorflow/tensorflow:latest-gpu-py3 and tried to run it. nvidia-smi shows the output below, which looks fine.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:01:00.0 Off |                  N/A |
|  0%   36C    P8    12W / 240W |    449MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

However, once I run my training program, an error occurs and the process is killed. It seems to detect the GPU successfully, but then it appears to switch to the CPU for training. I don't know why and would very much like to fix it. Any advice will help. Thanks.

2020-04-24 04:40:09.584129: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-24 04:40:09.614730: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:09.615432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.8225GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-24 04:40:09.615467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-24 04:40:09.615499: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-24 04:40:09.633870: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-24 04:40:09.638759: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-24 04:40:09.673825: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-24 04:40:09.678372: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-24 04:40:09.678419: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-24 04:40:09.678636: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:09.679369: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:09.679907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-24 04:40:09.680346: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-24 04:40:09.709091: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3299130000 Hz
2020-04-24 04:40:09.709813: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5ee7c50 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-24 04:40:09.709844: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-24 04:40:09.808073: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:09.808621: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5ee9ff0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-24 04:40:09.808639: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-04-24 04:40:09.808809: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:09.811968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.8225GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-24 04:40:09.812004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-24 04:40:09.812017: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-24 04:40:09.812034: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-24 04:40:09.812048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-24 04:40:09.812060: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-24 04:40:09.812074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-24 04:40:09.812084: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-24 04:40:09.812157: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:09.812612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:09.813015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-24 04:40:09.813534: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-24 04:40:10.318993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-24 04:40:10.319040: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-04-24 04:40:10.319050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-04-24 04:40:10.319956: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:10.320671: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 04:40:10.321302: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7131 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
training starts
Epoch 1/1
2020-04-24 04:40:12.664294: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2763676800 exceeds 10% of system memory.
2020-04-24 04:40:14.068163: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
Killed

After I ran pip install tensorflow-gpu==1.15.0, I get this error instead.

2020-04-24 07:33:12.894652: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-24 07:33:12.908383: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:33:12.909008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8225
pciBusID: 0000:01:00.0
2020-04-24 07:33:12.909092: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2020-04-24 07:33:12.909139: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2020-04-24 07:33:12.909184: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2020-04-24 07:33:12.909229: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2020-04-24 07:33:12.909274: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2020-04-24 07:33:12.909317: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2020-04-24 07:33:12.912485: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-24 07:33:12.912518: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-04-24 07:33:12.912823: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-24 07:33:12.937080: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3299130000 Hz
2020-04-24 07:33:12.937348: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5231640 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-24 07:33:12.937374: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-24 07:33:13.027806: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-24 07:33:13.028359: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4e45350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-24 07:33:13.028377: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-04-24 07:33:13.028453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-24 07:33:13.028462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

Epoch 1/10
2020-04-24 07:33:13.930867: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2763676800 exceeds 10% of system memory.
    1/69600 [..............................] - ETA: 62:22:56 - loss: 5.8635e-042020-04-24 07:33:16.725973: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2763676800 exceeds 10% of system memory.
    2/69600 [..............................] - ETA: 56:18:29 - loss: 3.3783e-042020-04-24 07:33:19.324047: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2763676800 exceeds 10% of system memory.
    3/69600 [..............................] - ETA: 54:16:40 - loss: 0.0038    2020-04-24 07:33:21.922656: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2763676800 exceeds 10% of system memory.
    4/69600 [..............................] - ETA: 53:18:49 - loss: 0.01262020-04-24 07:33:24.531029: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 2763676800 exceeds 10% of system memory.
   46/69600 [..............................] - ETA: 50:50:19 - loss: 0.0270
user9191983

1 Answer

Which version of TensorFlow are you using? The current version installs both CPU and GPU support. Earlier versions require installing a separate package for GPU support:

pip install tensorflow-gpu==1.15
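
The second log in the question points to a library version mismatch rather than a missing GPU: after installing 1.15, TensorFlow looks for libcudart.so.10.0, while the first log shows the image's own TensorFlow opening libcudart.so.10.1. A minimal sketch for checking, inside the container, which of these the dynamic loader can actually find. It uses only the standard library, and the helper name can_load is made up for illustration:

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library is absent or cannot be opened.
        return False


# Library names copied from the two logs in the question.
for name in ("libcudart.so.10.0", "libcudart.so.10.1"):
    print(name, "loadable" if can_load(name) else "not found")
```

If only the 10.1 name loads, the pip-installed 1.15 package and the CUDA libraries shipped in the image do not match, which would explain the "Skipping registering GPU devices..." line.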

It may also be a memory issue. See this thread: How can I solve 'ran out of gpu memory' in TensorFlow
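
For a sense of scale, the warning in the question reports a single allocation of 2763676800 bytes. A rough calculation, assuming 4-byte float32 values (an assumption, since the log does not state the dtype):

```python
ALLOCATION_BYTES = 2763676800  # from the cpu_allocator_impl warning in the question
BYTES_PER_FLOAT32 = 4          # assumption: the tensor holds float32 values

# Convert the raw byte count to GiB and to a number of elements.
allocation_gib = ALLOCATION_BYTES / 2**30
element_count = ALLOCATION_BYTES // BYTES_PER_FLOAT32

print(f"{allocation_gib:.2f} GiB per allocation")
print(f"{element_count} float32 elements")
```

The warning comes from cpu_allocator_impl.cc, so this is host RAM rather than GPU memory. A bare "Killed" message is typically what the Linux out-of-memory killer leaves behind, so the system memory available to the container is worth checking as well.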

Charles Carriere
  • Thanks a lot for the answer! I did what you suggested and the error has changed, but it's still an error... Please check the error I added to the question. – user9191983 Apr 24 '20 at 07:36
  • I halved my training dataset and it now seems to train on the GPU... so the data may be too big, or may need to be divided into more batches? – user9191983 Apr 24 '20 at 07:49
  • I would use smaller batch sizes or try running this command at the start in case you've somehow allocated memory with previous training runs: tf.keras.backend.clear_session() – Charles Carriere Apr 25 '20 at 06:27
  • Also, given that the original error was solved, you may want to mark the answer as accepted and post a new question. This will likely get you more (and better) answers than what I may be able to provide. – Charles Carriere Apr 25 '20 at 06:29
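
The smaller-batch suggestion in the comments above can be sketched without any framework: split the sample indices into fixed-size ranges and load only one range at a time. The function name iter_batches and the sizes used here are hypothetical and not taken from the question's code.

```python
def iter_batches(sample_count, batch_size):
    """Yield (start, stop) index pairs that together cover sample_count items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, sample_count, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, sample_count)


batches = list(iter_batches(10, 4))
print(batches)
```

Each pair can then be used to slice a memory-mapped array or a list of file names, so the full dataset never has to sit in RAM at once.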