I am asking here because the MXNet and NVIDIA forums and GitHub did not give me a proper answer, and they do not have a community as good as Stack Overflow's.
Host:
Linux XYZ 5.15.0-76-generic #83~20.04.1-Ubuntu SMP Wed Jun 21 20:23:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Host NVIDIA driver version:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 26% 35C P8 6W / 75W | 265MiB / 4096MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1063 G /usr/lib/xorg/Xorg 92MiB |
| 0 N/A N/A 1338 G /usr/bin/gnome-shell 26MiB |
| 0 N/A N/A 2078 G /usr/lib/firefox/firefox 144MiB |
+-----------------------------------------------------------------------------+
The CUDA version on the host is 11.6:
USER@XYZ:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0
Problem:
I installed nvidia-container-toolkit so that containers can use the NVIDIA GPU.
I pulled the mxnet/python:1.9.1_gpu_cu112_py3 Docker image. To check whether MXNet can use my GPU, I started a container:
docker run -it --runtime=nvidia --gpus all mxnet/python:1.9.1_gpu_cu112_py3 /bin/bash
Then I checked the NVIDIA driver version inside the container to make sure the container runs with GPU access:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 26% 34C P8 6W / 75W | 278MiB / 4096MiB | 5% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This shows CUDA version 11.2. Next, inside the container, I checked MXNet by running the following in Python 3.7:
import mxnet as mx
mx.context.num_gpus()
But I got this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/dist-packages/mxnet/context.py", line 275, in num_gpus
check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
File "../include/mxnet/base.h", line 458
CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination
So MXNet could not detect my GPU.
There seems to be a mismatch between the CUDA version on the host and the one in the container.
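To narrow this down without going through MXNet, a minimal Python sketch like the one below can report which libcuda the process actually loads and which CUDA version that library claims. The function names are my own (hypothetical). It assumes libcuda.so.1 is resolvable by the dynamic loader, that /proc/self/maps is available, and, as far as I understand, that cuDriverGetVersion can be called without calling cuInit first.

```python
import ctypes
import os


def decode_cuda_version(raw):
    """Split CUDA's integer version (1000 * major + 10 * minor) into (major, minor)."""
    return raw // 1000, (raw % 1000) // 10


def loaded_libcuda_paths(maps_file="/proc/self/maps"):
    """Return the paths of libcuda shared objects mapped into this process."""
    paths = set()
    try:
        with open(maps_file) as f:
            for line in f:
                # Fields: address, perms, offset, dev, inode, pathname
                fields = line.split(None, 5)
                if len(fields) == 6:
                    path = fields[5].strip()
                    if "libcuda.so" in os.path.basename(path):
                        paths.add(path)
    except OSError:
        pass  # no /proc on this system, or the file is not readable
    return sorted(paths)


def driver_api_version():
    """Ask the loaded libcuda for its CUDA version; None if unavailable."""
    try:
        lib = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None
    version = ctypes.c_int(0)
    # A return value of 0 is CUDA_SUCCESS
    if lib.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return decode_cuda_version(version.value)


if __name__ == "__main__":
    print("CUDA version reported by libcuda:", driver_api_version())
    print("libcuda mapped from:", loaded_libcuda_paths())
```

If the mapped path points into a CUDA compat directory rather than the library provided by the host driver, that would be consistent with a mismatch like the one in the error above.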
By googling the error message (system has unsupported display driver / cuda driver combination) I found a workaround: I removed all libcuda.so files and their symlinks from the directory /usr/local/cuda-11.2/compat/ in the container. After that, MXNet detected the GPU successfully:
Python 3.7.13 (default, Apr 24 2022, 01:04:09)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> mx.context.num_gpus()
1
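For anyone who wants to inspect that directory before deleting anything, here is a minimal sketch that lists the libcuda.so* entries in a directory and shows where each symlink points. The helper name find_libcuda is my own (hypothetical), and the path is the one from my container; it may differ in other images.

```python
import os
from pathlib import Path


def find_libcuda(directory):
    """List (file name, symlink target or None) for every libcuda.so* entry."""
    base = Path(directory)
    if not base.is_dir():
        return []
    entries = []
    for path in sorted(base.glob("libcuda.so*")):
        target = os.readlink(path) if path.is_symlink() else None
        entries.append((path.name, target))
    return entries


if __name__ == "__main__":
    # Directory taken from the container used in this question
    for name, target in find_libcuda("/usr/local/cuda-11.2/compat"):
        print(name, "->", target if target else "(regular file)")
```

Moving these files to a backup location instead of deleting them would make it easy to restore the original state of the image.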
My question:
After removing libcuda.so in the container, the CUDA version reported inside the container is 11.6:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 On | N/A |
| 26% 35C P8 6W / 75W | 282MiB / 4096MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
It seems that nvidia-container-toolkit binds libcuda.so from the host. This creates a dependency between the host and my container, which I do not want.
How can I remove this dependency?
Is there a way to edit the nvidia-container-toolkit configuration file so that it does not bind libcuda.so from the host?
Is this problem specific to the MXNet framework?
Thanks in advance.