
Because the MXNet and NVIDIA forums and GitHub rarely answer questions properly, and they lack a community as good as Stack Overflow's, I am asking this question here.

Host:

Linux XYZ 5.15.0-76-generic #83~20.04.1-Ubuntu SMP Wed Jun 21 20:23:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Host Nvidia Driver Version:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   35C    P8     6W /  75W |    265MiB /  4096MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1063      G   /usr/lib/xorg/Xorg                 92MiB |
|    0   N/A  N/A      1338      G   /usr/bin/gnome-shell               26MiB |
|    0   N/A  N/A      2078      G   /usr/lib/firefox/firefox          144MiB |
+-----------------------------------------------------------------------------+

and the CUDA toolkit version is 11.6:

USER@XYZ:~$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Thu_Feb_10_18:23:41_PST_2022
Cuda compilation tools, release 11.6, V11.6.112
Build cuda_11.6.r11.6/compiler.30978841_0

Problem:

I installed nvidia-container-toolkit to run containers with NVIDIA GPU support, and pulled the mxnet/python:1.9.1_gpu_cu112_py3 Docker image. To check whether MXNet can use my GPU, I started a container:

docker run -it --runtime=nvidia --gpus all mxnet/python:1.9.1_gpu_cu112_py3 /bin/bash

Then I checked the NVIDIA driver version inside the container with nvidia-smi to make sure the container can see the GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   34C    P8     6W /  75W |    278MiB /  4096MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

This shows CUDA version 11.2. Next, inside the container, I checked MXNet by running this code in Python 3.7:

import mxnet as mx
mx.context.num_gpus()

But I got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dist-packages/mxnet/context.py", line 275, in num_gpus
    check_call(_LIB.MXGetGPUCount(ctypes.byref(count)))
  File "/usr/local/lib/python3.7/dist-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mxnet/base.h", line 458
CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination

So MXNet could not detect my GPU.

It seems that there is a mismatch between the CUDA versions on the host and in the container.
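To compare what the host and the container see, one option is to ask whichever libcuda the dynamic loader picks for its CUDA version. The following is only a minimal sketch: it assumes the library is reachable as `libcuda.so.1`, and the helper names `decode_cuda_version` and `driver_cuda_version` are made up for illustration. It relies on `cuDriverGetVersion` from the CUDA driver API, which encodes the version as 1000*major + 10*minor.

```python
import ctypes


def decode_cuda_version(raw):
    """Split the integer from cuDriverGetVersion (1000*major + 10*minor)."""
    return raw // 1000, (raw % 1000) // 10


def driver_cuda_version(lib_name="libcuda.so.1"):
    """Return (major, minor) for the libcuda the loader picks, or None."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None  # no driver library visible in this environment
    raw = ctypes.c_int(0)
    # cuDriverGetVersion returns 0 (CUDA_SUCCESS) on success.
    if lib.cuDriverGetVersion(ctypes.byref(raw)) != 0:
        return None
    return decode_cuda_version(raw.value)


if __name__ == "__main__":
    # Run once on the host and once inside the container, then compare.
    print(driver_cuda_version())
```

If the two environments print different versions, the container is probably loading a different libcuda than the one that matches the host's kernel driver.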

By googling the error message (system has unsupported display driver / cuda driver combination), I found a workaround: I removed every libcuda.so file and its symlinks from the directory /usr/local/cuda-11.2/compat/ inside the container, and after that MXNet detected the GPU:

Python 3.7.13 (default, Apr 24 2022, 01:04:09) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> mx.context.num_gpus()
1
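Before deleting anything, it may help to list what the compat directory actually contains. This is a small sketch under the assumption that the image keeps its compat libraries in `/usr/local/cuda-11.2/compat/`, as it does here; `find_compat_libcuda` is a hypothetical helper name and the function only reads the directory.

```python
import glob
import os


def find_compat_libcuda(cuda_root="/usr/local/cuda-11.2"):
    """List libcuda.so* entries in <cuda_root>/compat, symlinks included."""
    pattern = os.path.join(cuda_root, "compat", "libcuda.so*")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in find_compat_libcuda():
        # Distinguish real files from the symlinks that point at them.
        kind = "symlink" if os.path.islink(path) else "file"
        print(kind, path)
```

The listed paths are the candidates that the workaround above removes. The function returns an empty list when the directory does not exist.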

My question:

After removing libcuda.so from the container, the CUDA version reported inside the container is 11.6:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   35C    P8     6W /  75W |    282MiB /  4096MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

It seems that nvidia-container-toolkit binds libcuda.so from the host into the container. But this creates a dependency between the host and my container, which is not what I want.
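One way to check this guess from inside the container is to look for libcuda among the mount points. The sketch below assumes that the toolkit injects the driver libraries as individual bind mounts that appear in `/proc/mounts`; that is an assumption about the runtime's behaviour, not something confirmed above, and `libcuda_mounts` is a hypothetical helper name.

```python
def libcuda_mounts(mounts_text):
    """Return mount points from /proc/mounts-style text that mention libcuda."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # The second field of each /proc/mounts line is the mount point.
        if len(fields) >= 2 and "libcuda" in fields[1]:
            hits.append(fields[1])
    return hits


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as fh:
            print(libcuda_mounts(fh.read()))
    except OSError:
        print("could not read /proc/mounts")
```

A non-empty result inside the container would suggest that the libcuda in use is mounted in from outside the image rather than shipped with it.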

  1. How can I remove this dependency?

  2. Is there a way to edit the nvidia-container-toolkit configuration file so that it does not bind libcuda.so from the host?

  3. Is this problem specific to the MXNet framework?

Thanks in advance.

  • Why would you want to remove that dependency? For what it's worth, binding the host's `libcuda.so` into the container (along with all the other components of NVIDIA's drivers) is *highly desirable*, since the drivers must match exactly the version of the kernel module running on the *host*. If I were in your position, I'd move heaven and hell to get rid of any GPU related libraries and shared objects inside the container, for exactly the driver mismatch problems you experienced. – datenwolf Jul 15 '23 at 12:02
  • @datenwolf, I tested multiple Docker images with other frameworks like `Tensorflow` or `Pytorch`, and I did not have this problem with any of them. I would expect not to have to delete any files in the container. Removing GPU related libraries and shared objects inside the container is another challenge; I tried to edit the binding files of `nvidia-container-toolkit`, such as `/etc/cdi/nvidia.yaml`, but without success. – gunner gunner Jul 15 '23 at 12:23
  • Looks like you didn't downgrade CUDA correctly. Check out this tutorial: https://www.youtube.com/watch?v=5eJTzhGe2QE – SickerDude43 Jul 17 '23 at 13:43
  • @SickerDude43, why should I downgrade CUDA? If I have an RTX 4090 and I want to run a Docker container with CUDA 10.1, how could I install CUDA 10.1 on an RTX 4090? Why should I give up the capabilities of the newest CUDA version and downgrade CUDA on the host? – gunner gunner Jul 18 '23 at 05:44
  • @gunnergunner Sorry, I misread the question. The first error was caused by the wrong CUDA version. Development came to a halt at the end of last year, so I don't expect them to release any support for this soon. Another workaround would be to either downgrade CUDA or switch to another framework – SickerDude43 Jul 20 '23 at 12:05
  • @SickerDude43, I have tested multiple Docker containers, and we do not have this problem with the Pytorch and Tensorflow frameworks. – gunner gunner Jul 20 '23 at 12:48
  • @gunnergunner Honestly I'm not getting what you want to do. Like datenwolf said I also don't understand why you want to remove the "dependency" that host and container have the same driver versions? Also you already solved the problem and since I assume MXNet runs on the host with the host's hardware, why shouldn't it bind the host's libcuda as well? – SickerDude43 Jul 21 '23 at 11:54
