Cannot run CUDA code that queries NVML - error regarding libnvidia-ml.so

Question

Recently a colleague needed to use NVML to query device information, so I downloaded the Tesla Deployment Kit (TDK) 3.304.5 and copied the file nvml.h to /usr/include. To test it, I compiled the example code in tdk_3.304.5/nvml/example, and it worked fine.

Over a weekend, something changed in the system (I cannot determine what was changed and I am not the only one with access to the machine) and now any code that uses nvml.h, such as the example code, fails with the following error:

Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

However, I can still run nvidia-smi and read information about my K20m's state, and as far as I am aware nvidia-smi is just a set of calls into NVML. The error message is somewhat cryptic, but I believe it is telling me that libnvidia-ml.so needs to match the Tesla driver installed on my system. To make sure everything is correct, I re-downloaded CUDA 5.0 and installed the driver, the CUDA runtime, and the test files. I am certain that libnvidia-ml.so matches the driver (both are 304.54), so I am quite confused as to what could be going wrong. I can compile and run the test code with nvcc, and I can run my own CUDA code as long as it doesn't include nvml.h.

Has anyone encountered this error, or does anyone have thoughts on how to fix it?

$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54

$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54

$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  304.54  Sat Sep 29 00:05:49 PDT 2012
GCC version:  gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) 

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

$ whereis nvml.h
nvml: /usr/include/nvml.h

$ ldd example
        linux-vdso.so.1 =>  (0x00007fff2da66000)
        libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
        libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
        /lib64/ld-linux-x86-64.so.2 (0x000000300e000000)

EDIT: The solution was to remove all extra instances of libnvidia-ml.so. For some reason there were a LOT of them.

$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1

score 7 · Accepted Answer · answered Jul 22 '13 at 15:53

You are getting this error because the application that is trying to use nvml is loading the stub library that is located in:

...tdk_install_path/lib64/libnvidia-ml.so

instead of the one in:

/usr/lib64/libnvidia-ml.so

I was able to reproduce your error by adding the stub library path to my LD_LIBRARY_PATH environment variable. So one possible cause is that someone added the path of the stub library that ships with the TDK distribution to your LD_LIBRARY_PATH. That is probably not the only way this could happen, though: if someone copied the stub library into a system path, that could cause the same problem.
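As a rough way to check the LD_LIBRARY_PATH theory, the sketch below lists every copy of libnvidia-ml.so.1 found in the directories named by LD_LIBRARY_PATH, in the order they appear. The function name `nvml_candidates` is made up for illustration, and the sketch assumes colon-separated entries and skips empty ones; it does not reproduce the dynamic loader's full search rules (rpath, ld.so.conf, and so on).

```python
import os


def nvml_candidates(ld_library_path, soname="libnvidia-ml.so.1"):
    """Return every `soname` found in the given LD_LIBRARY_PATH string, in order.

    Hypothetical helper: it only inspects LD_LIBRARY_PATH entries and
    ignores the other places the dynamic loader searches.
    """
    hits = []
    for entry in ld_library_path.split(":"):
        if not entry:
            continue  # empty entries are skipped in this sketch
        candidate = os.path.join(entry, soname)
        if os.path.exists(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for path in nvml_candidates(os.environ.get("LD_LIBRARY_PATH", "")):
        # realpath shows where a symlink such as libnvidia-ml.so.1 ends up
        print(path, "->", os.path.realpath(path))
```

Any hit outside /usr/lib or /usr/lib64 would be worth a closer look, since the answer expects the driver's copy to live in those two directories.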

You'll need to work out why your system is loading that stub library in place of the correct one in /usr/lib64. Alternatively, as a diagnostic, you could delete every instance of the stub library on your system (leaving the correct libraries in /usr/lib and /usr/lib64 alone), after which you should see correct behavior.
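Before deleting anything, it may help to list the copies that live outside the driver's directories. The following is a minimal sketch, assuming the only trusted locations are /usr/lib and /usr/lib64 as stated above; `find_stray_nvml` and `TRUSTED_DIRS` are illustrative names, and the function only reports paths, it never removes them.

```python
import fnmatch
import os

# Driver install locations named in the answer; adjust if your system differs.
TRUSTED_DIRS = ("/usr/lib", "/usr/lib64")


def find_stray_nvml(root="/", trusted=TRUSTED_DIRS, pattern="libnvidia-ml*"):
    """Return sorted paths matching `pattern` that are not directly in a trusted dir."""
    trusted = {os.path.abspath(d) for d in trusted}
    stray = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.abspath(dirpath) in trusted:
            continue  # skip files directly here; subdirectories are still walked
        for name in fnmatch.filter(filenames, pattern):
            stray.append(os.path.join(dirpath, name))
    return sorted(stray)


if __name__ == "__main__":
    for path in find_stray_nvml():
        print(path)
```

Walking the whole filesystem can be slow and may need root to see every directory, much like running find from / does.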

Robert Crovella


Thank you for the response. I have looked at my LD_LIBRARY_PATH and haven't found any reference to the tdk directory. Running whereis on libnvidia-ml.so only gives me the /usr/lib and /usr/lib64 directories, so I'm assuming there is no conflict there, but I'm by no stretch a Linux expert, so I may be missing something. Is there another path, defined somewhere other than LD_LIBRARY_PATH, that might hold an erroneous reference? Is there a good way to check whether the stub .so is being used instead? I also removed the lib and lib64 directories from the tdk directory. – Brian R Jul 22 '13 at 16:15
Which operating system do you have? If you have an application that is built to use nvml, but is not working correctly, what happens when you run the following command: `ldd myapp` ? (change `myapp` to whatever is the name of your compiled executable) That is to say, please edit your question with the output of that `ldd` command. – Robert Crovella Jul 22 '13 at 16:27
I'm running CentOS 6.0. I didn't think to check ldd, but I will post the results now. Luckily, your advice did help me solve the issue. For whatever reason, whereis only reported the two /usr/lib* directories, but running find / -name 'libnvidia-ml*' turned up a bunch of entries scattered all over the filesystem. Removing them all has solved the problem. Kudos to you! – Brian R Jul 22 '13 at 16:36
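The `ldd` check suggested in the comments can also be scripted. The sketch below parses text in the format shown in the question's `ldd example` output and returns the path each library resolves to; `resolved_libs` and `LDD_LINE` are illustrative names, and lines without a resolved path (the vdso entry, the loader itself, or "not found" entries) are simply ignored.

```python
import re

# Matches lines such as:
#   libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")


def resolved_libs(ldd_output):
    """Map each library name in `ldd` output to the path it resolved to."""
    libs = {}
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            libs[match.group(1)] = match.group(2)
    return libs


if __name__ == "__main__":
    sample = (
        "        linux-vdso.so.1 =>  (0x00007fff2da66000)\n"
        "        libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)\n"
    )
    print(resolved_libs(sample).get("libnvidia-ml.so.1"))
```

To use it on a real binary, you could feed it the captured output of `ldd myapp`, for example via `subprocess.run(["ldd", "myapp"], capture_output=True, text=True).stdout`.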

score 1 · Answer 2 · answered Jul 11 '17 at 15:00

I solved the problem this way on a GTX 1070 running Windows 10: open Device Manager, select the GPU that is having the problem, disable it, and then enable it again.

Pro7ech


This worked for me on Windows 10 as well. At least I don't have to log out and back in again. Does anyone have a permanent solution? – Soenhay Mar 22 '18 at 02:10

Soenhay · Answer 3 · 2018-04-23T23:46:11.237

I was having this same or similar issue with EWBF Cuda Miner for zCash.

Here is a way to automate Pro7ech's answer (which worked for me) on Windows 10:

Install the WDK for Windows 10 if you don't already have it. It provides devcon.exe, which lets you manipulate devices from batch scripts: https://learn.microsoft.com/en-us/windows-hardware/drivers/download-the-wdk

You might also need the Windows SDK if you don't have Visual Studio with the "Desktop development with C++" workload: https://developer.microsoft.com/en-us/windows/downloads/windows-10-sdk

To make things easier, you might want to add the installation path to your PATH environment variable: https://www.howtogeek.com/118594/how-to-edit-your-system-path-for-easy-command-line-access/

Devcon.exe was installed here for me:

C:\Program Files (x86)\Windows Kits\10\Tools\x64

So now run this or similar in a cmd.exe prompt to get the device id:

devcon findall * | find /i "nvidia"

Here is what mine looks like:

C:\Users\Soenhay>devcon findall * | find /i "nvidia"
HDAUDIO\FUNC_01&VEN_10DE&DEV_0083&SUBSYS_38426674&REV_1001\5&1C277AD4&0&0001: NVIDIA High Definition Audio
SWD\MMDEVAPI\{0.0.0.00000000}.{574980C3-9747-42EF-A78C-4C304E070B81}: SAMSUNG (NVIDIA High Definition Audio)
ROOT\UNNAMED_DEVICE\0000                                    : NVIDIA Virtual Audio Device (Wave Extensible) (WDM)
PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000: NVIDIA GeForce GTX 1070

From that I see that my graphics device id is:

PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000
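Picking the ID out by eye works, but the step can be sketched in code too. The function below assumes `devcon findall *` prints one `instance ID: description` pair per line, as in the output above, and returns the instance IDs of PCI devices whose description mentions NVIDIA; the name `pci_nvidia_ids` is made up for illustration.

```python
def pci_nvidia_ids(devcon_output):
    """Return instance IDs of PCI devices whose description mentions NVIDIA.

    Assumes the "<instance id>: <description>" layout shown in the
    `devcon findall *` output above; other layouts are not handled.
    """
    ids = []
    for line in devcon_output.splitlines():
        instance_id, sep, description = line.partition(": ")
        instance_id = instance_id.strip()  # IDs may be padded with spaces
        is_pci = instance_id.upper().startswith("PCI\\")
        if sep and is_pci and "nvidia" in description.lower():
            ids.append(instance_id)
    return ids


if __name__ == "__main__":
    sample = (
        r"ROOT\UNNAMED_DEVICE\0000        : NVIDIA Virtual Audio Device" "\n"
        r"PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000: NVIDIA GeForce GTX 1070"
    )
    print(pci_nvidia_ids(sample))
```

Filtering on the `PCI\` prefix is what separates the graphics card from the NVIDIA audio entries in the same listing.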

So I create a batch file with the following to disable and re-enable the driver:

devcon disable "@PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"
devcon enable "@PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"

Now, when I get the NVML error on starting the miner, I just run this batch file and it fixes it. You could also add those two lines to the beginning of your start.bat file to do this every time, but I found that the error does not happen every time I restart the miner.
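If you would rather generate the two batch lines than type them, a small sketch like the one below builds them from a device instance ID, adding the @ prefix and the surrounding quotes used in the batch file above. `devcon_reset_commands` is a hypothetical helper; it only produces the command strings and does not run devcon.

```python
def devcon_reset_commands(instance_id, devcon="devcon"):
    """Build the disable/enable command lines for one device instance ID.

    The ID is quoted and prefixed with "@", matching the batch file above.
    A leading "@" already present on the ID is not duplicated.
    """
    target = '"@{}"'.format(instance_id.lstrip("@"))
    return [
        "{} disable {}".format(devcon, target),
        "{} enable {}".format(devcon, target),
    ]


if __name__ == "__main__":
    device = r"PCI\VEN_10DE&DEV_1B81&SUBSYS_66743842&REV_A1\4&1F1337ch33s3&0&0000"
    for command in devcon_reset_commands(device):
        print(command)
```

The printed lines can be pasted into a batch file or placed at the top of an existing start script.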

NOTE: The device ID in each command must be prefixed with the @ symbol, and the batch script must be run as administrator.