Recently a colleague needed to use NVML to query device information, so I downloaded the Tesla development kit 3.304.5 and copied the file nvml.h to /usr/include. To test, I compiled the example code in tdk_3.304.5/nvml/example and it worked fine.
Over a weekend, something changed in the system (I cannot determine what was changed and I am not the only one with access to the machine) and now any code that uses nvml.h, such as the example code, fails with the following error:
Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
However, I can still run nvidia-smi and read information about my K20m's state, and as far as I am aware nvidia-smi is just a set of calls to nvml.h. The error message I receive is somewhat cryptic, but I believe it is telling me that the nvidia-ml.so file needs to match the Tesla driver that I have installed on my system. Just to ensure everything is correct, I re-downloaded CUDA 5.0 and installed the driver, CUDA runtime, and the test files. I am certain that the nvidia-ml.so file matches the driver (both are 304.54) so I am quite confused as to what could be going wrong. I can compile and run the test code with nvcc as well as run my own CUDA code, as long as it doesn't include nvml.h.
Has anyone encountered this error or have any thoughts on rectifying the issue?
$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54
$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
$ whereis nvml.h
nvml: /usr/include/nvml.h
$ ldd example
linux-vdso.so.1 => (0x00007fff2da66000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
/lib64/ld-linux-x86-64.so.2 (0x000000300e000000)
EDIT: The solution was to remove all extra instances of libnvidia-ml.so. For some reason there were a LOT of them.
$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1