
I am trying to compile a C++ application which depends on LibTorch, the C++ distribution of PyTorch (https://pytorch.org/), on an HPC server.

I have loaded CUDA 11.8 via a module load.

nvcc -V outputs

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

With or without the CUDA module loaded, nvidia-smi outputs:

Tue May 23 22:12:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I have loaded CMake 3.23.1 via a module load.

I have loaded GCC-12.2.0 via a module load.

I downloaded the latest libtorch release, libtorch-shared-with-deps-2.0.1+cu118.zip, from the official website and unzipped the archive.

I created a CMakeLists.txt file just as recommended by the libtorch documentation.

I use

$ cmake -DCMAKE_PREFIX_PATH=path_to_libtorch_folder ..

The CMakeLists.txt is:

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example-app)

set(CMAKE_C_COMPILER "gcc")
set(CMAKE_CXX_COMPILER "g++")

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -pedantic -Wall")


find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

find_package(OpenMP)

add_executable(example-app example-app.cpp)

target_include_directories(example-app PUBLIC "/work4/clf/ouatu/trial_Murakami_CPP_SHTNS_PMD_LibTorch/shtns_install_omp_GNU/include")
target_link_directories(example-app PUBLIC "/work4/clf/ouatu/trial_Murakami_CPP_SHTNS_PMD_LibTorch/shtns_install_omp_GNU/lib")

target_link_directories(example-app PUBLIC "/opt/NVIDIA/NVIDIA-Linux-x86_64-460.73.01/")

if(OpenMP_CXX_FOUND)
  target_link_libraries(example-app PUBLIC "${TORCH_LIBRARIES}" OpenMP::OpenMP_CXX fftw3_omp fftw3 m shtns_omp)
endif()

set_property(TARGET example-app PROPERTY CXX_STANDARD 17)

with its output:

-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /apps20/sw/amd/GCCcore/11.3.0/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /apps20/sw/amd/GCCcore/11.3.0/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Warning (dev) at libtorch/share/cmake/Caffe2/public/cuda.cmake:29 (find_package):
  Policy CMP0074 is not set: find_package uses <PackageName>_ROOT variables.
  Run "cmake --help-policy CMP0074" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Environment variable CUDA_ROOT is set to:

    /apps20/sw/amd/CUDA/11.8.0

  For compatibility, CMake is ignoring the variable.
Call Stack (most recent call first):
  libtorch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:10 (find_package)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /apps20/sw/amd/CUDA/11.8.0 (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /apps20/sw/amd/CUDA/11.8.0/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /apps20/sw/amd/CUDA/11.8.0/bin/nvcc
-- Caffe2: CUDA toolkit directory: /apps20/sw/amd/CUDA/11.8.0
-- Caffe2: Header version is: 11.8
-- /apps20/sw/amd/CUDA/11.8.0/lib/libnvrtc.so shorthash is 672ee683
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Autodetected CUDA architecture(s):  8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
-- Found Torch: /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch.so
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Configuring done
CMake Warning at CMakeLists.txt:15 (add_executable):
  Cannot generate a safe runtime search path for target example-app because
  files in some directories may conflict with libraries in implicit
  directories:

    runtime library [libnvrtc.so.11.2] in /apps20/sw/amd/CUDA/11.8.0/lib may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcufft.so.10] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcurand.so.10] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcublas.so.11] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs
    runtime library [libcublasLt.so.11] in /apps20/sw/amd/CUDA/11.8.0/lib64 may be hidden by files in:
      /apps20/sw/amd/CUDA/11.8.0/lib/stubs

  Some of these libraries may not be found correctly.

Then I do:

$ cmake --build . --config Release

with its output:

Consolidate compiler generated dependencies of target example-app
[ 50%] Linking CXX executable example-app
[100%] Built target example-app
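
For reference, example-app.cpp boils down to the standard LibTorch CUDA check; below is a minimal sketch (my real file also calls into SHTNS/FFTW, which is irrelevant here):

// example-app.cpp -- minimal sketch of the CUDA availability check
#include <torch/torch.h>
#include <iostream>

int main() {
  // Prints 1 if LibTorch can initialise the CUDA driver, 0 otherwise.
  std::cout << torch::cuda::is_available() << std::endl;
  return 0;
}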

I then run $ ./example-app; the output of std::cout << torch::cuda::is_available() << std::endl; is 0, so the GPU is not recognised. A warning is also printed to the screen:

[W CUDAFunctions.cpp:109] Warning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (function operator())

From searching on the internet, it seems that at runtime the loader finds a stub library and not the driver library.
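
One way to see exactly which files the loader picks up at run time, besides ldd, is glibc's LD_DEBUG:

$ LD_DEBUG=libs ./example-app 2>&1 | grep -i cuda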

I do not know how to solve this.

In the CUDA installation that $ module load CUDA/11.8.0 points to, the stubs folder is a subfolder of /apps20/sw/amd/CUDA/11.8.0/lib/.

But LD_LIBRARY_PATH is not searched recursively, is it? Thus option 2) presented in "CMake cannot resolve runtime directory path" (a related Stack Overflow question) is of no use to me.
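
A possible explanation, which I have not verified: CMake may have baked the stubs directory into the binary's RPATH/RUNPATH, which the loader consults independently of LD_LIBRARY_PATH. That can be inspected with:

$ readelf -d example-app | grep -E 'RPATH|RUNPATH'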

Anyhow, the output of $ echo $LD_LIBRARY_PATH is:

/apps20/sw/amd/CUDA/11.8.0/nvvm/lib64:/apps20/sw/amd/CUDA/11.8.0/extras/CUPTI/lib64:/apps20/sw/amd/CUDA/11.8.0/lib:/apps20/sw/amd/libarchive/3.6.1-GCCcore-11.3.0/lib:/apps20/sw/amd/XZ/5.2.5-GCCcore-11.3.0/lib:/apps20/sw/amd/cURL/7.83.0-GCCcore-11.3.0/lib:/apps20/sw/amd/OpenSSL/1.1/lib:/apps20/sw/amd/bzip2/1.0.8-GCCcore-11.3.0/lib:/apps20/sw/amd/zlib/1.2.12-GCCcore-11.3.0/lib:/apps20/sw/amd/ncurses/6.3-GCCcore-11.3.0/lib:/apps20/sw/amd/GCCcore/11.3.0/lib64:/apps20/sw/amd/binutils/2.39-GCCcore-12.2.0/lib

For completeness, the output of echo $PATH is:

/apps20/sw/amd/CUDA/11.8.0/nvvm/bin:/apps20/sw/amd/CUDA/11.8.0/bin:/apps20/sw/amd/CMake/3.23.1-GCCcore-11.3.0/bin:/apps20/sw/amd/libarchive/3.6.1-GCCcore-11.3.0/bin:/apps20/sw/amd/XZ/5.2.5-GCCcore-11.3.0/bin:/apps20/sw/amd/cURL/7.83.0-GCCcore-11.3.0/bin:/apps20/sw/amd/OpenSSL/1.1/bin:/apps20/sw/amd/bzip2/1.0.8-GCCcore-11.3.0/bin:/apps20/sw/amd/ncurses/6.3-GCCcore-11.3.0/bin:/apps20/sw/amd/GCCcore/11.3.0/bin:/apps20/sw/amd/binutils/2.39-GCCcore-12.2.0/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/local/fluka/bin:/opt/ibm/platform_mpi/bin:/home/vol02/scarf1032/.local/bin:/home/vol02/scarf1032/bin

And the output of echo $CUDA_HOME is:

/apps20/sw/amd/CUDA/11.8.0

Similarly, option 1) is of no use to me: I cannot delete anything on the cluster. I have tried $ module unload CUDA/11.8.0 before running the compiled app, but after the unload the app no longer runs at all, failing with: ./example-app: error while loading shared libraries: libnvToolsExt.so.1: cannot open shared object file: No such file or directory.

How could I run my compiled C++ app with it seeing the correct CUDA-driver libraries and not stub libraries?
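
I would have expected that prepending whichever directory actually contains the driver's libcuda.so.1 to LD_LIBRARY_PATH would take precedence, along the lines of the sketch below (the path is a placeholder, as I am not sure which directory that is on this cluster):

$ export LD_LIBRARY_PATH=/path/to/real/driver/libs:$LD_LIBRARY_PATH
$ ./example-app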

I believe the driver libraries are at /opt/NVIDIA/NVIDIA-Linux-x86_64-460.73.01/32/, a folder with the following contents:

libEGL.so.1.1.0                   libGLX_nvidia.so.460.73.01        libnvidia-compiler.so.460.73.01   libnvidia-ml.so.460.73.01
libEGL_nvidia.so.460.73.01        libGLdispatch.so.0                libnvidia-eglcore.so.460.73.01    libnvidia-opencl.so.460.73.01
libGL.so.1.7.0                    libOpenCL.so.1.0.0                libnvidia-encode.so.460.73.01     libnvidia-opticalflow.so.460.73.01
libGLESv1_CM.so.1.2.0             libOpenGL.so.0                    libnvidia-fbc.so.460.73.01        libnvidia-ptxjitcompiler.so.460.73.01
libGLESv1_CM_nvidia.so.460.73.01  libcuda.so.460.73.01              libnvidia-glcore.so.460.73.01     libnvidia-tls.so.460.73.01
libGLESv2.so.2.1.0                libglvnd_install_checker          libnvidia-glsi.so.460.73.01       libvdpau_nvidia.so.460.73.01
libGLESv2_nvidia.so.460.73.01     libnvcuvid.so.460.73.01           libnvidia-glvkspirv.so.460.73.01
libGLX.so.0                       libnvidia-allocator.so.460.73.01  libnvidia-ifr.so.460.73.01

EDIT: I checked with $ ldd example-app and indeed the stubs appear (for example, line 4 below shows libcuda.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcuda.so.1):

linux-vdso.so.1 =>  (0x00007ffc29795000)
libtorch.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch.so (0x00002ba944773000)
libc10.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libc10.so (0x00002ba944792000)
libcuda.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcuda.so.1 (0x00002ba944973000)
libnvrtc.so.11.2 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libnvrtc.so.11.2 (0x00002ba944b82000)
libnvToolsExt.so.1 => /apps20/sw/amd/CUDA/11.8.0/lib/libnvToolsExt.so.1 (0x00002ba944d84000)
libcudart.so.11.0 => /apps20/sw/amd/CUDA/11.8.0/lib/libcudart.so.11.0 (0x00002ba944f8e000)
libc10_cuda.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libc10_cuda.so (0x00002ba94483e000)
libfftw3_omp.so.3 => /lib64/libfftw3_omp.so.3 (0x00002ba945235000)
libfftw3.so.3 => /lib64/libfftw3.so.3 (0x00002ba94543c000)
libtorch_cpu.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch_cpu.so (0x00002ba9457c1000)
libtorch_cuda.so => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libtorch_cuda.so (0x00002ba95ed16000)
libcublas.so.11 => /apps20/sw/amd/CUDA/11.8.0/lib/stubs/libcublas.so.11 (0x00002ba9ad126000)
libgomp.so.1 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libgomp.so.1 (0x00002ba9ad334000)
libstdc++.so.6 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libstdc++.so.6 (0x00002ba9ad37a000)
libm.so.6 => /lib64/libm.so.6 (0x00002ba9ad58e000)
libgcc_s.so.1 => /apps20/sw/amd/GCCcore/11.3.0/lib64/libgcc_s.so.1 (0x00002ba9ad890000)
libc.so.6 => /lib64/libc.so.6 (0x00002ba9ad8aa000)
/lib64/ld-linux-x86-64.so.2 (0x00002ba94474f000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002ba9adc78000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ba9ade7c000)
librt.so.1 => /lib64/librt.so.1 (0x00002ba9ae098000)
libgomp-a34b3233.so.1 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libgomp-a34b3233.so.1 (0x00002ba9ae2a0000)
libcudart-d0da41ae.so.11.0 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcudart-d0da41ae.so.11.0 (0x00002ba9ae4ca000)
libnvToolsExt-847d78f2.so.1 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libnvToolsExt-847d78f2.so.1 (0x00002ba9ae775000)
libcudnn.so.8 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcudnn.so.8 (0x00002ba9ae980000)
libcublas-3b81d170.so.11 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcublas-3b81d170.so.11 (0x00002ba9aeba6000)
libcublasLt-b6d14a74.so.11 => /work4/clf/ouatu/trial_email_scarfhpcsupp/libtorch/lib/libcublasLt-b6d14a74.so.11 (0x00002ba9b4808000)
  • Note that setting a compiler after `project()` is plain wrong: https://stackoverflow.com/a/63944545/3440745. "From searching on the internet, it seems that at runtime the loader finds a stub library and not the driver library." - You could easily check your guesses about which libraries the loader finds with `ldd example-app`. – Tsyvarev May 23 '23 at 21:43
  • @Tsyvarev, indeed, the libraries found by the loader are stubs! I will update my question with this new information. I will also read about CMake and change where I define the compilers; please excuse my plain inability to use it correctly, as this is the first time I have written a CMakeLists.txt, following the LibTorch docs online. – velenos14 May 23 '23 at 21:46
  • Having changed where I define the compilers (so that the correct GCC 11.3.0 is found, and not the system-wide version 4.5.0 or something similar), the same problem appears: at runtime the loader finds the NVIDIA stubs and not the driver libraries. – velenos14 May 23 '23 at 21:53
  • Plainly you need to set the LD_LIBRARY_PATH correctly for the system you are running this code on – talonmies May 23 '23 at 22:47
  • @talonmies, thank you. I tried `$ unset LD_LIBRARY_PATH`, then ran the compiled application, which of course failed. Then I tried manually removing only the first 3 entries of `LD_LIBRARY_PATH`, i.e. `/apps20/sw/amd/CUDA/11.8.0/nvvm/lib64:/apps20/sw/amd/CUDA/11.8.0/extras/CUPTI/lib64:/apps20/sw/amd/CUDA/11.8.0/lib:`, then ran the app; it complains that `./example-app: error while loading shared libraries: libnvToolsExt.so.1: cannot open shared object file: No such file or directory`, and that library is located in `/apps20/sw/amd/CUDA/11.8.0/lib`, where a subfolder is called `stubs` ... – velenos14 May 23 '23 at 23:02
  • @talonmies, I then tried to leave only `/apps20/sw/amd/CUDA/11.8.0/lib` at the end of `LD_LIBRARY_PATH` and remove only the other 2 entries; the compiled application runs, but it warns me that it doesn't see the GPU, and we are back to the initial problem ... – velenos14 May 23 '23 at 23:03
  • If the runtime loader is finding the stub path, you have a problem. If you have ruled out errant `LD_LIBRARY_PATH` settings, then the runtime loader has its own set of "default" paths to check, independent of your `LD_LIBRARY_PATH` setting. If someone put the stub library paths in that (via [ldconfig](https://man7.org/linux/man-pages/man8/ldconfig.8.html), as root) then I'm not sure you would be able to fix that, unless you are root. – Robert Crovella May 24 '23 at 00:31
  • @RobertCrovella, thank you. What would the solution be then, as root? – velenos14 May 24 '23 at 08:47

1 Answer


The system administrator solved the problem.

In case anyone is having the problem I posted above, the solution was to pass the following flags when configuring the C++ application with CMake:

cmake -DCMAKE_PREFIX_PATH=<path_to_your_libtorch> -D CUDA_CUDA_LIB=/usr/lib64/libcuda.so ..

This forces linking against the NVIDIA driver's version of libcuda.so instead of the stub.
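
After reconfiguring and rebuilding, $ ldd example-app resolves libcuda.so.1 to the driver library rather than the stub, i.e. something along the lines of:

$ ldd example-app | grep libcuda
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x...)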

After this, std::cout << torch::cuda::is_available() << std::endl; in my C++ application outputs 1, rather than 0 as it did before.

The warning also disappears.

  • Thanks for this question and answer, and for enduring the entirely unwarranted toxicity from Stack Overflow devs – Anaryl Jul 07 '23 at 10:35