I am building a library (on Ubuntu 22) that uses onnxruntime under the hood. In turn, onnxruntime uses CUDA by dynamically loading a dedicated "backend" library. I build the whole code stack except for the CUDA libraries, and none of the libraries has RPATH or RUNPATH set (double-checked with readelf -d).
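For reference, the readelf check can be scripted over a whole directory. This is only a minimal sketch, assuming `readelf` is on the PATH; `parse_dynamic_section` and `scan` are hypothetical helper names, and the directory passed to `scan` is whatever install folder holds the libraries:

```python
import re
import subprocess
from pathlib import Path

# Matches the RPATH / RUNPATH lines printed by `readelf -d`, for example:
#  0x000000000000000f (RPATH)   Library rpath: [$ORIGIN/../lib]
_ENTRY = re.compile(r"\((RPATH|RUNPATH)\)\s+Library r(?:un)?path: \[(.*?)\]")


def parse_dynamic_section(readelf_output):
    """Return a dict with the RPATH/RUNPATH entries found in the output."""
    return {tag: value for tag, value in _ENTRY.findall(readelf_output)}


def scan(lib_dir):
    """Print the RPATH/RUNPATH status of every shared object in lib_dir.

    Hypothetical helper: lib_dir is an assumption, e.g. the install/lib folder.
    """
    for so in sorted(Path(lib_dir).glob("*.so*")):
        result = subprocess.run(
            ["readelf", "-d", str(so)], capture_output=True, text=True
        )
        entries = parse_dynamic_section(result.stdout)
        print(so.name, entries if entries else "no RPATH/RUNPATH")
```

Calling `scan` on the install folder prints one line per shared object, which makes it easy to spot a library that unexpectedly carries its own RPATH or RUNPATH.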

I build two apps. One is C++ and links directly to my library. The app has its RPATH set and everything works fine. If I run it with LD_DEBUG=libs, I see output like this (note that the paths are edited and I'm showing only a tiny fraction of the debug output):

    158834:     calling init: .../install/bin/../lib/libonnxruntime_providers_cuda.so
    158834:
    158834:     find library=libcudnn_ops_infer.so.8 [0]; searching
    158834:      search path=.../install/bin/../lib         (RPATH from file .../install/bin/test)
    158834:       trying file=.../install/bin/../lib/libcudnn_ops_infer.so.8
    158834:
    158834:
    158834:     calling init: .../install/bin/../lib/libcudnn_ops_infer.so.8
    158834:

This is what I expect, I'm happy.

However, I also need to use the very same library through some Python bindings that link against it. To make this work, I need to set the RPATH of the Python bindings (which, in my understanding at least, are just a shared library that gets loaded at runtime). Note that the Python executable has neither RPATH nor RUNPATH set. This works only in part: RPATH propagation seems to work while walking down the dependency tree, but it stops working as soon as the loader starts searching for the CUDA libraries. This runs exactly the same onnxruntime API in the same way, with the same build and the same files in the same folder as above. The only difference is the Python extension layer. The LD_DEBUG output looks like this:

    159602:     find library=libonnxruntime.so.1.15.1 [0]; searching
    159602:      search path=.../install/lib/../lib         (RPATH from file .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so)
    159602:       trying file=.../install/lib/../lib/libonnxruntime.so.1.15.1

[...]

    159602:     calling init: .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so
    159602:
    159602:     find library=libonnxruntime_providers_shared.so [0]; searching
    159602:      search path=.../install/lib/../lib         (RPATH from file .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so)
    159602:       trying file=.../install/lib/../lib/libonnxruntime_providers_shared.so
    159602:
    159602:
    159602:     calling init: .../install/lib/../lib/libonnxruntime_providers_shared.so
    159602:
    159602:     find library=libonnxruntime_providers_cuda.so [0]; searching
    159602:      search path=.../install/lib/../lib         (RPATH from file .../install/lib/pyext.cpython-310-x86_64-linux-gnu.so)
    159602:       trying file=.../install/lib/../lib/libonnxruntime_providers_cuda.so
    159602:
    159602:     find library=libcublas.so.11 [0]; searching
    159602:      search cache=/etc/ld.so.cache
    159602:      search path=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3:/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2:/lib/x86_64-linux-gnu/tls/haswell/x86_64:/lib/x86_64-linux-gnu/tls/haswell:/lib/x86_64-linux-gnu/tls/x86_64:/lib/x86_64-linux-gnu/tls:/lib/x86_64-linux-gnu/haswell/x86_64:/lib/x86_64-linux-gnu/haswell:/lib/x86_64-linux-gnu/x86_64:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3:/usr/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2:/usr/lib/x86_64-linux-gnu/tls/haswell/x86_64:/usr/lib/x86_64-linux-gnu/tls/haswell:/usr/lib/x86_64-linux-gnu/tls/x86_64:/usr/lib/x86_64-linux-gnu/tls:/usr/lib/x86_64-linux-gnu/haswell/x86_64:/usr/lib/x86_64-linux-gnu/haswell:/usr/lib/x86_64-linux-gnu/x86_64:/usr/lib/x86_64-linux-gnu:/lib/glibc-hwcaps/x86-64-v3:/lib/glibc-hwcaps/x86-64-v2:/lib/tls/haswell/x86_64:/lib/tls/haswell:/lib/tls/x86_64:/lib/tls:/lib/haswell/x86_64:/lib/haswell:/lib/x86_64:/lib:/usr/lib/glibc-hwcaps/x86-64-v3:/usr/lib/glibc-hwcaps/x86-64-v2:/usr/lib/tls/haswell/x86_64:/usr/lib/tls/haswell:/usr/lib/tls/x86_64:/usr/lib/tls:/usr/lib/haswell/x86_64:/usr/lib/haswell:/usr/lib/x86_64:/usr/lib            (system search path)
    159602:       trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v3/libcublas.so.11
    159602:       trying file=/lib/x86_64-linux-gnu/glibc-hwcaps/x86-64-v2/libcublas.so.11
    159602:       trying file=/lib/x86_64-linux-gnu/tls/haswell/x86_64/libcublas.so.11

 [...]

    159602:     calling fini: .../install/lib/../lib/libonnxruntime_providers_shared.so [0]

So basically libcublas is not found (nor are any of the other CUDA libs), which triggers a fallback mechanism in onnxruntime that avoids using CUDA.
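To compare the two runs without reading the full LD_DEBUG dump, the log can be reduced to "which search origin was used for which library". The following is a rough sketch, not an established tool: `search_origins` is a hypothetical helper name, and it assumes the LD_DEBUG=libs output was captured unwrapped (one record per line), for example by redirecting stderr to a file.

```python
import re

# "find library=libfoo.so.1 [0]; searching"
_FIND = re.compile(r"find library=(\S+)")
# "search path=...   (RPATH from file ...)" or "(system search path)"
_ORIGIN = re.compile(r"search path=.*\(([^()]*)\)\s*$")


def search_origins(ld_debug_text):
    """Map each searched library to the list of search origins the loader used.

    Hypothetical helper: expects the text produced by LD_DEBUG=libs.
    """
    origins = {}
    current = None
    for line in ld_debug_text.splitlines():
        found = _FIND.search(line)
        if found:
            current = found.group(1)
            origins.setdefault(current, [])
            continue
        if current is None:
            continue
        origin = _ORIGIN.search(line)
        if origin:
            origins[current].append(origin.group(1))
        elif "search cache=" in line:
            origins[current].append("ld.so.cache")
    return origins
```

Applied to the Python run above, this would list an "RPATH from file ..." origin for the onnxruntime libraries, but only "ld.so.cache" and "system search path" for libcublas.so.11, which is the difference in question.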

Why does RPATH propagation work for the C++ app but not for the Python extension? Is there something silly I'm missing, or is it something deeper, related to how libraries are loaded in the context of a Python session? Could it be a weird manifestation of a bug in onnxruntime, which may be doing something wrong with dlopen?

Note that the same issue seems to be present in the Python version of onnxruntime itself: their setup.py makes sure that all dependencies are pre-loaded, using ctypes.CDLL with RTLD_GLOBAL.
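The pre-loading approach can be sketched roughly as follows. This is not onnxruntime's actual code; `preload` is a hypothetical helper, and the directory and the list of library names (and their order) are assumptions that depend on the installation. The idea it relies on is that a library loaded by full path needs no search, and that later requests for the same soname should resolve to the already loaded object.

```python
import ctypes
import os


def preload(lib_dir, names):
    """Load the given shared libraries from lib_dir with RTLD_GLOBAL.

    Hypothetical helper: names must be ordered so that dependencies come
    first. Files that do not exist in lib_dir are skipped. Returns a dict
    mapping each loaded name to its ctypes handle.
    """
    handles = {}
    for name in names:
        path = os.path.join(lib_dir, name)
        if not os.path.exists(path):
            continue
        # A full path bypasses the loader's search-path logic entirely.
        handles[name] = ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    return handles
```

A call such as `preload(".../install/lib", ["libcublas.so.11", "libcudnn_ops_infer.so.8"])` would have to run before the extension module is imported; the names here are taken from the logs above, and the complete list of CUDA libraries needed is not shown in this question.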

  • Possibly related onnxruntime issue https://github.com/microsoft/onnxruntime/issues/9309 – ajc Jul 24 '23 at 09:58

0 Answers