I'm trying to use sprof to profile some software (ossim) where almost all the code is in a shared library. I've generated a profiling file, but when I run sprof, I get the following error:

> sprof /home/eca7215/usr/lib/libossim.so.1 libossim.so.1.profile -p > log
Inconsistency detected by ld.so: dl-open.c: 612: _dl_open: Assertion `_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT' failed!

The instructions I was following said that I needed libc version 2.5-34 or later; I have libc version 2.12.2 (Gentoo, kernel 2.6.36-r5).

I can't find any explanation of what the error means or (more interestingly) how to fix it. The only half-relevant Google results are for a bug in an old version of Skype.

Edward
  • As far as I can tell, it's a bug in glibc; it shows up if you google RT_CONSISTENT and look at the Red Hat Bugzilla entries. I'm using oprofile instead now. – Matthew Smith Oct 05 '11 at 05:10
  • Dunno if it works, but there is some info in this answer about sprof usage for .so files: http://stackoverflow.com/questions/1838989/gprof-how-to-generate-call-graph-for-functions-in-shared-library-that-is-linke – Alexis Wilke Aug 02 '15 at 05:02

3 Answers

I got a bit curious since this is still broken in openSUSE 12.x. I would have thought a bug originally reported in '09 or so would have been fixed by now. I guess nobody really uses sprof, or maybe dl-open is so fragile that people are scared to touch it :-)

The issue boils down to the __RTLD_SPROF flag passed as an argument to dlopen. Take any simple program that calls dlopen, OR that flag into the second argument, and you get the same failed assertion. I used the sample program at the bottom of http://linux.die.net/man/3/dlopen as an example:

handle = dlopen(argv[1], RTLD_LAZY | __RTLD_SPROF);

From what I can tell from a quick look at dl-open.c, this flag short-circuits some of what dl_open does, so the r_state field checked in the assertion never gets set to RT_CONSISTENT.

bpmelli
  • This is kind of annoying. Many people suggest oprofile instead, but so far I haven't been able to build that either. Do you have any suggestions on how to profile a shared library? – dirac3000 Sep 09 '13 at 17:04

I got this error with the PyTorch DataLoader when using multiple workers. Python does multiprocessing by launching many processes, and one of the processes hit this error while reading a file in read-only mode (for the CIFAR10 dataset). Simply re-running the script solved the issue, so I believe this is some sort of sporadic, rare OS error. With PyTorch, setting num_workers=0 may also help resolve the error.

Below is the full error in case anyone is interested:

Inconsistency detected by ld.so: dl-open.c: 272: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!
Traceback (most recent call last):
  File "/miniconda/envs/petridishpytorchcuda92/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 724, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/miniconda/envs/petridishpytorchcuda92/lib/python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File "/miniconda/envs/petridishpytorchcuda92/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/miniconda/envs/petridishpytorchcuda92/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 272) exited unexpectedly with exit code 127. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

Shital Shah
  • I also got this error after switching from Python 3.7 to Python 3.8 and from PyTorch 1.4.0 to 1.7.1. The code runs inside a Docker container. It does not seem to be a sporadic error. – Klamer Schutte Mar 05 '21 at 10:09

If you're using Docker, there could be another explanation. In my case the profiling data was generated by a process running inside a Docker container. When I ran sprof from within the container, I got the same error as described in the question. Running sprof from the host (instead of the container) solved it.

Dan Shemesh