
We're loading a symbol from a shared library via dlsym() under GNU/Linux and apparently hit some kind of race condition that results in a segmentation fault. The backtrace looks something like this:

(gdb) backtrace
#0  do_lookup_x at dl-lookup.c:366
#1  _dl_lookup_symbol_x at dl-lookup.c:829
#2  do_sym at dl-sym.c:168
#3  _dl_sym at dl-sym.c:273
#4  dlsym_doit at dlsym.c:50
#5  _dl_catch_error at dl-error.c:187
#6  _dlerror_run at dlerror.c:163
#7  __dlsym at dlsym.c:70
#8  ... (our code)

My local machine uses glibc-2.23.

I discovered that the library handle given to __dlsym() in frame #7 differs from the handle passed on to _dlerror_run(). Things go wrong in the following lines of dlsym.c:

void *
__dlsym (void *handle, const char *name DL_CALLER_DECL)
{
# ifdef SHARED
  if (__glibc_unlikely (_dlfcn_hook != NULL))
    return _dlfcn_hook->dlsym (handle, name, DL_CALLER);
# endif

  struct dlsym_args args;
  args.who = DL_CALLER;
  args.handle = handle; /* <------------------ this isn't my handle! */
  args.name = name;

  /* Protect against concurrent loads and unloads.  */
  __rtld_lock_lock_recursive (GL(dl_load_lock));

  void *result = (_dlerror_run (dlsym_doit, &args) ? NULL : args.sym);

  __rtld_lock_unlock_recursive (GL(dl_load_lock));

  return result;
}

GDB says

(gdb) frame 7
#7  __dlsym at dlsym.c:70
(gdb) p *(struct link_map *)args.handle
$36 = {l_addr= 140736951484536, l_name = 0x7fffe0000078 "\300\215\r\340\377\177", ...}

so this is obviously garbage. The same garbage shows up in the inner frames, e.g. in frame #2:

(gdb) frame 2
#2  do_sym at dl-sym.c:168
(gdb) p handle
$38 = {l_addr= 140736951484536, l_name = 0x7fffe0000078 "\300\215\r\340\377\177", ...}

Unfortunately the parameter handle in frame #7 can't be displayed:

(gdb) p handle
$37 = <optimized out>

but surprisingly in frame #8 and further down in our code the handle was correct:

(gdb) frame 8
#8 ...
(gdb) p *(struct link_map *)libHandle
$38 = {l_addr = 140737160646656, l_name = 0x7fffd8005b60 "/path/to/libfoo.so", ...}

My conclusion is that the variable args must get modified during the execution of __dlsym(), but I can't see where or why.

I have to confess that there's a second aspect to this problem: it only occurs in a multi-threaded environment, and only sometimes. But as you can see, the implementation of __dlsym() takes countermeasures against race conditions: it calls __rtld_lock_lock_recursive() and __rtld_lock_unlock_recursive(), and the local variable args isn't shared across threads. Curiously enough, the problem persists even if I make frame #8 mutually exclusive among my threads.

Question 1: What are possible sources for the discrepancy in the library handle between frame #8 and frame #7?

Question 2: Does dlopen() yield different values for different threads? Or to put it differently: is it possible to share the handles returned by dlopen() between different threads?
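To make Question 2 concrete, here is a minimal sketch of the pattern being asked about: one dlopen() call, with the resulting handle handed to several threads that each call dlsym(). It is not our actual code; the names resolve_from_threads, struct lookup, worker and MAX_THREADS are hypothetical. The Linux man pages list dlopen() and dlsym() as MT-Safe, so under that assumption every thread should resolve the same address.

```c
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16 /* hypothetical upper bound for this sketch */

/* Hypothetical per-thread job: the shared handle plus a result slot. */
struct lookup {
    void *handle;       /* handle from dlopen(), shared by all threads */
    const char *symbol; /* symbol name to resolve */
    void *address;      /* filled in by the worker thread */
};

static void *worker(void *arg)
{
    struct lookup *job = arg;
    /* Every thread uses the very same handle value. */
    job->address = dlsym(job->handle, job->symbol);
    return NULL;
}

/* Opens `library` once (NULL means the main program), resolves `symbol`
 * from `nthreads` threads and returns how many of them obtained the same
 * non-NULL address as the first thread, or -1 if dlopen() failed. */
int resolve_from_threads(const char *library, const char *symbol, int nthreads)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    void *handle = dlopen(library, RTLD_NOW);
    if (handle == NULL)
        return -1;

    pthread_t threads[MAX_THREADS];
    struct lookup jobs[MAX_THREADS];
    int started = 0;

    for (int i = 0; i < nthreads; i++) {
        jobs[i] = (struct lookup){ handle, symbol, NULL };
        if (pthread_create(&threads[i], NULL, worker, &jobs[i]) != 0)
            break;
        started++;
    }

    int agree = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        if (jobs[i].address != NULL && jobs[i].address == jobs[0].address)
            agree++;
    }

    dlclose(handle);
    return agree;
}
```

A call such as resolve_from_threads("/path/to/libfoo.so", "some_symbol", 4) would be expected to return 4 if the handle can be shared; "some_symbol" is a placeholder. On glibc versions before 2.34 this needs to be linked with -ldl -pthread.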

Update: I thank everybody who commented on this question and tried to answer it despite the lack of almost any viable information to do so. I found the solution to this problem. As foreseen by the commenters, it was totally unrelated to the stack traces and other information I provided. Hence, I consider this question closed and will flag it for deletion. So Long, and Thanks for All the Fish

phlipsy
  • In an optimized build local variables may not be available for inspection. Run your application under valgrind. The error is most probably in the code you do not present here, not in glibc. – Maxim Egorushkin Oct 24 '16 at 09:37
  • @Maxim: I almost completely agree with you, but nevertheless it strikes me that the structure `args` in `__dlsym()` does not contain the right data. I'll give valgrind a shot, but I suspect that I'll drown in errors irrelevant to my problem. – phlipsy Oct 24 '16 at 09:46
  • Also try the thread sanitizer available in gcc and clang. – Maxim Egorushkin Oct 24 '16 at 09:48
  • Debugging optimized code might show false values... If you have no better idea, recompile libc with flags `-O0 -g` – Zsigmond Lőrinczy Oct 24 '16 at 17:11
  • @ZsigmondLőrinczy GLIBC can *not* be built with `-O0`. – Employed Russian Oct 25 '16 at 05:00
  • "backtrace looks something like this" -- you should provide (relevant part of) *actual* backtrace. In programming, details *matter*, and you've omitted pretty much all relevant details. – Employed Russian Oct 25 '16 at 05:01
  • @Employed Russian: I'm really sorry, I know you're right, in this case all details could matter. But unfortunately, I can't provide all information due to business obligations. Which further information could be of use here? – phlipsy Oct 25 '16 at 06:59
  • @Employed Russian: haven't tried yet. What would happen, if someone tried? Anyways, some distros have 'glibc-debug' package. – Zsigmond Lőrinczy Oct 26 '16 at 05:21
  • @ZsigmondLőrinczy You'll get an error to the effect that GLIBC can not be built without optimizations. "glibc-debug" that many distros provide is built for *fully-optimized* GLIBC. – Employed Russian Oct 26 '16 at 05:34

1 Answer


What are possible sources for the discrepancy in the library handle between frame #8 and frame #7?

The most likely cause is a mismatch between ld-linux.so and libdl.so. As stated in this answer, ld-linux and libdl must come from the same build of GLIBC, or bad things will happen.

The mismatch can come from (A) trying to point to a different libc build via LD_LIBRARY_PATH, or (B) by static linking of libdl.a into the program.

The GDB command info shared shows which libraries are currently loaded. If you see something other than the installed system ld-linux and libdl, then (A) is likely your problem.
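The same check can also be made from inside the process. The following is a minimal sketch, assuming a dynamically linked program on glibc, that uses dl_iterate_phdr() to print the path of every loaded object whose name contains a given substring. The names count_objects_matching, struct match and visit are hypothetical helpers, not part of any library.

```c
#define _GNU_SOURCE /* needed for dl_iterate_phdr() on glibc */
#include <assert.h>
#include <link.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical search state passed through the callback. */
struct match {
    const char *needle; /* substring to look for in the object path */
    int count;          /* number of matching objects seen so far */
};

/* Called by dl_iterate_phdr() once per loaded object. */
static int visit(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    struct match *m = data;
    const char *name = info->dlpi_name ? info->dlpi_name : "";

    if (strstr(name, m->needle) != NULL) {
        /* The main program is reported with an empty name. */
        printf("%s\n", name[0] ? name : "(main program)");
        m->count++;
    }
    return 0; /* returning zero keeps the iteration going */
}

/* Prints and counts the loaded objects whose path contains `needle`. */
int count_objects_matching(const char *needle)
{
    assert(needle != NULL);
    struct match m = { needle, 0 };
    dl_iterate_phdr(visit, &m);
    return m.count;
}
```

Calling count_objects_matching("ld-linux") and count_objects_matching("libdl") prints the paths actually in use, which can then be compared against the system-installed ones. On glibc 2.34 and later libdl is merged into libc, so no separate libdl entry may show up.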

For (B), you probably got (and ignored) a linker warning to the effect that your program will require at runtime the same libc version that you used to link it. Contrary to popular belief, fully-static binaries are less portable on Linux, not more.

Employed Russian
  • We're doing neither of these. We're using the system-provided ld-linux.so and libdl.so and don't link against a static libdl.a. But this would be an interesting and hideous bug indeed: the library handles in use coming from different sources or something like this :-) – phlipsy Oct 25 '16 at 06:52