We're loading a symbol from a shared library via dlsym()
under GNU/Linux and obviously get some kind of race condition resulting in a segmentation fault. The backtrace looks something like this:
(gdb) backtrace
#0 do_lookup_x at dl-lookup.c:366
#1 _dl_lookup_symbol_x at dl-lookup.c:829
#2 do_sym at dl-sym.c:168
#3 _dl_sym at dl-sym.c:273
#4 dlsym_doit at dlsym.c:50
#5 _dl_catch_error at dl-error.c:187
#6 _dlerror_run at dlerror.c:163
#7 __dlsym at dlsym.c:70
#8 ... (our code)
My local machine uses glibc-2.23.
I discovered, that the library handle given to __dlsym()
in frame #7 is different to the handle passed to _dlerror_run()
. It runs wild in the following lines in dlsym.c
:
void *
__dlsym (void *handle, const char *name DL_CALLER_DECL)
{
# ifdef SHARED
if (__glibc_unlikely (_dlfcn_hook != NULL))
return _dlfcn_hook->dlsym (handle, name, DL_CALLER);
# endif
struct dlsym_args args;
args.who = DL_CALLER;
args.handle = handle; /* <------------------ this isn't my handle! */
args.name = name;
/* Protect against concurrent loads and unloads. */
__rtld_lock_lock_recursive (GL(dl_load_lock));
void *result = (_dlerror_run (dlsym_doit, &args) ? NULL : args.sym);
__rtld_lock_unlock_recursive (GL(dl_load_lock));
return result;
}
GDB says
(gdb) frame 7
#7 __dlsym at dlsym.c:70
(gdb) p *(struct link_map *)args.handle
$36 = {l_addr= 140736951484536, l_name = 0x7fffe0000078 "\300\215\r\340\377\177", ...}
so this is obviously garbage. The same occurs in the higher frames, e.g. in frame #2:
(gdb) frame 2
#2 do_sym at dl-sym.c:168
(gdb) p handle
$38 = {l_addr= 140736951484536, l_name = 0x7fffe0000078 "\300\215\r\340\377\177", ...}
Unfortunately the parameter handle
in frame #7 can't be displayed:
(gdb) p handle
$37 = <optimized out>
but surprisingly in frame #8 and further down in our code the handle was correct:
(gdb) frame 8
#8 ...
(gdb) p *(struct link_map *)libHandle
$38 = {l_addr = 140737160646656, l_name = 0x7fffd8005b60 "/path/to/libfoo.so", ...}
Now my conclusion is, that the variable args
must be modified during the execution inside __dlsym()
but I can't see where and why.
I have to confess, there's a second aspect to this problem: It only occurs in a multi-threaded environment and only sometimes. But as you can see, there are some counter measures for race conditions in the implementation of __dlsym()
since they're calling __rtld_lock_(un)lock_recursive()
and the local variable args
isn't shared across threads. And curiously enough, the problem still persists, if I make frame #8 mutual exclusive among my threads.
Questions: What are possible sources for the discrepancy in the library handle between frame #8 and frame #7?
Question 2: Does dlopen()
yield different values for different threads? Or to put it differently: Is it possible to share the handles returned by dlopen()
between different threads.
Update: I thank everybody commenting on this question and trying to answer it despite the lack of almost any viable information to do so. I found the solution of this problem. As foreseen by the commenters, it was totaly unrelated to the stacktraces and other information I provided. Hence, I consider this question as closed and will flag it for deletion. So Long, and Thanks for All the Fish