
I am having trouble understanding what exactly happens when a dynamic library is loaded at runtime, and how the dynamic linker recognizes and treats "same symbols".

I've read other questions related to symbol resolution and followed all the typical recommendations (using extern "C", using -fPIC when building the library, etc.). To my knowledge, my specific problem has not been discussed so far. The paper "How to write shared libraries" (https://www.akkadia.org/drepper/dsohowto.pdf) does discuss the process of resolving library symbol dependencies, which may explain what is happening in my example below, but alas, it does not offer a workaround.

I found a post whose last comment (unfortunately unanswered) describes very much the same problem as mine:

Is there symbol conflict when loading two shared libraries with a same symbol

The only difference is that in my case the symbol is an auto-generated constructor.

Here's the setup (Linux):

  • program "master" uses a library class "Dummy" with 4 member variables, dynamically loads a shared library via dlopen(), and resolves two simple functions with dlsym()
  • the shared library "slave" also uses the library with the class "Dummy", but in a newer version with 5 member variables (an extra string)
  • when the shared library's function is called from master, accessing the newly added string member of class Dummy segfaults; apparently the string was not initialized correctly

My assumption is: the constructor of class Dummy already exists in memory since master uses this function itself, and when the shared library is loaded, it does not use its own version of the constructor but simply re-uses the existing version from master. As a result, the extra string variable is not initialized by the constructor, and accessing it segfaults.

When stepping through the assembly code while the Dummy variable d is initialized in the slave, I can see that Dummy's constructor inside the master's memory space is indeed being called.

Questions:

  1. How does the dynamic linker (dlopen()?) recognize that the class Dummy used to compile the master should be the same as the Dummy compiled into Slave, despite it being provided in the library itself? Why does the symbol lookup take the master's variant of the constructor, even though the symbol table must also contain the constructor symbol imported from the library?

  2. Is there a way, for example by passing some suitable options to dlopen() or dlsym(), to enforce usage of the Slave's own Dummy constructor instead of the one from Master (i.e. to tweak the symbol lookup/relocation behavior)?

Code: full minimalistic source code example can be found here:

https://bauklimatik-dresden.de/privat/nicolai/tmp/master-slave-test.tar.bz2

Relevant shared lib loading code in Master:

#include <iostream>
#include <dlfcn.h>  // shared library loading on Unix systems
#include "Dummy.h"

int create(void * &data);
typedef int F_create(void * &data);

int destroy(void * data);
typedef int F_destroy(void * data);

int main() {
    // use dummy class at least once in program to create constructor
    Dummy d;
    d.m_c = "Test";

    // now load dynamic library
    void *soHandle = dlopen( "libSlave.so", RTLD_LAZY );
    std::cout << "Library handle 'libSlave.so': " << soHandle << std::endl;
    if (soHandle == nullptr)
        return 1;

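    // now load constructor and destructor functions
    F_create * createFn = reinterpret_cast<F_create*>(dlsym( soHandle, "create" ) );
    F_destroy * destroyFn = reinterpret_cast<F_destroy*>(dlsym( soHandle, "destroy" ) );
    // dlsym() returns a null pointer if a symbol cannot be resolved
    if (createFn == nullptr || destroyFn == nullptr)
        return 1;

    void * data = nullptr;
    createFn(data);
    destroyFn(data);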

    return 0;
}

Class Dummy: the variant without "EXTRA_STRING" is used in Master; the variant with the extra string is used in Slave.

#ifndef DUMMY_H
#define DUMMY_H

#include <string>

#define EXTRA_STRING

class Dummy {
public:
    double          m_a;
    int             m_b;
    std::string     m_c;
#ifdef EXTRA_STRING
    std::string     m_c2;
#endif // EXTRA_STRING
    double          m_d;
};

#endif // DUMMY_H

Note: if I use exactly the same class Dummy in both Master and Slave, the code works (as expected).

Employed Russian
Morlok
  • If the shared library that you load does not match the include header file, then a segfault is a typical outcome. – stark Jan 21 '22 at 19:09
  • This might be a bit of an oversimplification: the declaration could be inside a cpp file as well, and the problem would still occur. Understanding why the dynamic linker does not just import all functions from the shared library would be helpful. – Morlok Jan 21 '22 at 19:51
  • I don't really understand what you wish to accomplish, but you could consider using `opaque datatypes` explained here: https://yosefk.com/c++fqa/ref.html#fqa-8.7 – Zsigmond Lőrinczy Jan 21 '22 at 20:25
  • I've edited the question to make it clearer. Basically, when the dynamic linker loads the library and relocates the symbols/resolves function addresses, I want it to choose the functions provided in the library rather than those already existing in the symbol table of the master's process. "Opaque datatypes" or the reference you provided have nothing to do with my problem. – Morlok Jan 21 '22 at 20:43
  • please check first answer as reference: https://stackoverflow.com/questions/34073051/when-we-are-supposed-to-use-rtld-deepbind – stvo Jan 24 '22 at 08:20

2 Answers

1

When stepping through the assembly code while the Dummy variable d is initialized in the slave, I can see that Dummy's constructor inside the master's memory space is indeed being called.

This is expected behavior on UNIX. Unlike Windows DLLs, UNIX shared libraries are designed to imitate archive libraries, and are not designed to be self-contained isolated units of code.

How does the dynamic linker (dlopen()?) recognize that the class Dummy used to compile the master should be the same as the Dummy compiled into Slave, despite it being provided in the library itself? Why does the symbol lookup take the master's variant of the constructor, even though the symbol table must also contain the constructor symbol imported from the library?

The dynamic loader doesn't care (or know anything) about any classes. It operates on symbols.

By default, each symbol reference is resolved to the first definition of that symbol which is visible to the dynamic loader (the exported symbol).

You can examine the set of symbols which are exported from any given binary with nm -CD Master and nm -CD libSlave.so.

Is there a way, for example by passing some suitable options to dlopen() or dlsym(), to enforce usage of the Slave's own Dummy constructor instead of the one from Master (i.e. to tweak the symbol lookup/relocation behavior)?

There are several ways to modify the default behavior.

The best approach is to have libSlave.so use its own namespace. That will change all the (mangled) symbol names, and will completely eliminate any collisions.

The next best approach is to limit the set of symbols which are exported from libSlave.so, by compiling with -fvisibility=hidden and adding explicit __attribute__((visibility("default"))) to the (few) functions which must be visible from that library (create and destroy in your example).
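A minimal sketch of what the library source might look like with this approach, assuming GCC or Clang and a build with -fvisibility=hidden. The SLAVE_API macro is a hypothetical helper, and the bodies of create and destroy are guesses, since the question does not show them; only their signatures are taken from the question.

```cpp
#include <string>

// Hypothetical helper macro: marks the few functions that must stay exported
// when the library is compiled with -fvisibility=hidden.
#define SLAVE_API __attribute__((visibility("default")))

// The newer class variant used inside the library. When built with
// -fvisibility=hidden, its implicitly generated constructor is not exported.
class Dummy {
public:
    double      m_a;
    int         m_b;
    std::string m_c;
    std::string m_c2;
    double      m_d;
};

// Exported entry point: allocates a Dummy and hands it out as an opaque pointer.
extern "C" SLAVE_API int create(void * &data) {
    data = new Dummy();
    return 0;
}

// Exported entry point: releases an object previously returned by create().
extern "C" SLAVE_API int destroy(void * data) {
    delete static_cast<Dummy*>(data);
    return 0;
}
```

With such a build, nm -CD libSlave.so should list create and destroy but no longer the Dummy constructors, so references to them inside the library are bound at link time rather than looked up in the master.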

Another possible approach is to link libSlave.so with the -Wl,-Bsymbolic flag, though the symbol resolution rules get pretty complicated really fast, and unless you understand them all, it's best to avoid doing this.


P.S. One might wonder why the Master binary exports any symbols -- normally only symbols referenced by other .sos used during the link are exported.

This happens because cmake uses -rdynamic when linking the main executable. Why it does that, I have no idea.

So another workaround is: don't use cmake (or at least not with the default flags it uses).

Employed Russian
  • Thanks for the insight. Indeed, running nm on both the master binary and the slave lib shows identical symbols for the automatically generated constructor. Also thanks for the idea with -rdynamic: I checked that and tried building manually without cmake, and the problem persisted. However, I managed to get this working without changing the code -> see my answer below. I'll try your suggestions with -fvisibility=hidden next. Thanks. – Morlok Jan 23 '22 at 09:58
  • @AndreasNicolai " I managed to get this working without changing the code" -- why is that a desirable goal? I know quite a bit about shared libraries, and I don't understand why your solution works. You probably don't either, which means that it could break tomorrow. – Employed Russian Jan 23 '22 at 17:14
  • Indeed, you are right - I don't really know why this works. But so far I haven't found an in-depth explanation about this behaviour. With respect to the unchanged code question: the libs to be loaded at run-time are independently developed and I have no source code access and cannot dictate any changes in the code base. So I needed a solution that works with closed-source binaries (though I admit this is suboptimal at best). – Morlok Jan 24 '22 at 09:28
  • Regarding your comment about the use of `-rdynamic`: since none of the runtime-imported libraries use any exposed function from the master, using the flag is not needed and, indeed, should be removed. See example in: https://stackoverflow.com/questions/36692315/what-exactly-does-rdynamic-do-and-when-exactly-is-it-needed In my example, however, removing the flag does not make any difference – Morlok Jan 24 '22 at 09:34
0

I followed the recommendations found in the previous answer and in "Is there symbol conflict when loading two shared libraries with a same symbol":

  • running 'nm Master' and 'nm libSlave.so' showed the same automatically generated constructor symbols:
...
000000000000612a W _ZN5DummyC1EOS_
00000000000056ae W _ZN5DummyC1ERKS_
0000000000004fe8 W _ZN5DummyC1Ev
...

So, the mangled function signatures match in both the master's binary and the slave.

When loading the library, the master's function is used instead of the library's version. To study this further, I created an even more minimalistic example, similar to the one in the post referenced above:

master.cpp

#include <iostream>

#include <dlfcn.h>  // shared library loading on Unix systems

// prototype for imported slave function
void hello();
typedef void F_hello();

void printHello() {
    std::cout << "Hello world from master" << std::endl;
}

int main() {
    printHello();

    // now load dynamic library
    void *soHandle = nullptr;
    const char * const sharedLibPath = "libSlave.so";
    // I tested different RTLD_xxx options, see text for explanations
    soHandle = dlopen( sharedLibPath, RTLD_NOW | RTLD_DEEPBIND);
    if (soHandle == nullptr)
        return 1;

    // now load shared lib function and execute it
    F_hello * helloFn = reinterpret_cast<F_hello*>(dlsym( soHandle, "hello" ) );
    if (helloFn == nullptr)
        return 1;
    helloFn();

    return 0;
}

slave.h

#pragma once

#ifdef __cplusplus
extern "C" {
#endif

void hello();

#ifdef __cplusplus
}
#endif

slave.cpp

#include "slave.h"
#include <iostream>

void printHello() {
    std::cout << "Hello world from slave" << std::endl;
}

void hello() {
    printHello(); // should call our own printHello() function
}

Note that the same function printHello() exists in both the library and the master.

I compiled both manually this time (without CMake), with the following flags:

# build master
/usr/bin/c++ -fPIC -o tmp/master.o -c master.cpp
/usr/bin/c++ -rdynamic tmp/master.o  -o Master  -ldl

# build slave
/usr/bin/c++ -fPIC -o tmp/slave.o -c slave.cpp
/usr/bin/c++ -fPIC -shared -Wl,-soname,libSlave.so -o libSlave.so tmp/slave.o

Mind the use of -fPIC in both master and slave-library.

I now tried several combinations of RTLD_xx flags and compile flags:

1.

dlopen() flags: RTLD_NOW | RTLD_DEEPBIND; -fPIC for both master and library

Hello world from master
Hello world from slave

-> result as expected (this is what I wanted to achieve)

2.

dlopen() flags: RTLD_NOW | RTLD_DEEPBIND; -fPIC for the library only

Hello world from master
Segmentation fault  (core dumped) ./Master

-> Here, a segfault happens in the line where the iostream library's cout call is made; still, the library's printHello() function is called

3.

dlopen() flags: RTLD_NOW; -fPIC for the library only

Hello world from master
Hello world from master

-> This is my original behavior, so RTLD_DEEPBIND is definitely what I need, in conjunction with -fPIC for the master's binary.

Note: while CMake automatically adds -fPIC when building shared libraries, it does not generally do this for executables; here you need to add this flag manually when building with CMake.

Note 2: Using RTLD_NOW or RTLD_LAZY does not make a difference.

Using the combination of -fPIC on both the executable and the shared lib, together with RTLD_DEEPBIND, lets the original example with the different Dummy classes work without problems.
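The dlopen() call used in this working configuration could be wrapped in a small helper like the following. This is only a sketch: the function name openIsolated is hypothetical, RTLD_DEEPBIND is a glibc extension (hence the #ifdef fallback), and the executable is assumed to be built with -fPIC as described above. On older glibc versions the program additionally needs to be linked with -ldl.

```cpp
#include <dlfcn.h>   // dlopen, dlerror
#include <iostream>

// Hypothetical helper: opens a shared library so that it prefers its own
// symbol definitions over those already present in the global scope.
// Returns a null pointer (and prints a diagnostic) if loading fails.
void * openIsolated(const char * path) {
#ifdef RTLD_DEEPBIND
    const int flags = RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND;
#else
    // Assumption: without the glibc extension we fall back to default lookup.
    const int flags = RTLD_NOW | RTLD_LOCAL;
#endif
    dlerror();   // clear any stale error state
    void * handle = dlopen(path, flags);
    if (handle == nullptr) {
        const char * msg = dlerror();
        std::cerr << "dlopen failed: " << (msg != nullptr ? msg : "unknown error") << std::endl;
    }
    return handle;
}
```

In master.cpp, the line calling dlopen() would then become soHandle = openIsolated(sharedLibPath); the rest of the program stays unchanged.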

Morlok