
I wrote a C++ tool for off-screen rendering of 3D models. The rendering is done with the OSMesa library.

The software worked flawlessly for more than a year, and I stopped updating it about 6 months ago. In the meantime my development environment was updated multiple times.

I recently recompiled it and hit an unexpected bug.

The dynamically linked build still works as expected, but the statically linked one segfaults.

I'm assuming the error is mine, somewhere in the OSMesa configuration/compilation/linking procedure, and not in the library code, but any advice on how to debug the segmentation fault better is appreciated.

Having tried numerous variations of the compilation process without success, I'm now quite stuck. Can anyone spot something stupid I'm doing in the steps described below?


I recompiled a static version of the OSMesa library, using the same version as the shared library that works on my system (12.0.6), with all unneeded features disabled (I am on an Ubuntu-based system, and no static OSMesa library is available from the repositories):

./configure \
    --disable-xvmc \
    --disable-glx \
    --disable-dri \
    --with-dri-drivers="" \
    --with-gallium-drivers="" \
    --disable-shared-glapi \
    --disable-egl \
    --with-egl-platforms="" \
    --enable-osmesa \
    --enable-gallium-llvm=no \
    --disable-gles1 \
    --disable-gles2 \
    --enable-static \
    --disable-shared

This is the compile command of my off-screen rendering tool:

g++ -std=c++11 -Wall -O3 -g -static -static-libgcc -static-libstdc++ ./src/measure_model.cpp model.o thumbnail.o -o measure_model_debug -pthread -lOSMesa -ldl -lm -lpng -lz -lcrypto

This is a warning I get when linking statically against OSMesa; it was already present a year ago with the working static binary:

/home/XXX/XXX/backend/lambda/mesa/mesa-12.0.6/src/mesa/main/dlopen.h:52: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

This is what I get from running the tool:

Segmentation fault (core dumped)

No segmentation fault occurs if I simply skip the OSMesa context creation step (and, obviously, all the 3D rendering).

This is the backtrace:

#0  0x0000000000000000 in ?? ()
#1  0x00000000004af20a in mtx_init (type=4, mtx=0xe10f70) at ../../include/c11/threads_posix.h:215
#2  _mesa_NewHashTable () at main/hash.c:135
#3  0x000000000052f295 in _mesa_alloc_shared_state (ctx=ctx@entry=0xdcc9b0) at main/shared.c:67
#4  0x000000000046e717 in _mesa_initialize_context (ctx=ctx@entry=0xdcc9b0, api=api@entry=API_OPENGL_COMPAT, visual=, share_list=share_list@entry=0x0, driverFunctions=driverFunctions@entry=0x7fffffffcd40) at main/context.c:1192
#5  0x000000000046c870 in OSMesaCreateContextAttribs (attribList=attribList@entry=0x7fffffffd290, sharelist=) at osmesa.c:834
#6  0x000000000046ccdc in OSMesaCreateContextExt (format=, depthBits=, stencilBits=, accumBits=, sharelist=) at osmesa.c:660
#7  0x0000000000468742 in generate_thumbnail(Model*, Json::Value) ()
#8  0x0000000000401c7d in main (argc=, argv=) at ./src/measure_model.cpp:107

A statically linked binary is a strict requirement.

The segmentation fault happens on the same machine I use to compile the tool (the static OSMesa library is built on that machine too), but the non-statically linked version of the same tool does not segfault.

pangon
  • Please run the faulting program under gdb and, after the segfault, post the live output of `bt`, `info reg`, and `frame 1; disassemble`. [`mtx_init` uses some pthread](https://github.com/anholt/mesa/blob/master/include/c11/threads_posix.h#L200) mutex/mutexattr functions, so you have some problem with pthread usage in a static program. Static linking may be a bad idea here; try to relax your requirement (link glibc and pthread dynamically; to run on an older OS, ship your own copy of glibc+pthreads and use rpath to link against them). – osgx Jun 06 '17 at 01:30
  • Thanks osgx, I'm going to do this additional debugging and will update the question. Is there any known problem with using pthreads in statically linked programs? – pangon Jun 08 '17 at 02:25
  • Great, after some testing it turned out that the statically linked pthread library is in fact what causes the problem. My actual use case requires most libraries to be statically linked, but not the core ones. It is OK to have dl and pthread linked dynamically, which solves my problem. Thanks a lot. I'm disappointed to see this limitation in the way C++ binaries are linked to pthread! I hope to understand the reason for it after studying the cases I can find online. Thanks again @osgx, if you post an answer to this question I will mark it as correct and give you the bounty :) – pangon Jun 08 '17 at 03:15
  • This is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=115157 "executables linked statically with /usr/lib/nptl/libpthread.a fail", reported about a dozen years ago, with a solution by Jakub Jelinek: "*First of all, avoid -static if you can, it only creates problems*" and "*If you really need* .. **just use `-Wl,--whole-archive -lpthread -Wl,--no-whole-archive` instead of `-pthread`**" – osgx Jun 08 '17 at 04:06

1 Answer


This is what I get from running the tool: Segmentation fault (core dumped)

But no segmentation fault is produced if I simply skip the OSmesa context creation step (and obviously all the 3D rendering)

So the problem is somewhere in the OSMesa context creation. Your backtrace shows that the top frame was executing at an instruction pointer of zero (a jump to NULL or a call through a NULL pointer), so some function called from mtx_init, which is part of OSMesa context creation, resolved to address 0.

#0  0x0000000000000000 in ?? ()
#1  0x00000000004af20a in mtx_init (type=4, mtx=0xe10f70) at ../../include/c11/threads_posix.h:215
#2  _mesa_NewHashTable () at main/hash.c:135
#3  0x000000000052f295 in _mesa_alloc_shared_state (ctx=ctx@entry=0xdcc9b0) at main/shared.c:67
#4  0x000000000046e717 in _mesa_initialize_context (ctx=ctx@entry=0xdcc9b0, api=api@entry=API_OPENGL_COMPAT, visual=, share_list=share_list@entry=0x0, driverFunctions=driverFunctions@entry=0x7fffffffcd40) at main/context.c:1192
#5  0x000000000046c870 in OSMesaCreateContextAttribs (attribList=attribList@entry=0x7fffffffd290, sharelist=) at osmesa.c:834
#6  0x000000000046ccdc in OSMesaCreateContextExt (format=, depthBits=, stencilBits=, accumBits=, sharelist=) at osmesa.c:660
#7  0x0000000000468742 in generate_thumbnail(Model*, Json::Value) ()
#8  0x0000000000401c7d in main (argc=, argv=) at ./src/measure_model.cpp:107

Which function was it? According to the online sources of mtx_init() in include/c11/threads_posix.h on GitHub, it only calls pthread_mutex_init, pthread_mutexattr_init and a few other mutex-related functions from libpthread (-lpthread).
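To make the kind of calls involved concrete, here is a minimal sketch of a recursive-mutex initialisation built only on the libpthread functions named above. It is not Mesa's code: `init_recursive_mutex` and `lock_twice_and_release` are hypothetical helper names, and treating `type=4` in the backtrace as a request for a recursive mutex is an assumption.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical helper (not Mesa's mtx_init): initialises a recursive mutex
// using only libpthread calls. Returns 0 on success, an errno value otherwise.
int init_recursive_mutex(pthread_mutex_t *mtx) {
    assert(mtx != nullptr);

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }

    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) {
        rc = pthread_mutex_init(mtx, &attr);
    }

    // The attribute object is no longer needed once the mutex exists.
    pthread_mutexattr_destroy(&attr);
    return rc;
}

// Hypothetical helper: locks the mutex twice from the same thread, which only
// succeeds without deadlock for a recursive mutex, then releases both locks.
int lock_twice_and_release(pthread_mutex_t *mtx) {
    int rc = pthread_mutex_lock(mtx);
    if (rc != 0) {
        return rc;
    }
    rc = pthread_mutex_lock(mtx);
    if (rc != 0) {
        pthread_mutex_unlock(mtx);
        return rc;
    }
    pthread_mutex_unlock(mtx);
    return pthread_mutex_unlock(mtx);
}
```

If any one of these pthread functions ends up resolved to address 0 in a static binary, the first call through it produces exactly the `#0 0x0000000000000000 in ?? ()` frame seen in the backtrace.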

Why was a call to NULL produced instead of a call to the real function? Probably because glibc and/or libpthread were linked statically. The exact problem is still unidentified at this point (I was able to find a report of libpthread.a being statically linked into a shared library, which is incorrect and will never work).

In your case the only definition is a strong alias of pthread_mutex_init in glibc/nptl/pthread_mutex_init.c (line 150): strong_alias (__pthread_mutex_init, pthread_mutex_init). There may also be a weak reference to the symbol in glibc itself, probably left unresolved. Something went wrong in your link options and/or in ld's symbol resolution, so ld did not find or link nptl/pthread_mutex_init.o (a member of the libpthread.a archive), which holds the real symbol, into the final executable. ld often skips archive members it considers unused and does not link them in, which leaves the relocation pointing at NULL. A glibc expert may know the details; Employed Russian is one of the experts on SO.
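The effect of an unresolved weak reference can be shown in isolation. This is a minimal sketch assuming GCC or Clang on an ELF platform; `hypothetical_unlinked_fn` is an invented symbol that is deliberately never defined, standing in for a function whose object file the linker left out.

```cpp
#include <cstdio>

// Hypothetical symbol: declared weak and never defined anywhere, so the
// linker resolves its address to 0 instead of reporting an undefined symbol.
extern "C" int hypothetical_unlinked_fn(int) __attribute__((weak));

// True when the linker found a real definition for the weak symbol.
bool weak_symbol_is_resolved() {
    return &hypothetical_unlinked_fn != nullptr;
}

// Calls the weak function only if it is present; returns -1 otherwise.
// Calling it unconditionally would jump to address 0 and crash.
int call_if_present(int value) {
    if (!weak_symbol_is_resolved()) {
        std::fprintf(stderr, "symbol unresolved: a direct call would jump to address 0\n");
        return -1;
    }
    return hypothetical_unlinked_fn(value);
}
```

Here the caller checks the address before calling. Code that calls such a symbol without the check crashes at address 0, which would match the top frame of the backtrace.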

I suggest linking statically only against your internal libraries, and perhaps also against ordinary non-system libraries such as Mesa. You can use -Wl,-Bstatic -lyour_lib -Wl,-Bdynamic to temporarily switch to static linkage for the libraries listed in between, or use the -l: form, as in -l:libYour_lib.a, found by Radek in the same question. But do not link statically against the most basic glibc libraries such as libc, libpthread and librt. Static linking of glibc has known problems when NSS is used: the target system must have exactly the same version of the dynamic glibc for NSS to work.

If you want to package your application for older machines and you need some newer glibc features, you can also try shipping your own copy of the shared glibc libraries with the application: put them in a subdirectory, add the linker's rpath option to change the library search path, and change the INTERP section from the default ABI loader ld-linux.so.2 to your own copy of ld-linux.so.2 from your glibc version. You will still have problems with kernels that are too old, because newer glibc versions require features (syscalls, structs) of a reasonably recent kernel.

Alternatively, you can pack your application into a container such as Docker, or use some other isolation mechanism (perhaps a chroot), so that it always runs with your versions of the libraries.

UPDATE: I just found a report of a similar backtrace, with NULL in place of the nptl mutex implementation: https://bugzilla.redhat.com/show_bug.cgi?id=163083 "Statically linked C++ program using pthreads will segfault" (2005-2007). The test case calls

pthread_mutex_init(&lock, NULL);

and is built with

g++ -g -static foo.cpp -o foo -lpthread

which gives this backtrace:

#0 0x00000000 in ?? ()
#1 0x08048232 in main () at foo.cpp:7

This is apparently due to certain pthreads functions not being included in the output executable. This bug may duplicate #115157, and I apologize if so, but hopefully the included test case will be useful.

Additional info:

The suggestion in #115157 to forcibly link in all of libpthread.a is a valid workaround.

https://bugzilla.redhat.com/show_bug.cgi?id=115157 "executables linked statically with /usr/lib/nptl/libpthread.a fail" - 2004-2009 CLOSED WONTFIX

Jakub Jelinek 2004-10-29 05:26:10 EDT

First of all, avoid -static if you can, it only creates problems, both portability wise and others as well.

If you really need to create statically linked binary with -lpthread linked in, then just use -Wl,--whole-archive -lpthread -Wl,--no-whole-archive instead of -pthread. Anything else has really many problems.
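For anyone who wants to check their own toolchain, here is a small sketch in the spirit of the test case from report #163083. `exercise_mutex` is a hypothetical name, and the second build line in the comments is assembled from the workaround quoted above rather than copied from either report. The failure may not reproduce on every glibc version.

```cpp
#include <pthread.h>

// Build lines (foo.cpp is the file name used in the report; add a main()
// that calls exercise_mutex() to get a runnable test case):
//   reported to segfault:
//     g++ -g -static foo.cpp -o foo -lpthread
//   with the quoted workaround (assembled here, not taken from the report):
//     g++ -g -static foo.cpp -o foo -Wl,--whole-archive -lpthread -Wl,--no-whole-archive

// Hypothetical helper: initialises a default mutex, locks and unlocks it once,
// then destroys it. Returns 0 on success or the first non-zero error code.
int exercise_mutex() {
    pthread_mutex_t lock;

    int rc = pthread_mutex_init(&lock, NULL);
    if (rc != 0) {
        return rc;
    }

    rc = pthread_mutex_lock(&lock);
    if (rc == 0) {
        rc = pthread_mutex_unlock(&lock);
    }

    int destroy_rc = pthread_mutex_destroy(&lock);
    return rc != 0 ? rc : destroy_rc;
}
```

In a correctly linked binary the function returns 0. In a binary hit by the problem described above, the first pthread call would be expected to crash at address 0.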

osgx