
I am running a server application on a Gentoo (3.4.66) machine.

Sometimes the application segfaults almost immediately after start, but once it is up and running, no segfault occurs at all. So far this has only happened in the debug build and not (yet) in the release build. This is quite confusing, as I usually have to deal with it the other way round.

When I start the application from gdb, it looks like this:

Program received signal SIGSEGV, Segmentation fault.
0xb67db7a9 in ?? () from /lib/libc.so.6
(gdb) bt
#0  0xb67db7a9 in ?? () from /lib/libc.so.6
#1  0x00000002 in ?? ()
#2  0x00000001 in ?? ()
#3  0xb68a0158 in ?? () from /lib/libc.so.6
#4  0x00000040 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Sometimes it looks like this:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb1f62b90 (LWP 6624)]
0xb67d81c4 in ?? () from /lib/libc.so.6
(gdb) back
#0  0xb67d81c4 in ?? () from /lib/libc.so.6
#1  0x00000000 in ?? ()

From what I know so far, the corrupt stack and those odd-looking addresses (like 0x00000001) seem to point to some kind of stack smashing.

Additional information

I already checked for side effects inside assert() expressions (for example, assert(readFile(filename)) does not execute readFile in release mode) but could not find any.
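To illustrate the assert() pitfall mentioned above, here is a minimal sketch. The readFile body, the g_readCalls counter, and the loadFragile/loadRobust helpers are hypothetical stand-ins, not code from the actual server. When NDEBUG is defined (typical for release builds), the assert macro expands to nothing, so any call placed inside it is never made.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the readFile() mentioned above.
// The counter only exists to make the side effect observable.
static int g_readCalls = 0;

bool readFile(const std::string& filename)
{
    ++g_readCalls;
    return !filename.empty();
}

// Fragile: if NDEBUG is defined, assert() expands to nothing,
// so readFile() is never called in such a build.
void loadFragile(const std::string& filename)
{
    assert(readFile(filename));
}

// Robust: the call always happens; only the check is compiled out.
void loadRobust(const std::string& filename)
{
    const bool ok = readFile(filename);
    assert(ok);
    (void)ok; // avoids an unused-variable warning when NDEBUG is defined
}
```

Searching the code base for assert( calls whose argument is a function call with side effects is a quick way to rule this class of bug in or out.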

The application is built on a different system from the one it runs on. Both have the same libraries and dependencies, as far as I can tell from ldd.

The output shown is from a debug build of the application.

Can anyone tell me how to solve this, or give hints as to where the issue might be? Maybe there is a way to extract something from the addresses of the stack frames?

Edit: (Updated)
Unfortunately, I only have the release libc available. The machine has only local network access, and there is no package manager installed (-bash: emerge: command not found).

I start the app from GDB with this command in my binary folder: LD_LIBRARY_PATH=$PWD/usr/local/lib gdb Server-Linux, where "Server-Linux" is a link to the debug executable.

Edit2:
Thanks to the hint from @n.m., I now have valgrind and can run the app under it. It gives me 142 loss records and:

==15835== LEAK SUMMARY:
==15835==    definitely lost: 544 bytes in 16 blocks.
==15835==    indirectly lost: 8,212 bytes in 2 blocks.
==15835==      possibly lost: 26,192 bytes in 1,062 blocks.
==15835==    still reachable: 141,010 bytes in 800 blocks.
==15835==         suppressed: 0 bytes in 0 blocks.

Also interesting: running it under valgrind (low performance) has not crashed yet within seven tries. I am checking for race conditions and will keep the question updated.

  • I believe your stack pointer is corrupted, so there is no point in trying to investigate anything from that backtrace. Try to run your application with efence, valgrind, etc. – Klaus Dec 05 '17 at 13:18
  • Valgrind, asan and ubsan are your friends. – n. m. could be an AI Dec 05 '17 at 13:27
  • I am not able to install anything on the run machine. Updated the question – MauriceRandomNumber Dec 05 '17 at 14:25
  • "There is no possibility to install" no such thing. If you can run programs on the target machine, you can build tools from source. – n. m. could be an AI Dec 05 '17 at 20:39
  • Are you actually *running* the app under GDB, or are you analyzing a core dump? – Employed Russian Dec 05 '17 at 23:08
  • Running under GDB, but it also crashes without GDB. I updated the question with the full run command – MauriceRandomNumber Dec 06 '17 at 09:41
  • @n.m. you are right, valgrind is now installed on the machine, but I am struggling to build the debug version of libc (I need to use gcc 3.4.4 for the whole server) and am still investigating. Valgrind gives me a tremendous amount of output with 142 loss records. I need some time to check that. – MauriceRandomNumber Dec 06 '17 at 09:48
  • Why do you need the debug version of libc? I have caught countless bugs with valgrind without ever using debug libc... – n. m. could be an AI Dec 06 '17 at 09:54
  • @n.m. from some research on the internet, I assumed that frames like `#0 0xb67d81c4 in ?? () from /lib/libc.so.6` are a result of the release libc, and that the debug libc would give me a method name. Is this not the case? – MauriceRandomNumber Dec 06 '17 at 10:02
  • If your stack is corrupt you cannot get any method names, and if it is intact you should be able to get (just) method names from the release libc. Run valgrind and see if you get any error report with a good stack trace. Also run ltrace and see what is your last libc call and last syscall before the crash. You can set breakpoints on those and see if the parameters passed are legit. Also try the sanitizers. – n. m. could be an AI Dec 06 '17 at 10:13
  • Thank you again. I got a lot of input from valgrind and finally found the problem. For anyone interested: I am using C++98/03. When an abstract base class is used as the element type of a container (e.g. a std::vector of base-class pointers), deleting through the base-class pointer is not enough: the additional members of the derived classes were not freed when delete was called on the base-class pointer. I had to make the destructor of the base class virtual. My coworkers called it a 'classic'. Thanks for the input, it really did help a lot. – MauriceRandomNumber Jan 23 '18 at 14:32
  • Ran out of space. This question here https://stackoverflow.com/questions/461203/when-to-use-virtual-destructors describes exactly my problem. – MauriceRandomNumber Jan 23 '18 at 14:39
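
The fix described in the last two comments can be sketched as follows. This is an illustrative reconstruction, not the server's actual code: the class names Handler and BufferedHandler, the g_derivedDestroyed counter, and the destroyAll helper are all hypothetical. It is written in C++98/03 style to match the toolchain mentioned above. Deleting a derived object through a base-class pointer whose destructor is not virtual is undefined behaviour in C++, which would be consistent with crashes that appear only sometimes and only in some builds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter, used only to observe that derived destructors run.
static int g_derivedDestroyed = 0;

// Hypothetical abstract base class.
class Handler
{
public:
    // The fix: without 'virtual' here, 'delete' through a Handler*
    // that points to a derived object is undefined behaviour.
    virtual ~Handler() {}
    virtual void handle() = 0;
};

// Hypothetical derived class that owns an extra resource.
class BufferedHandler : public Handler
{
public:
    BufferedHandler() : buffer_(new char[64]) {}
    ~BufferedHandler() { delete[] buffer_; ++g_derivedDestroyed; }
    void handle() { assert(buffer_ != 0); }
private:
    // Declared but not defined: non-copyable in C++98 style.
    BufferedHandler(const BufferedHandler&);
    BufferedHandler& operator=(const BufferedHandler&);
    char* buffer_;
};

// Deletes every element through the base-class pointer and empties the container.
void destroyAll(std::vector<Handler*>& handlers)
{
    for (std::size_t i = 0; i < handlers.size(); ++i)
        delete handlers[i];
    handlers.clear();
}
```

With the virtual destructor in place, each delete in destroyAll runs ~BufferedHandler first and then ~Handler, so the buffer owned by the derived class is released.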
