
We are seeing a core dump at random under heavy load. When we load the core file and look at the crash location, it always points to the last line of the function, precisely the line of the closing brace.

The function has some legacy goto statements. When we had a similar issue earlier, we moved the creation of all local objects to the top of the function, and that appeared to fix the issue on Solaris 10. (We suspected, and some sample tests suggested, that when a goto statement was executed some of these local variables were never constructed but their destructors were still invoked, so moving them to the top ensured they were always constructed properly.) The problem still happens on Linux, although we no longer see it on Solaris.

Updated with stack trace:

#0  0x008a5206 in raise () from /lib/libc.so.6
#1  0x008a6bd1 in abort () from /lib/libc.so.6
#2  0x008de3bb in __libc_message () from /lib/libc.so.6
#3  0x00966634 in __stack_chk_fail () from /lib/libc.so.6
#4  0x08e9ebf5 in our_function (this=0xd2f2c380)
    at sourcefilename.cc:9887

Has anybody encountered a similar issue? We would greatly appreciate any help or pointers to understand and fix it. Thanks a ton.

  • Perhaps you could provide some code to show what's going on? – Dave Aug 15 '11 at 21:18
  • Simple: you have bugs that are likely corrupting the stack frame, so it crashes when you return. I would suggest instrumenting it with Valgrind, but without actual code we can't help. – Yann Ramin Aug 15 '11 at 21:22
  • C++ has a rule that a goto cannot skip the construction of an object, for precisely the reason you outline. So if your code was doing this, it should not compile. It sounds to me as if moving things around has just masked the real problem, but without seeing any code, who can say for sure? – john Aug 15 '11 at 21:25
  • Some background on ctors and goto in C++, http://stackoverflow.com/questions/6537948/storage-allocation-of-local-variables-inside-a-block-in-c – john Aug 15 '11 at 21:29
  • I understand that posting the code would make the issue easier to see, but it's proprietary third-party library code, so I can't post it. Obfuscating it enough to post isn't practical either: the function is almost 2000 lines and calls lots of other objects and functions. – user895643 Aug 15 '11 at 22:13
  • This question is nearly impossible to answer without seeing the referenced code. – Tim Post Aug 16 '11 at 00:10
  • @user 9000 line cc file? 2000 line function?? It's time to refactor. – Sam Miller Aug 16 '11 at 00:54
  • Sam, Tim, bmargulies, jtbandes: Give the benefit of the doubt to those who work behind corporate walls and are not at liberty to share actual code. Thanks a ton to @MarkR for explaining a possible issue and pointing towards a buffer overrun. That really helped. – user895643 Aug 17 '11 at 19:26
  • @sam-miller: Refactoring based only on the size of a function doesn't make sense for our product. Most of this code does a very specific job and is very unlikely to be reused. This function already calls about 50-60 other reusable functions. If we had to refactor our code based on size, we would end up with about 10000 files. We would also end up creating functions that cannot fulfill any specific responsibility by themselves. – user895643 Aug 17 '11 at 19:41
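
The goto rule raised in the comments above can be illustrated with a minimal sketch. Everything here is hypothetical (the `Tracer` type and the `run` function are invented for illustration and are not taken from the original code): giving the object its own block keeps the `goto` from jumping into its scope, so its destructor runs only if its constructor ran.

```cpp
#include <string>
#include <vector>

// Hypothetical helper that records when it is constructed and destroyed.
struct Tracer {
    std::vector<std::string>& log;
    explicit Tracer(std::vector<std::string>& l) : log(l) { log.push_back("ctor"); }
    ~Tracer() { log.push_back("dtor"); }
};

// Hypothetical function with a legacy-style goto to a cleanup label.
std::vector<std::string> run(bool bail_out_early) {
    std::vector<std::string> log;

    if (bail_out_early) {
        goto cleanup;   // legal: the jump does not enter the scope of `t`
    }

    {   // Own block, so the `cleanup` label lies outside the scope of `t`.
        Tracer t(log);
        log.push_back("work");
    }   // `t` is destroyed here, and only if it was constructed

    // Declaring `Tracer t(log);` here at function scope, without the braces,
    // would make the goto above ill-formed, and a conforming compiler
    // should reject it.

cleanup:
    log.push_back("cleanup");
    return log;
}
```

Calling `run(true)` skips the block entirely, so neither the constructor nor the destructor of `Tracer` runs, while `run(false)` constructs and destroys it before reaching the label.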

1 Answer


I suspect you're overrunning a buffer on a downward-growing stack (most stacks grow downwards; I don't know whether Linux or Solaris use downward-growing stacks on all architectures, but they definitely do on some). The overrun overwrites the return address, so when the function returns the program counter jumps to an illegal address, which produces the crash precisely where the function returns.

Just run it under Valgrind; it will probably tell you what's happening (or rather, where the overrun is).

MarkR
  • Thank you, Mark. We will review the code and also try Valgrind to see what we can gather. I posted the stack trace. According to the documentation, __stack_chk_fail indicates a buffer overrun. – user895643 Aug 15 '11 at 22:15
  • We could not get it to run under Valgrind, but we found from the Coverity static analysis report that we were overwriting an array created locally in the function. We fixed the array size issue and the problem is now gone. Thank you very much. – user895643 Aug 17 '11 at 19:19
  • We compile with the -fstack-protector option, which is what forced the core dump on the overrun; without this option the core dump would not have been forced. We ran tests with and without it. Some notes on this option are here: [GCC extension for protecting applications from stack-smashing attacks](http://www.trl.ibm.com/projects/security/ssp/) – user895643 Aug 18 '11 at 14:34
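
The kind of bug confirmed in the comments above, a write past the end of an array created locally in the function, can be sketched as follows. This is only an illustration under stated assumptions: `fill_and_sum`, `overrun_is_caught`, and `kCapacity` are hypothetical names, and the real function was never posted. The sketch uses `std::array::at()`, which throws `std::out_of_range` on a bad index instead of silently corrupting the stack.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Hypothetical capacity of the local buffer.
constexpr std::size_t kCapacity = 8;

// Hypothetical stand-in for the real function: fills a local fixed-size
// buffer with `count` values and returns their sum.
int fill_and_sum(std::size_t count) {
    std::array<int, kCapacity> buf{};   // local array, lives on the stack

    for (std::size_t i = 0; i < count; ++i) {
        // An unchecked buf[i] would write past the end when
        // count > kCapacity and corrupt the stack frame.
        // at() throws std::out_of_range instead.
        buf.at(i) = static_cast<int>(i);
    }

    int sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += buf.at(i);
    }
    return sum;
}

// Returns true if fill_and_sum rejected the request rather than overrunning.
bool overrun_is_caught(std::size_t count) {
    try {
        fill_and_sum(count);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With an unchecked index and a build using -fstack-protector, as described in the thread, an overrun like this would typically surface as an abort via __stack_chk_fail when the function returns; the checked version reports the bad index at the point of the write instead.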