
Today I realized, to my horror, that my C++ simulation program had crashed after running for 12 days, just a few lines before its end, leaving me with nothing but a (truncated) core dump.

Analysis of the core dump with gdb revealed that the

Program terminated with signal SIGBUS, Bus error.

and that the crash occurred at the following line of my code:

seconds = std::difftime(stopTime, startTime); // seconds is of type double

The variables stopTime and startTime are of type std::time_t and I was able to extract their values at crash time from the core dump:

startTime: 1426863332
stopTime:  1427977226

The stack trace above the difftime-call looks like this:

#0  0x.. in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1  0x.. in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2

I wrote a small program to reproduce the error, but without success. Just calling std::difftime(stopTime, startTime) with the above values does not cause a SIGBUS crash. Of course, I don't want that to happen again. I have successfully executed the same program several times before (although with different arguments) with comparable execution times. What could cause this problem and how can I prevent it in the future?
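For reference, a reproduction attempt along these lines can be sketched as follows. This is a minimal sketch, not the original test program: the helper name elapsedSeconds is made up for illustration, and only the two time values come from the core dump.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper (not part of the original program): the same
// std::difftime call as in CStopwatch::getTime, isolated from the class.
double elapsedSeconds(std::time_t startTime, std::time_t stopTime)
{
    // Assumption of this sketch: the stopwatch was stopped after it was started.
    assert(stopTime >= startTime);
    return std::difftime(stopTime, startTime);
}
```

Calling elapsedSeconds(1426863332, 1427977226) yields 1113894 seconds, roughly 12.9 days, which matches the run time of the crashed simulation. The values themselves are therefore plausible.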

Here is some additional system information.

GCC: (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]  
Linux Kernel: 3.11.10-25-desktop, x86_64
C++ standard library: 6.0.18

Edit

Here is some more context. First, the complete stack trace (ellipsis [..] mine):

#0  0x00007f309a4a5bca in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#1  0x00007f309a4ac195 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#2  0x0000000000465453 in CStopwatch::getTime (this=0x7fff0db48c60, delimiterHourMinuteSecondsBy="") at [..] CStopwatch.cpp:86
#3  0x00000000004652a9 in CStopwatch::stop (this=0x7fff0db48c60) at [..] CStopwatch.cpp:51
#4  0x0000000000479a0c in main (argc=33, argv=0x7fff0db499c8) at [..] coherent_ofdm_tsync_mse.cpp:998

The problem occurs in an object of class CStopwatch which is created at the beginning of the program. The stopwatch is started in main() at the very top. After the simulation is finished, the function CStopwatch::stop( ) is called.

The constructor of the stopwatch class:

/*
 * Initialize start and stop time on construction
 */
CStopwatch::CStopwatch()
{
  this->startTime = std::time_t( 0 );
  this->stopTime = std::time_t( 0 );
  this->isRunning = false;
}

The function CStopwatch::stop( ):

/*
 * Stop the timer and return the elapsed time as a string
 */
std::string CStopwatch::stop( )
{
  if ( this->isRunning ) {
    this->stopTime = std::time( 0 );
  }
  this->isRunning = false;

  return getTime( );
}

The function CStopwatch::getTime():

/*
 * Return the elapsed time as a string
 */
std::string CStopwatch::getTime( std::string delimiterHourMinuteSecondsBy )
{
  std::ostringstream timeString;

  // ...some string init

  // time in seconds
  double seconds;

  if ( this->isRunning ){
    // return instantaneous time
    seconds = std::difftime(std::time(0), startTime);
  } else {
    // return stopped time
    seconds = std::difftime(stopTime, startTime); // <-- line where the
                                                  // program crashed
  }

  // ..convert seconds into a string

  return timeString.str( );
}

At the beginning of the program, CStopwatch::start( ) is called:

/*
 * Start the timer, if watch is already running, this is effectively a reset
 */
void CStopwatch::start( )
{
  this->startTime = std::time( 0 );
  this->isRunning = true;
}
Deve
  • Is that the complete stacktrace? Is `seconds` a member variable? If it is, what is the value of `this` then? Is it a pointer to a valid object? When and where do you call the function? Can you show some more context? – Some programmer dude Apr 02 '15 at 14:14
  • @JoachimPileborg No, the `#2` is the line of code I have provided and the rest are some calls inside my source code which I think cannot be understood without (large) context. – Deve Apr 02 '15 at 14:16
  • Well the problem is most likely not in the library code, which leaves your code as the cause of the crash. You need to provide more information and context about the surrounding code. – Some programmer dude Apr 02 '15 at 14:16
  • @JoachimPileborg I tried to add some hopefully relevant code – Deve Apr 02 '15 at 14:37

3 Answers


There are only a few reasons that a program may receive SIGBUS on Linux. Several are listed in answers to this question.

Look in /var/log/messages around the time of the crash; it is likely that you'll find there was a disk failure, or some other cause for kernel unhappiness.

Another (unlikely) possibility is that someone updated libstdc++.so.6 while your program was running, and did so incorrectly (by writing over the existing file, rather than removing it and creating a new file in its place).
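To illustrate why writing over a mapped file is dangerous, here is a minimal sketch for Linux/POSIX. It is not taken from the program in the question, and the function and variable names are made up. It maps a temporary file, truncates it (which is what rewriting a file in place typically does first), and then touches the mapping. The access should raise SIGBUS, which the sketch catches so that it can report the result instead of dying.

```cpp
#include <cassert>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbusJump;

// Signal handler: jump back to the sigsetjmp point instead of terminating.
extern "C" void onSigbus(int) { siglongjmp(sigbusJump, 1); }

// Hypothetical demo function: returns true if reading a mapping whose
// backing file was truncated raised SIGBUS, false on setup failure or
// if no signal was delivered.
bool truncatedMappingRaisesSigbus()
{
    char path[] = "/tmp/sigbus_demo_XXXXXX";
    const int fd = mkstemp(path);
    if (fd < 0) return false;
    unlink(path);  // the file disappears once fd is closed

    const long pageSize = sysconf(_SC_PAGESIZE);
    assert(pageSize > 0);
    if (ftruncate(fd, pageSize) != 0) { close(fd); return false; }

    void* mapping = mmap(0, pageSize, PROT_READ, MAP_SHARED, fd, 0);
    if (mapping == MAP_FAILED) { close(fd); return false; }

    // Shrink the file underneath the mapping, as an in-place overwrite would.
    if (ftruncate(fd, 0) != 0) {
        munmap(mapping, pageSize);
        close(fd);
        return false;
    }

    struct sigaction action = {};
    struct sigaction previous = {};
    action.sa_handler = onSigbus;
    sigemptyset(&action.sa_mask);
    sigaction(SIGBUS, &action, &previous);

    volatile bool caught = false;
    if (sigsetjmp(sigbusJump, 1) == 0) {
        // The page no longer has file contents behind it.
        volatile char byte = *static_cast<volatile char*>(mapping);
        (void)byte;
    } else {
        caught = true;  // reached via siglongjmp from the handler
    }

    sigaction(SIGBUS, &previous, 0);  // restore the previous handler
    munmap(mapping, pageSize);
    close(fd);
    return caught;
}
```

A shared library is mapped into the process in much the same way, so if its file is rewritten in place, a page that has not been touched yet (for example the code behind a function that is called for the first time) can fault like this.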

Community
Employed Russian

It looks like std::difftime is being lazily resolved by the runtime linker on its first call; if some of the runtime linker's internal state had been damaged elsewhere in your program, it could cause this.

Note that _dl_runtime_resolve would have to complete before the std::difftime call can begin, so the error is unlikely to be with your time values. You can easily verify by opening the core file in gdb:

(gdb) frame 2 # this is CStopwatch::getTime
(gdb) print this
(gdb) print *this

etc. etc.

If gdb can read and resolve the address, and the values look sane, then that access did not cause the SIGBUS at runtime. Alternatively, it's possible your stack is smashed, for instance if _dl_fixup is preparing the trampoline jump rather than just handling the relocation. We can't be certain without looking at the code, but we can check the stack itself:

(gdb) print $rsp
(gdb) x/16xb $rsp # print the top 16 bytes of the stack

The easy workaround to try is setting the LD_BIND_NOW environment variable, which forces symbol resolution at startup. This only hides the symptom, though: some memory is still getting damaged somewhere.
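A similar effect can be had from inside the program by calling the function once at startup, so that the lazy binding happens in the first second rather than after 12 days. The following is a minimal sketch: the helper name warmUpDifftime is made up, it assumes the binary uses lazy binding, and like LD_BIND_NOW it only moves the point of failure and repairs nothing.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical helper, not part of the original CStopwatch class.
// The first call to std::difftime makes the dynamic linker resolve the
// symbol; later calls then jump straight to the resolved address.
double warmUpDifftime()
{
    const std::time_t now = std::time(0);
    const double zero = std::difftime(now, now);
    assert(zero == 0.0);  // identical arguments, so the difference is zero
    return zero;
}
```

It could be called at the top of main(), or from CStopwatch::start( ). Linking with -Wl,-z,now should give the same eager binding for all symbols without changing any code.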

As for fixing the problem properly - even if short runs don't exhibit the error, it's possible some memory damage is occurring but is asymptomatic. Try running a shorter simulation under valgrind and fix all warnings and errors unless you're certain they're benign.

Useless

Impossible to tell without further context, but:

  • this could be null or corrupt
  • startTime could be a null reference
  • stopTime could be a null reference

I was going to suggest you set a breakpoint on the line and print out stopTime and startTime, but you've already nearly done that by looking at the core file.

It looks as if something is going wrong linking the function in. Might it be that you are compiling against a different set of headers from the standard library you are linking to?
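One rough way to compare the headers on two machines is to print the libstdc++ datestamp macro on each of them. This is a minimal sketch with a made-up function name. It assumes GCC's libstdc++, whose headers define __GLIBCXX__, and it reports the headers used at compile time, not the shared library found at run time.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns the datestamp of the libstdc++ headers this
// translation unit was compiled against, or a note if another standard
// library is in use.
std::string libstdcxxHeaderStamp()
{
    std::ostringstream out;
#ifdef __GLIBCXX__
    out << __GLIBCXX__;  // a release datestamp in YYYYMMDD form
#else
    out << "not libstdc++";
#endif
    const std::string stamp = out.str();
    assert(!stamp.empty());  // both branches write something
    return stamp;
}
```

If the build machine and the target machine print different values, the two installations differ.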

It may just be memory related:

  • if this is deeply nested, you might simply have a stack overflow.
  • if this is the first time it's being called, perhaps it is trying to allocate memory for the library, load it in, and link it, and that failed due to hitting a memory limit

If this code path is called many many times, and never crashes elsewhere, maybe it's time to run memtest86 overnight.

Lightness Races in Orbit
abligh
  • "Might it be that you are compiling against a different set of headers from the standard library you are linking to?" - Could be, yes. I definitely compiled the program on a different machine from the one where it crashed. They have the same GCC version, though. I'll check that – Deve Apr 03 '15 at 07:37
  • Turns out the program was compiled against version 6.0.19 of the standard library. The system where it crashed has version 6.0.18 of `libstdc++`. If this is the reason for the crash (which I still have to verify) can I solve it by linking the library statically when compiling? – Deve Apr 03 '15 at 07:55
  • I had thought it was only running on a different major version that would cause that, but yes, you can static link. Run `ldd -v` against the binary to check you removed the dependency. – abligh Apr 03 '15 at 08:35