How do you find out the cause of rare crashes that are caused by things that are not caught by try catch (access violation, divide by zero, etc.)?

Question

I am a .NET programmer who is starting to dabble in C++. In C# I would wrap the root function in a try/catch so that every exception is caught and its stack trace is saved. That tells me what caused the exception and significantly reduces the time spent debugging.

But in C++ some things (access violation, divide by zero, etc.) are not caught by try/catch. How do you deal with them, and how do you know which line of code caused the error?

For example, let's assume we have a program that has 1 million lines of code. It runs 24/7 and has no user interaction. Once a month it crashes because of something that is not caught by try/catch. How do you find out which line of code caused the crash?

Environment: Windows 10, MSVC.

https://stackoverflow.com/questions/20237201/best-way-to-have-crash-dumps-generated-when-processes-crash — Retired Ninja, Aug 08 '22 at 01:09
Typically a dump is written out by the operating system in `%LOCALAPPDATA%\CrashDumps`. You can load that into Visual Studio and at least find where it crashed, examine the threads running at the time, all the call stacks _etc_. Some bugs are more elusive than this, and you may need strategic logging and/or lots of trial and error, reading through code. Out of curiosity, do you _actually_ have a million-line C++ program running 24/7 unattended with no logging, crash reporting and no engineers with basic debugging skills? — paddy, Aug 08 '22 at 01:13
4 things: 1) Preliminary. Your build produces PDB file symbols for your native code EXE and DLL binaries. You save these off and don't lose these files. 2) Really good logging - so you can figure out what was going on the moment before the crash 3) Ability to collect crash dumps and analyze them later. Turn on crash dump collection for your EXE name. [Details here](https://docs.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps). Then use Windbg to diagnose the crash with that build's symbol files as explained in step 1. 4) Really good debugging skills all around. — selbie, Aug 08 '22 at 01:15
The first thing you'll want to do is figure out a method to make the crash occur quickly/on-demand.... perhaps some kind of torture-test. Without that, even if you collect a crash-dump, you'll have no way to know if (whatever changes you make to the codebase in response to the information you collected) actually fixed the bug or not, which means you're just as likely to start adding more bugs to the code as to fix existing bugs. — Jeremy Friesner, Aug 08 '22 at 01:15
You deal with them by figuring out which bugs cause them, and then fixing them. Unfortunately, there is no cookie-cutter, paint-by-numbers, step-by-step recipe to fix an arbitrary crash like that. This is always investigated and researched on a case-by-case basis. As one gains C++ experience, one also learns defensive programming techniques that make these kinds of bugs logically impossible; that's the best way to deal with them. — Sam Varshavchik, Aug 08 '22 at 01:22
*let's assume we have a program that has 1 million lines of code. It's running 24/7, has no user-interaction* -- Well, as a previous comment mentioned, the developers would be insane to write such a program and have no contingencies prepared to debug such a program. Access violations happen in small programs also -- let's assume that it is a program that has been distributed to thousands of customers, and an access violation occurs with one customer. How would you debug this? The same way you would attempt to debug a million line program -- logs, crash dumps, etc. — PaulMcKenzie, Aug 08 '22 at 01:22
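
The crash-dump route suggested in the comments above can also be set up from inside the process. The following is only a minimal sketch, assuming MSVC on Windows and linking against Dbghelp.lib; the names `write_minidump` and `install_crash_handler` and the file name `crash.dmp` are hypothetical and not from the original posts. It registers a top-level filter with `SetUnhandledExceptionFilter` and writes a minidump with `MiniDumpWriteDump`, which can then be opened in Visual Studio or WinDbg together with the matching PDB files.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <dbghelp.h>                      // MiniDumpWriteDump
#pragma comment(lib, "Dbghelp.lib")       // MSVC-specific way to link Dbghelp

// Called by the OS for exceptions that no handler claimed
// (access violation, integer divide by zero, ...).
static LONG WINAPI write_minidump(EXCEPTION_POINTERS* info) {
    // "crash.dmp" is a placeholder name; a real service would pick a unique path.
    HANDLE file = CreateFileA("crash.dmp", GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION mei;
        mei.ThreadId = GetCurrentThreadId();
        mei.ExceptionPointers = info;     // faulting context and exception record
        mei.ClientPointers = FALSE;
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &mei, nullptr, nullptr);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;     // let the process terminate afterwards
}
#endif

// Returns true if a handler was installed (Windows only in this sketch).
bool install_crash_handler() {
#ifdef _WIN32
    SetUnhandledExceptionFilter(write_minidump);
    return true;
#else
    return false;
#endif
}
```

Calling `install_crash_handler()` once at the start of `main` would be enough for this sketch. Writing the dump from inside the crashing process can itself fail if the process state is badly corrupted, so the operating system's own dump collection mentioned in the comments remains the more robust option.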

Something Something · Accepted Answer · 2022-08-08T01:49:54.643

C++ is meant to be a high-performance language, and checks are expensive. You can't run at C++ speeds and have all sorts of runtime checks at the same time. This is by design.
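
As a small illustration of that trade-off, here is a sketch comparing unchecked and checked element access on `std::vector`. The function names `read_unchecked` and `read_checked` are made up for this example. Only the checked version pays for a bounds test on every call, and only the checked version produces something a try/catch can see.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Unchecked: operator[] performs no bounds test, so an out-of-range index is
// undefined behavior. No C++ exception is thrown; the program may crash or
// silently read garbage.
int read_unchecked(const std::vector<int>& v, std::size_t i) {
    return v[i];
}

// Checked: at() tests the index on every call and throws std::out_of_range,
// which an ordinary try/catch can handle.
int read_checked(const std::vector<int>& v, std::size_t i, int fallback) {
    try {
        return v.at(i);
    } catch (const std::out_of_range&) {
        return fallback;
    }
}
```

With a three-element vector, `read_checked(v, 10, -1)` returns the fallback, while `read_unchecked(v, 10)` has no defined result at all.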

Running .NET this way is akin to running C++ in debug mode with sanitizers on. So if you want to run your application with all the information you can get, turn on debug mode in your CMake build and add sanitizers, at least the undefined-behavior and address sanitizers.

For Windows/MSVC, it seems that AddressSanitizer support was only added in 2021. You can check the announcement here: https://devblogs.microsoft.com/cppblog/addresssanitizer-asan-for-windows-with-msvc/

For Windows/mingw or Linux/* you can use GCC's and Clang's built-in sanitizers, which have largely the same usage and syntax.

To set your build to debug mode:

cd <builddir>
cmake -DCMAKE_BUILD_TYPE=debug <sourcedir>

To enable sanitizers, add this to your compiler command line: -fsanitize=address,undefined

One way to do that is to add it to your cmake build so altogether it becomes:

cmake -DCMAKE_BUILD_TYPE=debug \
      -DCMAKE_CXX_FLAGS_DEBUG_INIT="-fsanitize=address,undefined" \
      <sourcedir>

Then run your application binary as you normally do. When an issue is found, a meaningful message is printed along with a very informative stack trace.
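
To make this concrete for the divide-by-zero case from the question, here is a deliberately tiny sketch. The function name `per_item` is hypothetical. Integer division by zero is undefined behavior in C++, not an exception, so a try/catch around the call sees nothing. When the code is built with GCC or Clang and `-fsanitize=undefined`, the undefined-behavior sanitizer reports a division-by-zero runtime error with the file and line of the division.

```cpp
// Divides a total evenly across a number of items.
// If count is 0, the division is undefined behavior: no C++ exception is
// thrown, so a surrounding try/catch cannot intercept it. On x86 this
// typically terminates the process with a hardware fault instead.
int per_item(int total, int count) {
    return total / count;
}
```

A call such as `per_item(10, 2)` is fine; a call with `count == 0`, perhaps reached once a month through some rare input, is the kind of defect the sanitizer pins to a source line.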

Alternatively, you can make the sanitizer trap so that execution stops inside the debugger (gdb) and you can inspect the state live, but that only works with the undefined sanitizer. To do so, replace

-fsanitize=address,undefined

with

-fsanitize=address,undefined -fsanitize-undefined-trap-on-error -fsanitize-trap=undefined

For example, this code has a clear problem:

void doit( int* p ) {
        *p = 10;
}

int main() {
        int* ptr = nullptr;
        doit(ptr);
}

Compile it with optimizations and you get:

$ g++ -O3 test.cpp -o test
$ ./test
Segmentation fault (core dumped)

Not very informative. You can try running it inside the debugger, but there are no symbols to inspect.

$ g++ -O3 test.cpp -o test
$ gdb ./test
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
...
Reading symbols from ./test...
(No debugging symbols found in ./test)
(gdb) r
Starting program: /tmp/test 

Program received signal SIGSEGV, Segmentation fault.
0x0000555555555044 in main ()
(gdb)

That's not much help, so turn on debug symbols with -g3:

$ g++ -g3 test.cpp -o test
$ gdb ./test
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
  ...
Reading symbols from ./test...
(gdb) r
Starting program: /tmp/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
test.cpp:4:5: runtime error: store to null pointer of type 'int'

Program received signal SIGSEGV, Segmentation fault.
0x0000555555555259 in doit (p=0x0) at test.cpp:4
4               *p = 10;

Then you can inspect the variables:

(gdb) p p
$1 = (int *) 0x0

Now, turn on sanitizers to get even more messages without the debugger:

$ g++ -O0 -g3 test.cpp -fsanitize=address,undefined -o test
$ ./test
test.cpp:4:5: runtime error: store to null pointer of type 'int'
AddressSanitizer:DEADLYSIGNAL
=================================================================
==931717==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x563b7b66c259 bp 0x7fffd167c240 sp 0x7fffd167c230 T0)
==931717==The signal is caused by a WRITE memory access.
==931717==Hint: address points to the zero page.
    #0 0x563b7b66c258 in doit(int*) /tmp/test.cpp:4
    #1 0x563b7b66c281 in main /tmp/test.cpp:9
    #2 0x7f36164a9082 in __libc_start_main ../csu/libc-start.c:308
    #3 0x563b7b66c12d in _start (/tmp/test+0x112d)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /tmp/test.cpp:4 in doit(int*)
==931717==ABORTING

That is much better!
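
Once the report has pointed at the faulty line, the defensive-programming idea mentioned in the comments can turn the same defect into something a top-level try/catch does see. This is only a sketch under that assumption; `doit_checked` and `run_guarded` are hypothetical names, not part of the original example.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Same job as doit(), but the precondition is checked explicitly, so a null
// pointer becomes a C++ exception instead of an access violation.
void doit_checked(int* p) {
    if (p == nullptr) {
        throw std::invalid_argument("doit_checked: null pointer");
    }
    *p = 10;
}

// Stand-in for the "root function in a try/catch" pattern from the question:
// returns "ok" on success, or the exception message on failure.
std::string run_guarded(int* p) {
    try {
        doit_checked(p);
        return "ok";
    } catch (const std::exception& e) {
        return e.what();
    }
}
```

The explicit check costs a comparison on every call, which is the trade-off described at the top of this answer, so it is usually reserved for boundaries where bad input can realistically arrive.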
