
I have some production-critical code that has to keep running.

Think of the code as:

while (true){
   init();
   do_important_things();  //segfault here
   clean();
}

I can't trust the code to be bug-free, and I need to be able to log problems to investigate later.

This time, I know for a fact that a segmentation fault occurs somewhere in the code, and I need to be able to at least log it and then start everything over.

Reading here, there are a few solutions, but each one is followed by a flame war claiming that the solution will actually do more harm than good, with no real explanation. I also found this answer, which I am considering using, but I'm not sure it suits my use case.

So, what is the best way to recover from a segmentation fault in C++?

Gulzar
    Create a process that monitors this program and restarts it if it dies. – Ted Lyngmo Dec 05 '21 at 12:25
    If you make your program a daemon process, or a service, the Operating System itself can restart it for you if it dies. – Galik Dec 05 '21 at 12:27
    I think you need to frame challenge yourself a bit. A segmentation fault means that the OS recognized your process as messing up a resource it handed out to you. You don't know how you messed it up to go about cleaning it, and you don't know if the OS can recover from it. How on earth can your process just assume it may log and carry on? It's now running on quicksand. Restarting the process is the only way to get a proper fresh start. Just make sure to store a core file and examine it later. – StoryTeller - Unslander Monica Dec 05 '21 at 12:52
    Do NOT ever recover from a segmentation fault. Fix the real problem. – Pepijn Kramer Dec 05 '21 at 13:18
  • [Bugs aren't recoverable errors!](https://joeduffyblog.com/2016/02/07/the-error-model/#bugs-arent-recoverable-errors) is just as true in C++ as it is in M#. – Eljay Dec 05 '21 at 13:56
    It is meant to be helpful; I have years and years of experience in industry. Leaving a segmentation fault in is sweeping a problem under the rug, and it is going to bite you one day. Be sure you have a test environment in which you can reproduce this problem, generate crash dumps for debug builds, that kind of thing. But don't let the segfault stay in the code. The only real backup option is restarting the program. But in control systems this can leave hardware in an undefined state... so you need more recovery algorithms... and... and... as said, try to find the cause of the segfault ASAP. – Pepijn Kramer Dec 05 '21 at 14:54
  • @PepijnKramer I know finding the bug is important and best. But in a large system, zero bugs can't be guaranteed; in fact, it can't happen. Thus, I need a mitigation. Thanks, I will do my best to find that bug :) – Gulzar Dec 05 '21 at 15:16
    While all programs will contain some bugs, not all bugs are equal. Programs are expected to produce accurate, predictable results. Segmentation faults are particularly severe in that they are usually considered unrecoverable. Triggering a segfault means that data was read from, or written to, the wrong place. From that moment forward, you can't trust the integrity of the data your program is working on. A lot of people prefer a program to *terminate* rather than continue, producing false, potentially misleading, information. – Galik Dec 05 '21 at 19:57

1 Answer


I suggest that you create a very small, really safe program that monitors the buggy program. If the buggy program exits in a way you don't like, the monitor restarts it.

POSIX example:

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <cstdio>
#include <iostream>

int main(int argc, char* argv[]) {
    if(argc < 2) {
        std::cerr << "USAGE: " << argv[0] << " program_to_monitor <arguments...>\n";
        return 1;
    }

    while(true) {
        pid_t child = fork();          // create a child process

        if(child == -1) {
            std::perror("fork");
            return 1;
        }

        if(child == 0) {
            execvp(argv[1], argv + 1); // start the buggy program
            std::perror(argv[1]);      // starting failed
            _exit(0);                  // exit with 0 to not trigger a retry; skips flushing inherited stdio buffers
        }

        // Wait for the buggy program to terminate and check the status
        // to see if it should be restarted.

        if(int wstatus; waitpid(child, &wstatus, 0) != -1) {
            if(WIFEXITED(wstatus)) {
                if(WEXITSTATUS(wstatus) == 0) return 0; // normal exit, terminate

                std::cerr << argv[0] << ": " << argv[1] << " exited with "
                          << WEXITSTATUS(wstatus) << '\n';
            }
            if(WIFSIGNALED(wstatus)) {
                std::cerr << argv[0] << ": " << argv[1]
                          << " terminated by signal " << WTERMSIG(wstatus);
                if(WCOREDUMP(wstatus)) std::cerr << " (core dumped)"; // same stream as the rest of the message
                std::cerr << '\n';
            }
            std::cout << argv[0] << ": Restarting " << argv[1] << '\n';
        } else {
            std::perror("wait");
            break;
        }
    }
}
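
Since the goal is to log what happened, the status checks above could also be turned into a readable log line. The following is only a sketch, assuming a POSIX system that provides `strsignal`. Both `describe_status` and `run_crashing_child` are hypothetical names introduced for illustration, and `run_crashing_child` merely stands in for the buggy program by raising `SIGSEGV` in a child process.

```cpp
#include <string.h>     // strsignal (POSIX)
#include <sys/types.h>  // pid_t
#include <sys/wait.h>   // waitpid, WIFEXITED, WIFSIGNALED, ...
#include <unistd.h>     // fork, _exit

#include <csignal>      // SIGSEGV, std::signal, std::raise
#include <string>

// Hypothetical helper: turn a waitpid() status into text suitable for a log.
std::string describe_status(int wstatus) {
    if (WIFEXITED(wstatus)) {
        return "exited with " + std::to_string(WEXITSTATUS(wstatus));
    }
    if (WIFSIGNALED(wstatus)) {
        const int sig = WTERMSIG(wstatus);
        std::string text = "terminated by signal " + std::to_string(sig);
        if (const char* name = strsignal(sig)) {
            text += std::string(" (") + name + ")";  // wording is platform dependent
        }
        return text;
    }
    return "unknown status";
}

// Hypothetical stand-in for the buggy program: the child dies from SIGSEGV
// while the parent survives and receives the wait status.
// Returns false if fork() or waitpid() fails.
bool run_crashing_child(int& wstatus) {
    const pid_t child = fork();
    if (child == -1) return false;

    if (child == 0) {
        std::signal(SIGSEGV, SIG_DFL);  // make sure the default action applies
        std::raise(SIGSEGV);            // simulate the crash
        _exit(1);                       // only reached if the signal did not terminate us
    }
    return waitpid(child, &wstatus, 0) != -1;
}
```

In the monitor, the text returned by `describe_status(wstatus)` could be written to a log file together with a timestamp before the restart. The parent keeps running after the child crashes because the two are separate processes.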
Ted Lyngmo
  • A _critical_ point is that the watchdog is a separate _process_, not a separate thread sharing memory. Different processes are isolated memory-wise [with minor irrelevant exceptions: shared memory mapped files, debuggers]. A watchdog inside the same process might be broken by the same bug that ended in a segmentation fault. – Pablo H Dec 09 '21 at 17:48
  • @PabloH Indeed. The watchdog should be totally isolated from the process(es) it's monitoring. Having a monitoring _thread_ is not going to do any good since a segfault will destroy the whole process (and all its threads with it). – Ted Lyngmo Dec 09 '21 at 18:02