15

I am working on a multithreaded process written in C++, and am considering modifying SIGSEGV handling using google-coredumper to keep the process alive when a segmentation fault occurs.

However, this use of google-coredumper seems rife with opportunities to get stuck in an infinite loop of core dumps unless I somehow reinitialize the thread and the object that may have caused the core dump.
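Roughly, the handler pattern I have in mind looks like this (just a sketch; WriteCoreDump() is the entry point advertised on the project page, and the recovery step is the part I don't know how to do safely):

#include <signal.h>
#include <google/coredumper.h>   /* WriteCoreDump(), per the project page */

static void segv_handler(int sig)
{
    (void)sig;
    WriteCoreDump("core.snapshot");   /* dump core without terminating */
    /* ...somehow reinitialize the faulting thread/object here...
       If the handler simply returns, the faulting instruction is retried
       and we dump core again -- the infinite loop I'm worried about. */
}

int main(void)
{
    signal(SIGSEGV, segv_handler);
    /* ...rest of the application... */
    return 0;
}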

What best practices should I keep in mind when trying to keep a process alive through a core dump? What other 'gotchas' should I be aware of?

Thanks!

Raedwald
Sam
  • You cannot in general "keep the process alive". A segmentation fault occurs **after** the program has already been put into an indeterminate, invalid state; there is *no way* you can continue from that state. – Kerrek SB Dec 06 '11 at 14:50
  • I may have misinterpreted the project description's intent, but it says "The coredumper library can be compiled into applications to create core dumps of the running program -- without terminating." I'm not sure what 'without terminating' buys you unless it was to attempt to recover. – Sam Dec 06 '11 at 15:05
  • Presumably, your application isn't supposed to crash! You can make coredumps of a live process, but that doesn't magically repair bugs for you. – Kerrek SB Dec 06 '11 at 15:08
  • Sam, that library is for creating core dumps *on demand*. It's not meant to suppress the crash that comes from an unhandled signal. You need to do that part yourself (but only under circumstances where it makes sense, and an unexpected segmentation fault is not one of those circumstances). The library is for generating a snapshot of your process so you can compare it with a later snapshot. – Rob Kennedy Dec 06 '11 at 15:22
  • I'll just leave this here as food for thought: http://feepingcreature.github.com/handling.html – however, I'd say this technique is best used to show the user a more understandable interface and some sensible error information, and then exit the process. – datenwolf Nov 26 '12 at 02:03

6 Answers

41

It is actually possible in C. You can achieve it in quite a complicated way:

1) Override the signal handler

2) Use setjmp() and longjmp() to mark the place to jump back to, and to actually jump back there.

Check out this code I wrote (idea taken from "Expert C Programming: Deep C Secrets" by Peter van der Linden):

#include <signal.h>
#include <stdio.h>
#include <setjmp.h>

/* Global jmp_buf shared by main() and the signal handler. */
jmp_buf buf;

void magic_handler(int s)
{
    switch (s)
    {
        case SIGSEGV:
            printf("\nSegmentation fault signal caught! Attempting recovery..");
            longjmp(buf, 1);
            break;
    }

    printf("\nAfter switch. Won't be reached");
}

int main(void)
{
    int *p = NULL;

    signal(SIGSEGV, magic_handler);

    if (!setjmp(buf))
    {
        /* Dereferencing a null pointer causes a segmentation fault,
           which is now handled by magic_handler. */
        *p = 0xdead;
    }
    else
    {
        printf("\nSuccessfully recovered! Welcome back in main!!\n\n");
    }

    return 0;
}
Steve Czetty
user1735527
  • Using `longjmp` and friends does let you isolate in which lines of code the segfault occurred, but it doesn't let you isolate what memory might have been corrupted. – Praxeolitic Jan 04 '15 at 19:22
  • What it does do, however, is provide for the possibility that the process (which may be responsible for something important in the real world) can hobble along for at least just a little bit longer. – Steven Lu Jun 13 '15 at 17:41
  • I have found that this method works just well enough to crash gracefully, but the OS will forcibly kill the process very soon after. – Russell Hankins Sep 14 '18 at 01:36
  • Fun fact: the JVM uses segfaults to stop the world (before garbage collection). – JAre Feb 20 '21 at 22:19
17

The best practice is to fix the original issue causing the core dump, recompile and then relaunch the application.

To catch these errors before deploying in the wild, do plenty of peer review and write lots of tests.

Robert Houghton
parapura rajkumar
  • Which is a good idea, but sometimes this application controls e.g. a scientific camera that is likely to be damaged if it heats up uncontrolled. Going home and recompiling is not an option in this case. Horrible hardware design, I know, but I have seen this in the wild, and it is absolutely crucial to recover from fatal signals in this kind of situation. – thiton Dec 06 '11 at 14:56
6

Steve's answer is actually a very useful formula. I've used something similar in a piece of complicated embedded software where there was at least one SIGSEGV error in the code that we could not track down by ship time. As long as you can reset your code to have no ill effects (no memory or resource leaks), and the error is not something that causes an endless loop, it can be a lifesaver (even though it's better to fix the bug). FYI, in our case it was single-threaded.

But what is left out is that once you recover via your signal handler, the handler will not fire again unless you unblock the signal. Here is a chunk of code to do that:

sigset_t signal_set;
...
setjmp(buf);                      /* longjmp() from the handler lands here */
sigemptyset(&signal_set);
sigaddset(&signal_set, SIGSEGV);
/* SIGSEGV was blocked on entry to the handler, and plain longjmp() need
   not restore the signal mask, so unblock it explicitly: */
sigprocmask(SIG_UNBLOCK, &signal_set, NULL);
// Initialize all variables...

Be sure to free your memory, sockets, and other resources, or you could leak them each time this happens.
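If your platform allows it, a minimal self-contained alternative (assuming POSIX) is sigsetjmp()/siglongjmp(): called with a nonzero savesigs argument, sigsetjmp() saves the current signal mask, and siglongjmp() restores it, so the unblocking happens automatically:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <setjmp.h>

static sigjmp_buf buf;

static void handler(int s)
{
    (void)s;
    siglongjmp(buf, 1);  /* also restores the mask saved by sigsetjmp() */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigaction(SIGSEGV, &sa, NULL);

    volatile int *p = NULL;
    if (sigsetjmp(buf, 1) == 0)  /* 1 => save the current signal mask */
    {
        *p = 1;  /* fault; handler jumps back with the mask restored */
    }
    else
    {
        puts("Recovered; SIGSEGV is no longer blocked.");
    }
    return 0;
}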

mycal
5

My experience with segmentation faults is that it's very hard to catch them portably, and to do it portably in a multithreaded context is next to impossible.

This is for good reason: Do you really expect the memory (which your threads share) to be intact after a SIGSEGV? After all, you've just proven that some addressing is broken, so the assumption that the rest of the memory space is clean is pretty optimistic.

Think about a different concurrency model, e.g. one based on processes. Processes don't share their memory, or share only a well-defined part of it (shared memory), and one process can reasonably carry on when another process has died. When a part of the program is critical (e.g. the core temperature control), putting it in a separate process protects it from memory corruption and segmentation faults elsewhere, as in the sketch below.
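A minimal sketch of that model (assuming POSIX; run_worker() is a placeholder for whatever the critical part actually does): a supervisor fork()s the worker and restarts it when it dies of SIGSEGV, while the supervisor's own memory is never touched by the crash.

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the critical work, e.g. the temperature control loop. */
static void run_worker(void)
{
    /* ...do the real work, exchanging data via pipes or shared memory... */
    _exit(0);
}

int main(void)
{
    for (;;)
    {
        pid_t pid = fork();
        if (pid == 0)
        {
            run_worker();  /* child: do the critical work */
        }
        else if (pid > 0)
        {
            int status;
            waitpid(pid, &status, 0);
            if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
            {
                fprintf(stderr, "worker died of SIGSEGV; restarting\n");
                continue;  /* the supervisor's memory was never at risk */
            }
            break;  /* worker exited cleanly */
        }
        else
        {
            perror("fork");
            return 1;
        }
    }
    return 0;
}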

thiton
4

If a segmentation fault occurs, you're better off just ditching the process. How can you know that any of your process's memory is usable after this? If something in your program is messing with memory it shouldn't, why do you believe it didn't mess with some other part of memory that your process actually can access without segfaulting?

I think that doing this will mostly benefit attackers.

R. Martinho Fernandes
  • Usually this is true, but it could still be useful in middleware or some very specific software such as interpreters or JIT compilers. – Netherwire Feb 16 '20 at 09:57
1

From the description of coredumper, it seems its purpose is not what you intend: it just allows making snapshots of a running process's memory.

Personally, I wouldn't keep a process alive after it has triggered a core dump -- there are just so many ways it could be broken -- and would instead employ some form of persistence to allow data recovery after the process is restarted.

And, yes, as parapura has suggested: better yet, find out what is causing the SIGSEGV and fix it.

Victor Sorokin