Why is a segmentation fault not recoverable?

Question

Following a previous question of mine, most comments say "just don't, you are in a limbo state, you have to kill everything and start over". There is also a "safeish" workaround.

What I fail to understand is why a segmentation fault is inherently nonrecoverable.

The moment of writing to protected memory must be caught; otherwise, the SIGSEGV would not be sent.

If the moment of writing to protected memory can be caught, I don't see why - in theory - it can't be reverted, at some low level, and have the SIGSEGV converted to a standard software exception.

Please explain why after a segmentation fault the program is in an undetermined state, as very obviously, the fault is thrown before memory was actually changed (I am probably wrong and don't see why). Had it been thrown after, one could create a program that changes protected memory, one byte at a time, getting segmentation faults, and eventually reprogramming the kernel - a security risk that is not present, as we can see the world still stands.

When exactly does a segmentation fault happen (= when is SIGSEGV sent)?
Why is the process in an undefined behavior state after that point?
Why is it not recoverable?
Why does this solution avoid that unrecoverable state? Does it even?

The problem is that in most cases a segmentation fault occurs because your program has overwritten memory it should not have, putting your program in some unknown state. E.g: you overwrite a buffer and on that occasion you corrupt the internal bookkeeping of the memory allocation functions such as `malloc` etc. Then somewhat later you call `malloc` which triggers a segfault because of the corruption mentioned before. Then what? It's somewhat like if you jump from a cliff in real life, you cannot recover from that, once you've jumped it's too late. — Jabberwocky, Dec 07 '21 at 10:32
[This](https://stackoverflow.com/questions/8401689/best-practices-for-recovering-from-a-segmentation-fault/12824168#12824168) doesn't recover anything. If you take the example of my previous comment, it just gives you an illusion of recovery. The internal bookkeeping will still be corrupted and the next call to `malloc` will most likely trigger another segfault. — Jabberwocky, Dec 07 '21 at 10:35
Given that it is a coding error or a range error which caused the fault, usually the only recovery possible is a graceful exit, rather than a crash. But if the fault was a stack overflow, that limits the recovery options anyway. The situation is similar to a divide-by-zero error: if there is a good recovery strategy (apart from orderly shutdown), then it should be detected and implemented *before* the fault happens, not left until afterwards. — Weather Vane, Dec 07 '21 at 10:41
@Jabberwocky I explained in the question why it makes no sense that a user-space program could corrupt (=write to) protected memory. 1. If it could, it is a serious, obvious security threat to any operating system, which I find hard to believe exists. 2. It is immediately caught (by the OS?), so why can't it be also immediately reverted, even on assembly level? 3. Please specify exactly the unrecoverable process. The cliff metaphor is fine, but I need the juicy details. — Gulzar, Dec 07 '21 at 10:42
Recovery from a (say) division by zero should be left to the program. Why not convert to a DivisionbyZeroException as if it is thrown BEFORE the actual division? — Gulzar, Dec 07 '21 at 10:44
I disagree: recovering from a fault isn't a viable runtime option for a release version of code out in the market. It should never happen except for exceptional faults like device failure, and is only of real interest at the development / debugging stage. If there is *any* possibility that you will divide by zero, the program should deal with the incorrect data at the earliest opportunity, as part of the algorithm. Anyway, it is so much easier to write the preemptive code than it is to implement a retro-fix. — Weather Vane, Dec 07 '21 at 10:49
The fact is that your program did something to some memory based on an incorrect assumption. The moment that happened, the actual program state departed from the intended program state. In fact, the actual state was already divergent before then. All assumptions from there on about program state cannot be trusted. Terminating on protected memory violation is a great way to prevent further damage from occurring, and while not failsafe is a pretty good early indicator of things going awry. — paddy, Dec 07 '21 at 10:52
**2.** means that the program is - based on how the language is specified - in an undefined state. This does not necessarily mean that **3.** is true; it does, however, mean that you can't recover from that state based only on what the language provides. Knowing everything about the environment the program runs in would theoretically allow you to recover from that state. But that means you need to know everything about the OS, CPU, runtime library, compiler, ... . — t.niese, Dec 07 '21 at 10:54
@Gulzar my first comment contains an example. If the internal bookkeeping of `malloc` (which is not system memory, it's user memory) is corrupted, this cannot be recovered. — Jabberwocky, Dec 07 '21 at 10:55
Suppose you have something as `std::vector v(1); v[i]=-1;`, where `i` equals 32. It may not trigger segfault by itself (https://godbolt.org/z/sh8TW34j9), but it can corrupt the heap. Sometimes later, some heap function may end up with segfault. How would you recover from it if you don't know about the heap corruption at all at that moment? — Daniel Langr, Dec 07 '21 at 10:56
About **4.** I don't know if the comment below that answer is true (and for what OS), but it says `I have found that this method works just enough to help crash gracefully, but the OS will forcibly crash it very soon.` so shouldn't the question or the further investigation here be whether an OS might kill the application after a while even if `SIGSEGV` was caught? — t.niese, Dec 07 '21 at 11:08
If you need to recover from a segfault, the clean way to do it is launch the program from a parent “babysitter” process, so that when the child process segfaults, the parent process can react by spawning a new child process to replace it (and hopefully also log debugging details so that you can fix the underlying bug ASAP :) ) — Jeremy Friesner, Dec 07 '21 at 19:53
You should clarify that you're asking about unintentional segfaults caused by bugs, rather than intentional ones caused by exotic things like doing userspace paging. — Joseph Sible-Reinstate Monica, Dec 08 '21 at 07:21
It is recoverable. A long while ago (think very early Windows XP) I wrote a C++ SEGV handler that displayed an error message and did a longjmp back to the main loop. This was for a Windows program that let you work on documents, and it was frequently sufficient at least to let you save your document and quit. Over time I made it more sophisticated so that if the error occurred during rendering (meaning you could never get to the menu in time), it disabled rendering. Of course it didn't fix the underlying problem but was helpful to the user. This predated C++ exceptions (or we weren't using them) — abligh, Dec 08 '21 at 08:04
Faults are recoverable in general, it's just that SIGSEGV is usually an indication of a memory-corruption bug and those might not be recoverable. — avakar, Dec 08 '21 at 19:33
@Gulzar It might be useful to consider the opposite approach. In many core storage systems (and more generally, production services) it's actually appropriate to crash _on purpose_. If the invariants are being arbitrarily broken, who knows what corruption we might start persisting to disk or serving out. It's a better move to panic and abort at the moment you know a bug is occurring. (And this matches the answer in your prior question: you, in general, must be okay with your program arbitrarily stopping. Maybe someone mistakenly killed it? A daemon + alerts is the solution.) — GManNickG, Dec 08 '21 at 21:52
@GManNickG: People who push for notions of completely unconstrained "Undefined Behavior" ignore the fact that many programs are subject to two constraints: (1) Behave usefully when possible; (2) When unable to behave usefully, behave in *tolerably useless* fashion. There are many situations where multiple behaviors would be equally acceptable, and letting a compiler freely select among ways of processing useless cases could improve performance of code in useful cases, but only if the compiler limits itself to tolerably useless behaviors. — supercat, Dec 08 '21 at 23:17
Think of the problem longer term. In normal operation, a segfault should **never** occur. The best thing the program can do is crash, saving information about the fault in the process, so that the problem can be diagnosed and fixed. If the program tries to recover, it will probably crash again, and again, and again, and any information about the root cause of the problem will be lost. — DaveG, Dec 09 '21 at 16:13
@DaveG ...or even worse, some data might get corrupted in the process. — Jabberwocky, Dec 09 '21 at 16:19
@DaveG like I said, assuming no faults exists is non realistic. Also, it appears no logging can be done *safely* after a fault, maybe you can shed more light on `saving information about the fault in the process`? Finally, likely, it will not happen in production over and over, else it would have been caught in QA. Likely it would happen for some edge case, or after a long runtime. The suggestion to "just don't make bugs" never holds. — Gulzar, Dec 09 '21 at 16:19
@Gulzar _no logging can be done after a fault_: this is only partly true. 1. you can still log during the normal life of the program and when it crashes for some reason you still have the log until the crash which may be very helpful. 2. once the program has crashed you still may be able to dump some debug information, and then launch another process that logs that dump somewhere. You can do this with Windows for sure (I'm doing it) and certainly with other OSes too. — Jabberwocky, Dec 09 '21 at 16:24
@Gulzar preventing segmentation faults is entirely realistic. A segmentation fault should **never** be driven by the end user. It is the result of a coding error such as an incorrect buffer size. — DaveG, Dec 09 '21 at 18:00
@Gulzar, re: "saving information about the fault" -- think core dumps. The core dump handler is outside the faulted process and doesn't depend on its state being valid. — Charles Duffy, Dec 09 '21 at 22:12
From the perspective of C++ there doesn't exist a segfault, but undefined behaviour, which is standardese for "all bets are off". In real world applications, carefully constructed programs that intentionally create segfaults can recover from them, like for userspace copy-on-write mmap. So in terms of C++ the question might better be worded "why can't a program recover from undefined behaviour" — PlasmaHH, Dec 10 '21 at 14:27

Lundin · Accepted Answer · 2021-12-10T15:05:41.193

97

When exactly does segmentation fault happen (=when is SIGSEGV sent)?

When you attempt to access memory you don’t have access to, such as accessing an array out of bounds or dereferencing an invalid pointer. The signal SIGSEGV is standardized, but different OSes might implement it differently. "Segmentation fault" is mainly a term used in *nix systems; Windows calls it "access violation".

Why is the process in undefined behavior state after that point?

Because one or several of the variables in the program didn’t behave as expected. Let’s say you have some array that is supposed to store a number of values, but you didn’t allocate enough room for all of them. So only those you allocated room for get written correctly, and the rest, written out of bounds of the array, can hold any values. How exactly is the OS to know how critical those out-of-bounds values are for your application to function? It knows nothing of their purpose.

Furthermore, writing outside allowed memory can often corrupt other unrelated variables, which is obviously dangerous and can cause any random behavior. Such bugs are often hard to track down. Stack overflows, for example, are segmentation faults of this kind and are prone to overwriting adjacent variables, unless the error is caught by protection mechanisms.

If we look at the behavior of "bare metal" microcontroller systems without any OS and no virtual memory features, just raw physical memory - they will just silently do exactly as told - for example, overwriting unrelated variables and keep on going. Which in turn could cause disastrous behavior in case the application is mission-critical.

Why is it not recoverable?

Because the OS doesn’t know what your program is supposed to be doing.

Though in the "bare metal" scenario above, the system might be smart enough to place itself in a safe mode and keep going. Critical applications such as automotive and med-tech aren’t allowed to just stop or reset, as that in itself might be dangerous. They will rather try to "limp home" with limited functionality.

Why does this solution avoid that unrecoverable state? Does it even?

That solution just ignores the error and keeps on going. It doesn’t fix the problem that caused it. It’s a very dirty patch, and setjmp/longjmp in general are very dangerous functions that should be avoided for any purpose.

We have to realize that a segmentation fault is a symptom of a bug, not the cause.

edited Dec 10 '21 at 15:05

answered Dec 07 '21 at 10:52

Lundin


Thanks for the elaborate answer! One thing missing here, about mission critical systems (or any system): In large systems in production, one can't know where, or even if the segfaults are, so the recommendation to fix the bug and not the symptom does not hold. What would be a mitigation in case the system indeed has to die? Is there a way to at least log some information that would be trustworthy, before starting a new, clean process? – Gulzar Dec 07 '21 at 11:34
Please notice that I want to place some defence against faults, that can be trusted, without knowing the origin or nature of the faults. Fixing them is not an option, though it could be nice. – Gulzar Dec 07 '21 at 11:36
@Gulzar Usually you'll get some "core dump" or similar. But you could perhaps implement your own custom logging too by writing a signal handler to SIGSEGV as in the link you posted. As for defence against faults - how do you know the severity of unknown errors? – Lundin Dec 07 '21 at 11:43
In short-term practice, I think my errors are only access to null and nothing more. But the question is more general, and I am looking for some best-practice for production systems I have to deploy and monitor, and that have to keep serving despite of faults (caused by human error, which is not avoidable). I also want to protect my future self. – Gulzar Dec 07 '21 at 11:47
@Gulzar Implementing some manner of logging and/or error handler is always helpful. You could also make a habit of adding as much defensive programming and error checking as possible. Always check function results, do range checks on variables etc. – Lundin Dec 07 '21 at 11:54
@Gulzar for a production system, as you never know why the SIGSEGV actually occurred you probably don't want to continue with an application in that state. You instead want to write it in such a way that restarting an application in such an event would minimize the data loss. The problem is that you might assume that the SIGSEGV is unproblematic in your case, but you might have missed a certain error case resulting in an application that continues to run but generates strange or unpredictable results/behaviors. – t.niese Dec 07 '21 at 11:58
@t.niese *You instead want to write it in such a way that restarting an application in such an event would minimize the data loss.* Note that needs to be done anyway because there's no protection against a failure that causes sudden system shutdown - just about any OS or machine instance has multiple single-points-of-failure. And even if the software and hardware both are demonstrably robust, the local sysadmin isn't. It's real fun when a sysadmin accidentally runs `history | xargs rm -f -r ` as root on what's supposed to be a fault-tolerant system that will never go down - true story... – Andrew Henle Dec 07 '21 at 20:42
Point of pedantry: every realizable computer has at least one unrecoverable failure mode — physical destruction. Most of them also stop working real well when they no longer have any source of power… All of which falls under the rubric of "if you aren't having to worry about asteroid impacts, there's only so critical your system can possibly be." – Joel Aelwyn Dec 08 '21 at 04:02
This explanation is not really convincing to me. Why *should* OS care if you are overwriting variables? It's your program, it just does whatever you programmed it to. Perhaps that was a concern when programs did not have virtual memory assigned to them and there was no reason to remove the segfault mechanism later. Or perhaps it also protects you from creating self-modifying programs. But "overwriting variables" is a weak argument. – Yksisarvinen Dec 08 '21 at 10:26
@Yksisarvinen Because under the hood, virtual memory is handled by MMU hardware setup and application programmers usually don't have access to that. The OS just sits as a layer between your application and the MMU. It's common to have the MMU yell hardware exception when you try to execute code out of data segments or access code segments as if it was data. Also why would you ever want it to silently ignore accidental access of memory? The more diagnostics, the better, usually. – Lundin Dec 08 '21 at 10:41
@Yksisarvinen: Re: "Why *should* OS care if you are overwriting variables?": It shouldn't! The point is just that, *since* it doesn't, SIGSEGV means that you're doing something *so* wrong that *even the OS* can tell it's wrong . . . which probably means that your program state is already totally corrupt. – ruakh Dec 08 '21 at 18:41
@Yksisarvinen Because the OS is a tool for running programs. Sure, you may have intended your program to randomly stomp memory, but it's so radically unlikely as to be a negligible probability. In the 99.99-recurring% case, it's a bug you didn't intend. Preventing a buggy program writing outside its bounds ensures that one rogue program can't take down other programs, or indeed the entire OS. Regular BSODs on Win95/98 mostly came from this bug, and from Win95/98 having no such protection. – Graham Dec 09 '21 at 01:09
I fail to see why this argument should apply to segmentation faults but not apply to all the other standard exceptions that can be caught and handled during runtime. – Drake P Dec 09 '21 at 02:35
@DrakeP: Because a "user entered a floating point number instead of an integer" is an exception at a level which the OS can safely ignore. More in general, can you safely recover to a well-defined state? Even if that's not exactly the intended state, at least you can go forward from a known state. – MSalters Dec 09 '21 at 16:50
Downvoted because this includes incorrect / misleading information. – chub500 Dec 10 '21 at 15:50
For example, accessing an array out of bounds will very rarely result in a segfault. Also, while it is true that a segfault is always the result of dereferencing an invalid pointer, it is a subset of this specifically when the pointer points to a region of memory outside of the virtual address space of your process. – chub500 Dec 10 '21 at 15:53
It is even misleading to characterize segfaults as unrecoverable because in principle you can ignore the signal, but this only demonstrates the true nature of the problem: where in your code should the OS return back to your process after the bad page fault? – chub500 Dec 10 '21 at 15:56
There are a couple of cases where your program can be correct and still segfault. One is [overcommit](https://arstechnica.com/civis/viewtopic.php?f=20&t=1240341). The OS could tell you it has provided you with memory but then give you a segfault if you actually try to use it. If you could somehow identify the pointer in question you could recover by behaving as if the allocation had failed. Another trivial case is if you raise SIGSEGV yourself. I can think of no sensible reason to do this other than to unit test a SIGSEGV handler. – Bruce Adams May 09 '22 at 15:11

Chris Dodd · Answer 2 · 2021-12-08T23:02:19.987

Please explain why after a segmentation fault the program is in an undetermined state

I think this is your fundamental misunderstanding -- the SEGV does not cause the undetermined state, it is a symptom of it. So the problem is (generally) that the program is in an illegal, unrecoverable state WELL BEFORE the SIGSEGV occurs, and recovering from the SIGSEGV won't change that.

When exactly does segmentation fault happen (=when is SIGSEGV sent)?

The only standard way in which a SIGSEGV occurs is with the call raise(SIGSEGV);. If this is the source of a SIGSEGV, then it is obviously recoverable by using longjmp. But this is a trivial case that never happens in reality. There are platform-specific ways of doing things that might result in well-defined SEGVs (e.g., using mprotect on a POSIX system), and these SEGVs might be recoverable (but will likely require platform-specific recovery). However, the danger of undefined-behavior related SEGVs generally means that the signal handler will very carefully check the (platform-dependent) information that comes along with the signal to make sure it is something that is expected.

Why is the process in undefined behavior state after that point?

It was (generally) in undefined behavior state before that point; it just wasn't noticed. That's the big problem with Undefined Behavior in both C and C++ -- there's no specific behavior associated with it, so it might not be noticed right away.

Why does this solution avoid that unrecoverable state? Does it even?

It does not; it just goes back to some earlier point, but doesn't do anything to undo or even identify the undefined behavior that caused the problem.

Peter Cordes · Answer 3 · 2021-12-09T12:49:46.777

A segfault happens when your program tries to dereference a bad pointer. (See below for a more technical version of that, and other things that can segfault.) At that point, your program has already tripped over a bug that led to the pointer being bad; the attempt to deref it is often not the actual bug.

Unless you intentionally do some things that can segfault, and intend to catch and handle those cases (see section below), you won't know what got messed up by a bug in your program (or a cosmic ray flipping a bit) before a bad access actually faulted. (And this generally requires writing in asm, or running code you JITed yourself, not C or C++.)

C and C++ don't define the behaviour of programs that cause segmentation faults, so compilers don't make machine-code that anticipates attempted recovery. Even in a hand-written asm program, it wouldn't make sense to try unless you expected some kinds of segfaults; there's no sane way to try to truly recover. At most you should just print an error message before exiting.

If you mmap some new memory at whatever address the access was trying to access, or mprotect it from read-only to read+write (in a SIGSEGV handler), that can let the faulting instruction execute, but that's very unlikely to let execution resume usefully. Most read-only memory is read-only for a reason, and letting something write to it won't be helpful. And an attempt to read something through a pointer probably needed to get some specific data that's actually somewhere else (or to not be reading at all because there's nothing to read). So mapping a new page of zeros to that address will let execution continue, but not useful correct execution. Same for modifying the main thread's instruction pointer in a SIGSEGV handler, so it resumes after the faulting instruction. Then whatever load or store will just have not happened, using whatever garbage was previously in a register (for a load), or similar other results for CISC add reg, [mem] or whatever.

(The example you linked of catching SIGSEGV depends on the compiler generating machine code in the obvious way, and the setjmp/longjmp depends on knowing which code is going to segfault, and that it happened without first overwriting some valid memory, e.g. the stdout data structures that printf depends on, before getting to an unmapped page, like could happen with a loop or memcpy.)

Expected SIGSEGVs, for example a JIT sandbox

A JIT for a language like Java or Javascript (which don't have undefined behaviour) needs to handle null-pointer dereferences in a well-defined way, by (Java) throwing a NullPointerException in the guest machine.

Machine code implementing the logic of a Java program (created by a JIT compiler as part of a JVM) would need to check every reference at least once before using, in any case where it couldn't prove at JIT-compile time that it was non-null, if it wanted to avoid ever having the JITed code fault.

But that's expensive, so a JIT may eliminate some null-pointer checks by allowing faults to happen in the guest asm it generates, even though such a fault will first trap to the OS, and only then to the JVM's SIGSEGV handler.

If the JVM is careful in how it lays out the asm instructions it's generating, so any possible null pointer deref will happen at the right time wrt. side-effects on other data and only on paths of execution where it should happen (see @supercat's answer for an example), then this is valid. The JVM will have to catch SIGSEGV and longjmp or whatever out of the signal handler, to code that delivers a NullPointerException to the guest.

But the crucial part here is that the JVM is assuming its own code is bug-free, so the only state that's potentially "corrupt" is the guest's actual state, not the JVM's data about the guest. This means the JVM is able to process an exception happening in the guest without depending on data that's probably corrupt.

The guest itself probably can't do much, though, if it wasn't expecting a NullPointerException and thus doesn't specifically know how to repair the situation. It probably shouldn't do much more than print an error message and exit or restart itself. (Pretty much what a normal ahead-of-time-compiled C++ program is limited to.)

Of course the JVM needs to check the fault address of the SIGSEGV and find out exactly which guest code it was in, to know where to deliver the NullPointerException. (Which catch block, if any.) And if the fault address wasn't in JITed guest code at all, then the JVM is just like any other ahead-of-time-compiled C/C++ program that segfaulted, and shouldn't do much more than print an error message and exit. (Or raise(SIGABRT) to trigger a core dump.)

Being a JIT JVM doesn't make it any easier to recover from unexpected segfaults due to bugs in your own logic. The key thing is that there's a sandboxed guest which you're already making sure can't mess up the main program, and its faults aren't unexpected for the host JVM. (You can't allow "managed" code in the guest to have fully wild pointers that could be pointing anywhere, e.g. to guest code. But that's normally fine. But you can still have null pointers, using a representation that does in practice actually fault if hardware tries to deref it. That doesn't let it write or read the host's state.)

For more about this, see Why are segfaults called faults (and not aborts) if they are not recoverable? for an asm-level view of segfaults. And links to JIT techniques that let guest code page-fault instead of doing runtime checks:

Effective Null Pointer Check Elimination Utilizing Hardware Trap a research paper on this for Java, from three IBM scientists.
SableVM: 6.2.4 Hardware Support on Various Architectures about NULL pointer checks

A further trick is to put the end of an array at the end of a page (followed by a large-enough unmapped region), so bounds-checking on every access is done for free by the hardware. If you can statically prove the index is always positive, and that it can't be larger than 32 bit, you're all set.

Implicit Java Array Bounds Checking on 64-bit Architectures. They talk about what to do when array size isn't a multiple of the page size, and other caveats.

Background: what are segfaults

The usual reason for the OS delivering SIGSEGV is that your process triggered a page fault that the OS finds is "invalid". (I.e. it's your fault, not the OS's problem, so it can't fix it by paging in data that was swapped out to disk (hard page fault) or copy-on-write or zero a new anonymous page on first access (soft page fault), and updating the hardware page tables for that virtual page to match what your process logically has mapped.)

The page-fault handler can't repair the situation, normally because user-space hasn't asked the OS for any memory to be mapped to that virtual address. If it did just try to resume user-space without doing anything to the page table, the same instruction would just fault again, so instead the kernel delivers a SIGSEGV. The default action for that signal is to kill the process, but if user-space has installed a signal handler it can catch it.

Other reasons include (on Linux) trying to run a privileged instruction in user-space (e.g. an x86 #GP "General Protection Fault" hardware exception), or on x86 Linux a misaligned 16-byte SSE load or store (again a #GP exception). This can happen with manually-vectorized code using _mm_load_si128 instead of loadu, or even as a result of auto-vectorization in a program with undefined behaviour: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? (Some other OSes, e.g. MacOS / Darwin, deliver SIGBUS for misaligned SSE.)

Segfaults usually only happen after your program encountered a bug

So your program state is already messed up, that's why there was for example a NULL pointer where you expected one to be non-NULL, or otherwise invalid. (e.g. some forms of use-after free, or a pointer overwritten with some bits that don't represent a valid pointer.)

If you're lucky it will segfault and fail early and noisily, as close as possible to the actual bug; if you're unlucky (e.g. corrupting malloc bookkeeping info) you won't actually segfault until long after the buggy code executed.

Unaligned accesses give SIGBUS on most POSIX systems -- Linux on x86 is a weird outlier here. Privileged instructions usually give a SIGILL. — Chris Dodd, Dec 24 '21 at 01:28

score 21 · Answer 4 · answered Dec 07 '21 at 22:04

The thing you have to understand about segmentation faults is that they are not a problem. They are an example of the Lord's near-infinite mercy (according to an old professor I had in college). A segmentation fault is a sign that something is very wrong, and your program thought it was a good idea to access memory where there was no memory to be had. That access is not in itself the problem; the problem came at some indeterminate time before, when something went wrong, that eventually caused your program to think that this access was a good idea. Accessing non-existent memory is just a symptom at this point, but (and this is where the Lord's mercy comes into it) it's an easily-detected symptom. It could be much worse; it could be accessing memory where there is memory to be had, just, the wrong memory. The OS can't save you from that.

The OS has no way to figure out what caused your program to believe something so absurd, and the only thing it can do is shut things down, before it does something else insane in a way the OS can't detect so easily. Usually, most OSes also provide a core dump (a saved copy of the program's memory), which could in theory be used to figure out what the program thought it was doing. This isn't really straightforward for any non-trivial program, but that's why the OS does it, just in case.

score 12 · Answer 5 · answered Dec 08 '21 at 12:55

While your question asks specifically about segmentation faults, the real question is:

If a software or hardware component is commanded to do something nonsensical or even impossible, what should it do? Do nothing at all? Guess what actually needs to be done and do that? Or use some mechanism (such as "throwing an exception") to halt the higher-level computation which issued the nonsensical command?

The vast weight of experience gathered by many engineers, over many years, agrees that the best answer is halting the overall computation, and producing diagnostic information which may help someone figure out what is wrong.

Aside from illegal access to protected or nonexistent memory, other examples of 'nonsensical commands' include telling a CPU to divide an integer by zero or to execute junk bytes which do not decode to any valid instruction. If a programming language with run-time type checking is used, trying to invoke any operation which is not defined for the data types involved is another example.

But why is it better to force a program which tries to divide by zero to crash? Nobody wants their programs to crash. Couldn't we define division-by-zero to equal some number, such as zero, or 73? And couldn't we create CPUs which would skip over invalid instructions without faulting? Maybe our CPUs could also return some special value, like -1, for any read from a protected or unmapped memory address. And they could just ignore writes to protected addresses. No more segfaults! Whee!

Certainly, all those things could be done, but it wouldn't really gain anything. Here's the point: While nobody wants their programs to crash, not crashing does not mean success. People write and run computer programs to do something, not just to "not crash". If a program is buggy enough to read or write random memory addresses or attempt to divide by zero, the chances are very low that it will do what you actually want, even if it is allowed to continue running. On the other hand, if the program is not halted when it attempts crazy things, it may end up doing something that you do not want, such as corrupting or destroying your data.

Historically, some programming languages have been designed to always "just do something" in response to nonsensical commands, rather than raising a fatal error. This was done in a misguided attempt to be more friendly to novice programmers, but it always ended badly. The same would be true of your suggestion that operating systems should never crash programs due to segfaults.

score 10 · Answer 6 · answered Dec 07 '21 at 20:56

At the machine-code level, many platforms would allow programs that are "expecting" segmentation faults in certain circumstances to adjust the memory configuration and resume execution. This may be useful for implementing things like stack monitoring. If one needs to determine the maximum amount of stack that was ever used by an application, one could set the stack segment to allow access only to a small amount of stack, and then respond to segmentation faults by adjusting the bounds of the stack segment and resuming code execution.
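As a rough illustration of the "expect the fault, adjust the memory configuration, resume" idea, here is a minimal sketch for a Linux-like system. It uses page protections on an `mmap`ed region rather than a real stack segment, and every name in it (`count_pages_touched`, `on_segv`, `region`) is made up for the example. It also assumes that `mprotect` may be called from a signal handler, which POSIX does not promise.

```c
#define _DEFAULT_SOURCE /* glibc: keep MAP_ANONYMOUS visible under strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;                        /* start of the monitored area */
static size_t region_len;                   /* its size in bytes */
static size_t page_size;
static volatile sig_atomic_t pages_opened;  /* high-water mark, in pages */

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t lo = (uintptr_t)region;
    (void)sig;
    (void)ctx;
    if (addr < lo || addr >= lo + region_len)
        _exit(1);  /* not a fault we were expecting: give up */
    /* Assumption: mprotect is not on POSIX's async-signal-safe list;
       this relies on it being a plain system call on the platform. */
    if (mprotect((void *)(addr & ~(uintptr_t)(page_size - 1)), page_size,
                 PROT_READ | PROT_WRITE) != 0)
        _exit(1);
    pages_opened++;
    /* Returning restarts the faulting store, which now succeeds. */
}

/* Write to the first `touch` pages of a `total`-page inaccessible region.
   Returns the number of pages the handler had to open, or -1 on setup failure. */
int count_pages_touched(size_t total, size_t touch)
{
    struct sigaction sa, old;
    volatile char *p;
    assert(total > 0 && touch <= total);
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region_len = total * page_size;
    region = mmap(NULL, region_len, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    pages_opened = 0;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;               /* ask for the faulting address */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(region, region_len);
        return -1;
    }
    p = region;
    for (size_t i = 0; i < touch; i++)
        p[i * page_size] = 1;               /* first write to each page faults once */
    sigaction(SIGSEGV, &old, NULL);         /* restore the previous disposition */
    munmap(region, region_len);
    return (int)pages_opened;
}
```

This only works because the fault is expected: the handler knows which address range it owns and exactly what "fixing" the fault means. A fault at any other address still ends the process.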

At the C language level, however, supporting such semantics would greatly impede optimization. If one were to write something like:

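void function2(float);  /* defined elsewhere */

void test(float *p, int *q)
{
  float temp = *p;   /* in source order, *p is read first */
  if (*q += 1)       /* then *q is incremented; false only if *q was -1 */
    function2(temp);
}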

a compiler might regard the read of *p and the read-modify-write sequence on *q as being unsequenced relative to each other, and generate code that only reads *p in cases where the initial value of *q wasn't -1. This wouldn't affect program behavior at all if p were valid, but if p were invalid, this change could result in the segmentation fault from the access to *p occurring after *q was incremented, even though in the source the access that triggered the fault comes before the increment.

For a language to efficiently and meaningfully support recoverable segmentation faults, it would have to document the range of permissible and non-permissible optimizations in much more detail than the C Standard has ever done, and I see no reason to expect future versions of the C Standard to include such detail.

supercat

There is `restrict` keyword in C for compilers to optimize – qwr Dec 08 '21 at 15:28
@qwr: The `restrict` keyword allows some optimizations, but it can't handle cases where pointers are guaranteed to identify either the same array segment or disjoint array segments, but never to identify *partially* overlapping array segments. Further, because of sloppiness in the spec, equality comparisons between restrict-qualified pointers and other pointers that may or may not be based upon them are essentially broken in ways that both clang and gcc "exploit" so as to make them useless. In a construct like `if (restrictPtr == otherPtr) *restrictPtr = 123;`, it's ambiguous... – supercat Dec 08 '21 at 16:53
...whether the pointer value used in the lvalue `*restrictPtr` is based upon `restrictPtr`, and neither clang nor gcc will reliably recognize it as being so (the way the Standard's "formal specification of restrict" is written, replacing `*restrictPtr = 123` with `*otherPtr = 123;` would never observably affect program behavior, and since `*otherPtr = 123;` would access storage with a pointer not based on `restrictPtr`, the assignment `*restrictPtr = 123;` could be treated as doing likewise). – supercat Dec 08 '21 at 17:07
@qwr: The Standard could be much easier to reason about and process correctly in all corner cases if for each pointer `p` there was a three-way split of other pointers: those that were definitely based upon `p`, those that were definitely not based upon `p`, and those fitting neither category, with the pointers in the latter category being usable to access storage which was accessed by either of the first two. If one accepts that some pointers won't be classifiable as definitely based upon P or definitely not based upon P, one can use simple and unambiguous rules to handle everything else. – supercat Dec 08 '21 at 17:15

score 9 · Answer 7 · answered Dec 07 '21 at 20:57

It is recoverable, but it is usually a bad idea. For example, the Microsoft C++ compiler has an option to turn segfaults into exceptions.

You can see the Microsoft SEH documentation, but even they do not suggest using it.

NoSenseEtAl

And it's only "recoverable" in the sense that the process doesn't exit immediately. It certainly isn't a good idea to just ignore the error and continue on your merry way. – Luaan Dec 08 '21 at 08:16

score 7 · Answer 8 · answered Dec 08 '21 at 03:59

Honestly, even if I could tell the computer to ignore a segmentation fault, I would not take that option.

Usually a segmentation fault occurs because you are dereferencing either a null pointer or a deallocated pointer. Dereferencing null is completely undefined behavior. When dereferencing a deallocated pointer, the data you read could be the old value, random junk or, in the worst case, values from another program. In either case I want the program to segfault rather than continue and report junk calculations.

bta · Answer 9 · 2021-12-08T04:49:23.797

Segmentation faults were a constant thorn in my side for many years. I worked primarily on embedded platforms and since we were running on bare metal, there was no file system on which to record a core dump. The system just locked up and died, perhaps with a few parting characters out the serial port. One of the more enlightening moments from those years was when I realized that segmentation faults (and similar fatal errors) are a good thing. Experiencing one is not good, but having them in place as hard, unavoidable failure points is.

Faults like that aren't generated lightly. The hardware has already tried everything it can to recover, and the fault is the hardware's way of warning you that continuing is dangerous. So much, in fact, that bringing the whole process/system crashing down is actually safer than continuing. Even in systems with protected/virtual memory, continuing execution after this sort of fault can destabilize the rest of the system.

If the moment of writing to protected memory can be caught

There are more ways to get into a segfault than just writing to protected memory. You can also get there by e.g., reading from a pointer with an invalid value. That's either caused by previous memory corruption (the damage has already been done, so it's too late to recover) or by a lack of error checking code (should have been caught by your static analyzer and/or tests).

Why is it not recoverable?

You don't necessarily know what caused the problem or what the extent of it is, so you can't know how to recover from it. If your memory has been corrupted, you can't trust anything. The cases where this would be recoverable are cases where you could have detected the problem ahead of time, so using an exception isn't the right way to solve the problem.

Note that some of these types of problems are recoverable in other languages like C#. Those languages typically have an extra runtime layer that's checking pointer addresses ahead of time and throwing exceptions before the hardware generates a fault. You don't have any of that with low-level languages like C, though.

Why does this solution avoid that unrecoverable state? Does it even?

That technique "works", but only in contrived, simplistic use cases. Continuing to execute is not the same as recovering. The system in question is still in the faulted state with unknown memory corruption; you're just choosing to continue blazing onward instead of heeding the hardware's advice to take the problem seriously. There's no telling what your program would do at that point. A program that continues to execute after potential memory corruption would be an early Christmas gift for an attacker.

Even if there wasn't any memory corruption, that solution breaks in many different common use cases. You can't enter a second protected block of code (such as inside a helper function) while already inside of one. Any segfault that happens outside a protected block of code will result in a jump to an unpredictable point in your code. That means every line of code needs to be in a protective block and your code will be obnoxious to follow. You can't call external library code, since that code doesn't use this technique and won't set the setjmp anchor. Your "handler" block can't call library functions or do anything involving pointers or you risk needing endlessly-nested blocks. Some things like automatic variables can be in an unpredictable state after a longjmp.
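To make those limitations concrete, here is a minimal sketch of the kind of `sigsetjmp`/`siglongjmp` "protected block" being discussed, written for a POSIX system. It is not the code from the linked answer; the names `guarded_read`, `demo_guarded_read`, `anchor` and `armed` are invented here. The fault is provoked on purpose by reading a page mapped with `PROT_NONE`, so this is the contrived, expected-fault case and nothing more.

```c
#define _DEFAULT_SOURCE /* glibc: keep MAP_ANONYMOUS visible under strict -std= modes */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf anchor;             /* one global anchor: blocks cannot nest */
static volatile sig_atomic_t armed;   /* nonzero only inside the protected block */

static void on_segv(int sig)
{
    (void)sig;
    if (!armed)
        _exit(1);                     /* fault outside the block: nowhere sane to jump */
    armed = 0;
    siglongjmp(anchor, 1);
}

/* Try to read one byte. Returns 0 and stores it in *out, or -1 if the read faulted. */
int guarded_read(const volatile char *addr, char *out)
{
    struct sigaction sa, old;
    volatile int rc = -1;             /* volatile: assigned after sigsetjmp */
    assert(out != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;
    if (sigsetjmp(anchor, 1) == 0) {  /* 1: also save and restore the signal mask */
        armed = 1;
        *out = *addr;                 /* the only statement that is "protected" */
        rc = 0;
    }
    armed = 0;
    sigaction(SIGSEGV, &old, NULL);
    return rc;
}

/* Returns 1 if an unreadable page is reported as faulting and a readable one is read back. */
int demo_guarded_read(void)
{
    size_t ps = (size_t)sysconf(_SC_PAGESIZE);
    char *bad = mmap(NULL, ps, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *good = mmap(NULL, ps, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char c = 0;
    int faulted, ok;
    if (bad == MAP_FAILED || good == MAP_FAILED)
        return 0;
    good[0] = 'x';
    faulted = guarded_read(bad, &c) == -1;
    ok = guarded_read(good, &c) == 0 && c == 'x';
    munmap(bad, ps);
    munmap(good, ps);
    return faulted && ok;
}
```

Note how much has to be true for this to be safe: the protected block is a single load, nothing in it calls library code, and the only local changed after `sigsetjmp` is `volatile`. A fault anywhere else still terminates the process, and nothing here says anything about why an unexpected fault happened.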

One thing missing here, about mission critical systems (or any system): In large systems in production, one can't know where, or even if the segfaults are, so the recommendation to fix the bug and not the symptom does not hold.

I don't agree with this thought. Most segmentation faults that I've seen are caused by dereferencing pointers (directly or indirectly) without validating them first. Checking pointers before you use them will tell you where the segfaults are. Split up complex statements like my_array[ptr1->offsets[ptr2->index]] into multiple statements so that you can check the intermediate pointers as well. Static analyzers like Coverity are good about finding code paths where pointers are used without being validated. That won't protect you against segfaults caused by outright memory corruption, but there's no way to recover from that situation in any case.
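A minimal sketch of what splitting up and checking `my_array[ptr1->offsets[ptr2->index]]` might look like. The struct layouts and the name `checked_lookup` are assumptions made for the example, since the real types behind `ptr1` and `ptr2` are not shown here.

```c
#include <stddef.h>

/* Hypothetical types standing in for whatever ptr1 and ptr2 point to. */
struct table  { const size_t *offsets; size_t n_offsets; };
struct cursor { size_t index; };

/* Checked form of &my_array[ptr1->offsets[ptr2->index]].
   Returns a pointer to the element, or NULL if any step would be invalid. */
const int *checked_lookup(const int *my_array, size_t n_array,
                          const struct table *ptr1, const struct cursor *ptr2)
{
    size_t offset;

    /* Step 1: the top-level pointers themselves. */
    if (my_array == NULL || ptr1 == NULL || ptr2 == NULL)
        return NULL;

    /* Step 2: the inner array and the index into it. */
    if (ptr1->offsets == NULL || ptr2->index >= ptr1->n_offsets)
        return NULL;
    offset = ptr1->offsets[ptr2->index];

    /* Step 3: the final index into my_array. */
    if (offset >= n_array)
        return NULL;

    return &my_array[offset];
}
```

Each early return is a place where an error can be logged with context. These checks catch null pointers and out-of-range indices, but not a dangling pointer that happens to be non-null.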

In short-term practice, I think my errors are only access to null and nothing more.

Good news! This whole discussion is moot. Pointers and array indices can (and should!) be validated before they are used, and checking ahead of time is far less code than waiting for a problem to happen and trying to recover.

t.niese · Answer 10 · 2021-12-07T12:40:03.783

This is by no means a complete or fully accurate answer, but it doesn't fit into a comment.

So a SIGSEGV can occur when you try to access memory in a way that you should not (like writing to it when it is read-only or reading from an address range that is not mapped). Such an error alone might be recoverable if you know enough about the environment.

But how do you want to determine why that invalid access happened in the first place?

In one comment to another answer you say:

short-term practice, I think my errors are only access to null and nothing more.

No application is error-free. So if a null pointer access can happen, why do you assume that your application does not also have, e.g., a use-after-free or an out-of-bounds access to "valid" memory locations that doesn't immediately result in an error or a SIGSEGV?

A use-after-free or out-of-bounds access could also modify a pointer so that it points to an invalid location or becomes a nullptr, but it could also have changed other locations in memory at the same time. If you now simply assume that the pointer was just not initialized and your error handling only considers this, you continue with an application that is in a state that matches neither your expectations nor the ones the compiler had when generating the code.

In that case, the application will, in the best case, crash shortly after the "recovery"; in the worst case, some variables have faulty values but it will continue to run with those. This oversight could be more harmful for a critical application than restarting it.

If, however, you know that a certain action might result in a SIGSEGV under certain circumstances, you can handle that error. For example, if you know that the memory address is valid but that the device the memory is mapped to might not be fully reliable and might cause a SIGSEGV, then recovering from that SIGSEGV might be a valid approach.

score 5 · Answer 11 · answered Dec 09 '21 at 11:16

Depends what you mean by recovery. The only sensible recovery in case the OS sends you the SEGV signal is to clean up your program and spin another one from the start, hopefully not hitting the same pitfall.

You have no way to know how much your memory got corrupted before the OS called an end to the chaos. Chances are if you try to continue from the next instruction or some arbitrary recovery point, your program will misbehave further.

The thing that it seems many of the upvoted responses are forgetting is that there are applications in which segfaults can happen in production without a programming error. And where high availability, decades of lifetime and zero maintenance are expected. In those environments, what's typically done is that the program is restarted if it crashes for any reason, segfault included. Additionally, a watchdog functionality is used to ensure that the program does not get stuck in an unplanned infinite loop.

Think of all the embedded devices you rely on that have no reset button. They rely on imperfect hardware, because no hardware is perfect. The software has to deal with hardware imperfections. In other words, the software must be robust against hardware misbehavior.

Embedded isn't the only area where this is crucial. Think of the number of servers handling just StackOverflow. The chance of ionizing radiation causing a single event upset is tiny if you look at any one operation at ground level, but this probability becomes non-trivial if you look at a large number of computers running 24/7. ECC memory helps against this, but not everything can be protected.

score 4 · Answer 12 · answered Dec 07 '21 at 23:52

Your program is in an undetermined state because C can't define the state. The bugs which cause these errors are undefined behavior. This is the nastiest class of bad behaviors.

The key issue with recovering from these things is that, being undefined behavior, the compiler is not obliged to support them in any way. In particular, it may have done optimizations which, if only defined behaviors occur, provably have the same effect. The compiler is completely within its rights to reorder lines, skip lines, and do all sorts of fancy tricks to make your code run faster. All it has to do is prove that the effect is the same according to the C++ abstract machine model.

When an undefined behavior occurs, all that goes out the window. You may get into difficult situations where the compiler has reordered operations and now can't get you to a state which you could arrive at by executing your program for a period of time. Remember that assignments erase the old value. If an assignment got moved up before the line that segfaulted, you can't recover the old value to "unwind" the optimization.

The behavior of this reordered code was indeed identical to the original, as long as no undefined behavior occurred. Once the undefined behavior occurred, it exposes the fact that the reorder occurred and could change results.

The tradeoff here is speed. Because the compiler isn't walking on eggshells, terrified of some unspecified OS behavior, it can do a better job of optimizing your code.

Now, because undefined behavior is always undefined behavior, no matter how much you wish it wasn't, there cannot be a standard C++ way to handle this case. The C++ language can never introduce a way to resolve this, at least short of making it defined behavior, and paying the costs for that. On a given platform and compiler, you may be able to identify that this undefined behavior is actually defined by your compiler, typically in the form of extensions. Indeed, the answer I linked earlier shows a way to turn a signal into an exception, which does indeed work on at least one platform/compiler pair.

But it always has to be on the fringe like this. The C++ developers value the speed of optimized code over defining this undefined behavior.

score 4 · Answer 13 · answered Dec 08 '21 at 22:01

As you use the term SIGSEGV I believe you are using a system with an operating system and that the problem occurs in your user land application.

When the application gets the SIGSEGV it is a symptom of something gone wrong before the memory access. Sometimes it can be pinpointed to exactly where things went wrong, generally not. So something went wrong, and a while later this wrong was the cause of a SIGSEGV. If the error happened "in the operating system" my reaction would be to shut down the system, with a few very specific exceptions, such as when the OS has a specific function to check for a memory card or IO card being installed (or perhaps removed).

In the user land I would probably divide my application into several processes. One or more processes would do the actual work. Another process would monitor the worker process(es) and could discover when one of them fails. A SIGSEGV in a worker process could then be discovered by the monitor process, which could restart the worker process or do a fail-over or whatever is deemed appropriate in the specific case. This would not recover the actual memory access, but might recover the application function.
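A minimal sketch of such a monitor on a POSIX system, using `fork` and `waitpid`. The `worker` here is a hypothetical stand-in that deliberately kills itself with `raise(SIGSEGV)` on its first few attempts; the names `supervise` and `crash_until` are invented for the example, and a real monitor would also want back-off, logging and fail-over.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker: "crashes" until attempt reaches crash_until, then succeeds. */
static int worker(int attempt, int crash_until)
{
    if (attempt < crash_until)
        raise(SIGSEGV);  /* stand-in for a real fault; the default action kills us */
    return 0;
}

/* Run the worker in a child process and restart it each time it dies of SIGSEGV.
   Returns the number of restarts that were needed, or -1 if it could not be
   brought up within max_restarts restarts or failed in some other way. */
int supervise(int crash_until, int max_restarts)
{
    assert(max_restarts >= 0);
    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        int status;
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            signal(SIGSEGV, SIG_DFL);  /* make sure the fault is fatal in the child */
            _exit(worker(attempt, crash_until));
        }
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return attempt;            /* clean finish */
        if (!(WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV))
            return -1;                 /* a different failure: do not mask it */
        /* The worker died of SIGSEGV; loop around and start a fresh process. */
    }
    return -1;                         /* it kept crashing */
}
```

The monitor never touches the worker's memory. Each restart begins from a fresh address space, which is the only state that can be trusted after a crash.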

You might look into the Erlang philosophy of "fail early" and the OTP library for further inspiration about this way of doing things. It does not handle SIGSEGV though, but several other types of problems.

score 4 · Answer 14 · answered Dec 10 '21 at 15:02

Your program cannot recover from a segmentation fault because it has no idea what state anything is in.

Consider this analogy.

You have a nice house in Maine with a pretty front garden and a stepping stone path running across it. For whatever reason, you've chosen to connect each stone to the next with a ribbon (a.k.a. you've made them into a singly-linked list).
One morning, coming out of the house, you step onto the first stone, then follow the ribbon to the second, then again to the third but, when you step onto the fourth stone, you suddenly find yourself in Albuquerque.

Now tell us - how do you recover from that?

Your program has the same quandary.
Something went spectacularly wrong but your program has no idea what it was, or what caused it or how to do anything useful about it.
Hence: it crashes and burns.

score 3 · Answer 15 · answered Dec 08 '21 at 11:14

It is absolutely possible, but this would duplicate existing functionality in a less stable way.

The kernel will already receive a page fault exception when a program accesses an address that is not yet backed by physical memory, and will then assign and potentially initialize a page according to the existing mappings, and then retry the offending instruction.

A hypothetical SEGV handler would do the exact same thing: decide what should be mapped at this address, create the mapping and retry the instruction -- but with the difference that if the handler would incur another SEGV, we could go into an endless loop here, and detection would be difficult since that decision would need to look into the code -- so we'd be creating a halting problem here.

The kernel already allocates memory pages lazily, allows file contents to be mapped and supports shared mappings with copy-on-write semantics, so there isn't much to gain from this mechanism.

Poopoo · Answer 16 · 2021-12-07T19:37:47.477

-6

So far, answers and comments have responded through the lens of a higher-level programming model, which fundamentally limits the creativity and potential of the programmer for their convenience. Said models define their own semantics and do not handle segmentation faults for their own reasons, whether simplicity, efficiency or anything else. From that perspective, a segfault is an unusual case that is indicative of programmer error, whether the userspace programmer or the programmer of the language's implementation. The question, however, is not about whether or not it's a good idea, nor is it asking for any of your thoughts on the matter.

In reality, what you say is correct: segmentation faults are recoverable. As with any regular signal, you can attach a handler for it with sigaction. And, yes, your program can most certainly be made in such a way that handling segmentation faults is a normal feature.

One obstacle is that a segmentation fault is a fault rather than a trap, which matters for where control flow returns to after it has been handled. Specifically, a fault handler returns to the same faulting instruction, which will continue to fault indefinitely. This isn't a real problem, though, as it can be skipped manually, you may return to a specified location, you may attempt to patch the faulting instruction into becoming correct or you may map said memory into existence if you trust the faulting code. With proper knowledge of the machine, nothing is stopping you, not even those spec-wielding knights.

edited Dec 07 '21 at 19:37

answered Dec 07 '21 at 19:19

My old manual said you could `longjmp()` out of the signal handler. – Joshua Dec 07 '21 at 19:34
*In reality, what you say is correct: segmentation faults are recoverable.* That is soooo ***wrong***. In general, **no, they are not**. When you get a `SIGSEGV` when you call `malloc()` or `free()`, all you know is you have a corrupt heap. You have no real way to tell where that corruption is nor what the cause is. And you certainly have no way to fix it. – Andrew Henle Dec 07 '21 at 20:26
@AndrewHenle Seems you have skipped my first paragraph. `malloc` and `free` indicate a higher-level programming model. – Poopoo Dec 07 '21 at 20:27
No, I didn't. Once you get into that state, you can't tell how you got there. All you know is that you're in a mine field and you've already stepped on one land mine. There's no guaranteed safe path out in general. – Andrew Henle Dec 07 '21 at 20:28
You are still missing my point. If you use `malloc` and `free`, you are voluntarily giving up your ability to handle segmentation faults by using a higher-level programming model that is nondeterministic. They are still, nonetheless, recoverable to the machine. – Poopoo Dec 07 '21 at 20:30
OK, then, explain how, in general, you can use only async-signal-safe functions to recover from a `SIGSEGV` in a controlled fashion, from any context in a way that's a demonstrable and clear improvement on dropping a core file and giving up. – Andrew Henle Dec 07 '21 at 20:32
And "don't use a higher-level programming model" is a cop-out. It's an academic abstraction along the lines of an "irresistible force" or "immovable object" - it doesn't exist in reality. The mere concept of "segmentation fault" can only exist in a complex model in the first place. – Andrew Henle Dec 07 '21 at 20:38
Segmentation faults are "recoverable" in the sense that you can force program execution to resume after one. They are generally not "recoverable" in that you can put the program back into a defined state afterwards. – Mark Dec 07 '21 at 22:08
This answer is a bit ridiculous. Yes programmers do make mistakes that lead to segmentation faults, but I have also seen cases where the compiler or even standard library itself has caused segmentation faults. These standard library / compiler errors are significantly rarer but they happen. Don't blame the programmer for everything especially in a language like C or C++. – Zachary Kraus Dec 08 '21 at 03:50
You can catch SIGSEGV and do some limited stuff like printing a backtrace before exiting, or maybe even doing an `execve` to restart yourself. But that's not fully *recovering* - when people say that, they're talking about repairing the situation and resuming execution. Like mapping some (read+write) memory at the pointed-to address could let the faulting instruction complete, but to what purpose? Assuming you weren't expecting a SIGSEGV in the first place, some other state is probably already messed up, even in a program hand-written in asm. – Peter Cordes Dec 08 '21 at 05:35
Re: expecting a SIGSEGV, e.g. in a JIT sandbox that uses the end of a page as the end of an array, to get bounds checking for free. Some real JITs do this, see [Implicit Java Array Bounds Checking on 64-bit Architectures](https://www2.cs.arizona.edu/~dkl/Publications/Papers/ics.pdf). Or as mentioned in [Why are segfaults called faults (and not aborts) if they are not recoverable?](//stackoverflow.com/q/49396346) JVMs can eliminate some null pointer checks in JITed code by catching SIGSEGV after the fact. So yes, if you're expecting segfaults, then you can catch them and treat as exceptions. – Peter Cordes Dec 08 '21 at 05:37
Of course, those things are only possible in assembly language, because the JIT knows exactly how it's laying out instructions. If an ahead-of-time-compiled C++ program itself segfaults, there's no well-defined way to recover. This is a [c] [c++] question. Anything that could lead to a segfault is already C++ UB so the standard (and real implementations) don't provide any way to fully recover. – Peter Cordes Dec 08 '21 at 05:41
@PeterCordes Yeah, and 64-bit address space really gives you some space to partition your memory in a way that allows you to practically recover from _expected_ segmentation faults - nothing you'd want to do manually (or really even _could_ in a language like C++ without using assembly), but it presents some interesting opportunities for managed languages. But thinking you could ever fully recover from a segmentation fault in a language as full of undefined behaviour (and compilers routinely exploiting that) as C++ is sheer madness :D – Luaan Dec 08 '21 at 08:22
@Luaan: Yes, good summary. I turned those comments into [an answer](https://stackoverflow.com/questions/70258418/why-is-a-segmentation-fault-not-recoverable/70270762#70270762) which makes the same point / distinction. Even in a language like Java that has no UB, you normally shouldn't try to recover from a totally unexpected NullPointerException, since that usually means a bug messed up some data before you got to the point that faults. – Peter Cordes Dec 08 '21 at 08:24
