61

I'm aware that StackOverflowExceptions in .NET can't be caught, take down their process, and have no stack trace. This is officially documented on MSDN. However, I'm wondering what the technical (or other) reasons are behind the behavior. All MSDN says is:

In prior versions of the .NET Framework, your application could catch a StackOverflowException object (for example, to recover from unbounded recursion). However, that practice is currently discouraged because significant additional code is required to reliably catch a stack overflow exception and continue program execution.

What is this "significant additional code"? Are there other documented reasons for this behavior? Even if we can't catch SOE, why can't we at least get a stack trace? Several co-workers and I just sank several hours into debugging a production StackOverflowException that would have taken minutes with a stack trace, so I'm wondering if there is a good reason for my suffering.

tckmn
ChaseMedallion
  • 22
    "there's no more free space on the stack. Quick, put the necessary extra data on the stack to enable us to throw the exception, have it record relevant information, and to find and call the appropriate handler" – jalf Mar 17 '14 at 21:18
  • 8
    @jalf That's actually the easiest aspect of this problem to overcome (the RT could simply set a "soft limit" just shy of the actual stack size, so that it's guaranteed to have enough left over if a soft-overflow occurs). – TypeIA Mar 17 '14 at 21:21
  • @HansPassant You should _seriously_ consider answering the question with that info. Awesome stuff. – julealgon Mar 17 '14 at 23:03
  • 3
    Am I the only one worrying that asking about stack overflows on, well, stackoverflow, could cause the universe to collapse? – fgp Mar 18 '14 at 01:46
  • 1
    Related Java questions: [Why does this method print 4?](http://stackoverflow.com/q/17828584/7586) and [Understanding java stack](http://stackoverflow.com/q/15083318/7586) - Show what might happen when you do recover from a StackOverflow - additional method calls cause additional stack overflow errors, not a good state to be in. Of course, Java is not .Net, but I think it is interesting. – Kobi Mar 18 '14 at 07:05

3 Answers

86

The stack of a thread is created by Windows. It uses so-called guard pages to detect a stack overflow, a feature that's generally available to user-mode code, as described in this MSDN Library article. The basic idea is that the last two pages of the stack (2 x 4096 = 8192 bytes) are reserved, and any processor access to them triggers a page fault that's turned into an SEH exception, STATUS_GUARD_PAGE_VIOLATION.

This is intercepted by the kernel in the case of those pages belonging to a thread stack. It changes the protection attributes of the first of those 2 pages, thus giving the thread some emergency stack space to deal with the mishap, then re-raises a STATUS_STACK_OVERFLOW exception.

This exception is in turn intercepted by the CLR. At that point there are about 3 kilobytes of stack space left. This is, for one, not enough to run the just-in-time compiler (JITter) to compile the code that could deal with the exception in your program; the JITter needs much more space than that. The CLR therefore cannot do anything but rudely abort the thread. And by .NET 2.0 policy, that also terminates the process.

Note how this is less of a problem in Java: it has a bytecode interpreter, so there's a guarantee that executable user code can run. The same goes for a non-managed program written in a language like C, C++ or Delphi, where code is generated at build time. It is, however, still a very difficult mishap to deal with: the emergency space in the stack is blown, so there is no scenario where continuing to run code on the thread is safe. The likelihood that a program can continue operating correctly with a thread aborted at a completely random location and rather corrupted state is quite unlikely.
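To make the Java comparison concrete, here is a minimal sketch of a Java program that overflows its stack and then keeps running. The class and method names are made up for illustration. It only shows that the JVM lets user code catch the error; it says nothing about whether the program's state is still consistent afterwards, which is the harder problem described above.

```java
public class RecoverFromOverflow {
    // Counts how many frames were entered before the overflow.
    static int depth = 0;

    // Recurses without a base case until the JVM raises StackOverflowError.
    static void recurseForever() {
        depth++;
        recurseForever();
    }

    // Returns true if the overflow was caught in this (shallow) frame.
    static boolean overflowAndRecover() {
        depth = 0;
        try {
            recurseForever();
            return false; // not expected to be reached
        } catch (StackOverflowError e) {
            // The handler runs in this method's frame, after the deep
            // frames have been unwound, so stack space is available again.
            return depth > 0;
        }
    }

    public static void main(String[] args) {
        boolean recovered = overflowAndRecover();
        System.out.println("recovered: " + recovered + ", frames entered: " + depth);
    }
}
```

The number of frames reached depends on the JVM, its settings and the thread's stack size, so it will differ from run to run and machine to machine.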

If there was any effort at all in considering raising an event on another thread or in removing the restriction in the winapi (the number of guard pages is not configurable) then that's either a very well-kept secret or just wasn't considered useful. I suspect the latter, but don't know it for a fact.

Hans Passant
  • 5
    +1 for this "The likelihood that a program can continue operating correctly with a thread aborted at a completely random location and rather corrupted state is quite unlikely" alone. If people would only get this - regardless of the circumstances that lead to such a situation. – Christian.K Mar 18 '14 at 05:42
  • 1
    This is exactly the kind of information I was looking for. Any idea why we can't at least get a stack trace with the exception? – ChaseMedallion Mar 18 '14 at 11:36
  • But you can, a debugger has no trouble showing you one. Which works fine, it is a separate process so there's little danger of blowing the remaining 7KB stack, invoking the debugger break is cheap. Preemptively: do keep in mind that logging is never a subtle implementation detail. – Hans Passant Mar 18 '14 at 11:51
  • 2
    I don't buy it. The OP starts with a quote indicating that it used to be possible, so the real question is what changed. – Gabe Mar 18 '14 at 16:10
  • 3
    Hmya, the CLR v1.x policy of just letting the thread die without an AppDomain.UnhandledException callback was widely despised. Got the Windows group at Microsoft to hate managed code so heavily when they tried to use it in Longhorn. – Hans Passant Mar 18 '14 at 16:37
  • One might think that the CLR could attempt to recover by unwinding the stack to the first catch frame before invoking it. If that restores sufficient stack to JIT the code needed for the catch block, then recover gracefully; otherwise unwind it the rest of the way and let the default error handler deal with it. But then again, I'm not on the CLR team, so what do I know? – Eric Lloyd Mar 18 '14 at 18:39
  • 2
    Also, it might be an interesting exercise to explore how Mono handles it. – Eric Lloyd Mar 18 '14 at 18:40
16

The stack is where virtually everything about the state of a program is stored: the address of each return site when methods are called, local variables, method parameters, etc. If a method overflows the stack, its execution must, by necessity, stop immediately (since there is no more stack space left for it to continue running). Then, to gracefully recover, somebody needs to clean up whatever that method did to the stack before it died. This means knowing what the stack looked like before the method was called. This incurs some overhead.

And if you can't clean up the stack, then you can't get a stack trace either, because the information required to generate the trace comes from "unrolling" the stack to discover which methods were called.
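As a rough illustration of a trace being built by walking the frames on the stack, the Java sketch below recurses a few levels and then asks the runtime for the current thread's frames. The class and method names are hypothetical; `Thread.getStackTrace()` and `StackTraceElement.getMethodName()` are standard JDK calls. It is not a model of how the CLR does it, just a way to see that each entry in a trace corresponds to one live frame.

```java
public class WalkTheStack {
    // Recurses `remaining` more levels, then captures the current thread's frames.
    static StackTraceElement[] captureAtDepth(int remaining) {
        if (remaining == 0) {
            return Thread.currentThread().getStackTrace();
        }
        return captureAtDepth(remaining - 1);
    }

    // Counts how many frames in the trace belong to captureAtDepth.
    static int countRecursiveFrames(StackTraceElement[] trace) {
        int count = 0;
        for (StackTraceElement frame : trace) {
            if (frame.getMethodName().equals("captureAtDepth")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        StackTraceElement[] trace = captureAtDepth(5);
        System.out.println(countRecursiveFrames(trace) + " recursive frames found");
        for (StackTraceElement frame : trace) {
            System.out.println("  at " + frame);
        }
    }
}
```

Calling `captureAtDepth(5)` leaves six `captureAtDepth` frames on the stack (arguments 5 down to 0), and the walk reports one entry for each of them.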

TypeIA
  • But is it not so, that in the managed environment the execution stack is somehow *wrapped* too, so that if the space is violated, then the managed environment can handle this gracefully (and possibly not so costly)? – keenthinker Mar 17 '14 at 21:25
  • 1
    @pasty The stack is not wrapped. Storage for managed stacks is allocated and committed when the thread is created. There's no option to extend this at run-time. – Brian Rasmussen Mar 17 '14 at 21:27
  • 1
    Why not just destroy the current thread that overflowed the stack? Why does the whole process need to be killed? – Bob Albright Mar 17 '14 at 21:27
  • 1
    @BobAlbright Because doing so would mean that the process would be in an undefined state. – Brian Rasmussen Mar 17 '14 at 21:28
  • 1
    Java handles this... I don't see the reason why .NET can't do the same. I also don't see why it can't back off to the last non-overflow stack frame and proceed normally (for an exception) from there. – ChaseMedallion Mar 17 '14 at 21:42
  • @ChaseMedallion .NET *could*; they have *decided* not to. – TypeIA Mar 17 '14 at 21:43
  • 1
    @dvnrrs: this is the essence of my question. WHY did .NET decide not to handle SO gracefully? – ChaseMedallion Mar 17 '14 at 21:44
  • @ChaseMedallion I'm not a Microsoft employee so this is speculation, but my guess is they decided that the cost in performance/complexity was too high, or, quite likely, their implementation was buggy or had unexpected corner cases, and they decided to simply drop it entirely. – TypeIA Mar 17 '14 at 21:47
  • 1
    @ChaseMedallion I highly recommend you go read ["How many Microsoft employees does it take to change a lightbulb?"](http://blogs.msdn.com/b/ericlippert/archive/2003/10/28/53298.aspx) by [Eric Lippert](http://stackoverflow.com/users/88656/eric-lippert) (who was a principal developer on the C# compiler team for Microsoft). It explains very well the answer to "Why did they not do this easy extra feature?" – Scott Chamberlain Mar 17 '14 at 22:28
  • @ChaseMedallion: I think "handle SO gracefully" is an oxymoron. A stack overflow isn't a graceful thing. How is the calling code meant to handle this? Retry but try and execute less code this time? – Phoshi Mar 18 '14 at 10:05
  • @Phoshi: in the case of a web application, a great way to handle an SO exception would be to proceed through the normal error handling logic (which in our case logs information useful for debugging), and, since as you say it's unlikely that the thread will be able to truly handle the SO, let the request die and return an error page to the user. – ChaseMedallion Mar 18 '14 at 10:25
7

To handle stack overflow or out-of-memory conditions gracefully, it is necessary to trigger an exception somewhat before the stack has actually overflowed or heap memory is totally exhausted, at a time when the available stack and heap resources will be adequate to execute any cleanup code that will need to run before the exceptions are caught. In the case of stack-overflow exceptions, handling them cleanly would basically require checking the stack pointer on entry to each method (which shouldn't really be all that expensive). Normally, they're handled by setting an access-violation trap just beyond the end of the stack, but the problem with doing that is that the trap won't fire until it's already too late to handle things cleanly. One could set the trap to fire on the last memory block of the stack, rather than the one past, and have the system change the trap to the block past the stack once it fires and triggers a StackOverflowException, but the problem is there would be no nice way to ensure that the "almost out of stack" trap got re-enabled once the stack had unwound that far.
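The "check on entry to each method" idea can be approximated in user code with an explicit depth budget. The sketch below is only that, a sketch: it counts nesting depth instead of inspecting the real stack pointer, the limit of 1,000 is an arbitrary assumption, and `SoftOverflowException` is a made-up type. The point is that the failure is raised while there is still plenty of stack left, so every `finally` block on the way out can run.

```java
public class SoftStackLimit {
    // Hypothetical exception type raised when the soft limit is hit.
    public static class SoftOverflowException extends RuntimeException {
        SoftOverflowException(int depth) {
            super("soft stack limit reached at depth " + depth);
        }
    }

    // Assumed budget, chosen well below the depth where the real stack would overflow.
    static final int SOFT_LIMIT = 1_000;

    static int depth = 0;        // current nesting depth (single-threaded sketch)
    static int cleanupsRun = 0;  // number of finally blocks that completed

    static void guardedRecurse() {
        // The check on entry: fail early, while there is still room to clean up.
        if (depth >= SOFT_LIMIT) {
            throw new SoftOverflowException(depth);
        }
        depth++;
        try {
            guardedRecurse();
        } finally {
            depth--;
            cleanupsRun++;
        }
    }

    // Returns true if the soft limit fired and every frame cleaned up after itself.
    static boolean demo() {
        depth = 0;
        cleanupsRun = 0;
        try {
            guardedRecurse();
            return false; // not expected: the recursion has no base case
        } catch (SoftOverflowException e) {
            return depth == 0 && cleanupsRun == SOFT_LIMIT;
        }
    }

    public static void main(String[] args) {
        System.out.println("clean unwind after soft overflow: " + demo());
    }
}
```

A depth counter is a crude stand-in, since frames differ in size; a runtime doing this for real would compare the stack pointer against a threshold instead.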

That having been said, an alternative approach would be to allow threads to set a delegate for what should happen if the thread blows its stack, and then say that in case of StackOverflowException the thread's stack will be cleared and it will run the supplied delegate. The trap could be re-instated before running the delegate (the stack would be empty at that point), and code could maintain a thread-status object that the delegate could use to know whether any important finally blocks got skipped.
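The delegate idea can be imitated in Java, where the overflow is catchable, by catching it only in the outermost frame of the thread and running a caller-supplied handler there. Everything named below (`OverflowDelegate`, `ThreadStatus`, `startWithOverflowHandler`) is hypothetical and only sketches the shape of such an API; .NET offers nothing like it.

```java
import java.util.function.Consumer;

public class OverflowDelegate {
    // Hypothetical per-thread status object the handler can inspect.
    public static class ThreadStatus {
        public volatile boolean cleanupPending = false; // set while important work is unfinished
        public volatile boolean handlerRan = false;
    }

    // Runs `work` on a new thread. If it overflows, the error propagates to the
    // thread's outermost frame, where almost the whole stack is free again,
    // and only then is `onOverflow` invoked.
    static Thread startWithOverflowHandler(Runnable work, ThreadStatus status,
                                           Consumer<ThreadStatus> onOverflow) {
        Thread t = new Thread(() -> {
            try {
                work.run();
            } catch (StackOverflowError e) {
                onOverflow.accept(status);
            }
        });
        t.start();
        return t;
    }

    static void recurseForever() {
        recurseForever();
    }

    static ThreadStatus demo() {
        ThreadStatus status = new ThreadStatus();
        Thread t = startWithOverflowHandler(
            () -> {
                status.cleanupPending = true;  // important work begins
                recurseForever();              // blows the stack
                status.cleanupPending = false; // never reached
            },
            status,
            s -> s.handlerRan = true);
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return status;
    }

    public static void main(String[] args) {
        ThreadStatus s = demo();
        System.out.println("handler ran: " + s.handlerRan
            + ", cleanup still pending: " + s.cleanupPending);
    }
}
```

Here the handler sees `cleanupPending` still set, which is the kind of signal the status object is meant to carry.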

supercat