What should a CoroutineExceptionHandler do with an OutOfMemoryError or other fatal error?

Question

I'm implementing a custom Kotlin CoroutineScope that deals with receiving, handling and responding to messages over a WebSocket connection. The scope's lifecycle is tied to the WebSocket session, so it's active as long as the WebSocket is open. As part of the coroutine scope's context, I've installed a custom exception handler that will close the WebSocket session if there's an unhandled error. It's something like this:

val handler = CoroutineExceptionHandler { _, exception -> 
    log.error("Closing WebSocket session due to an unhandled error", exception)
    session.close(POLICY_VIOLATION)
}

I was surprised to find that the exception handler doesn't just receive exceptions, but is actually invoked for all unhandled throwables, including subtypes of Error. I'm not sure what I should do with these, since I know from the Java API documentation for Error that "an Error [...] indicates serious problems that a reasonable application should not try to catch".

One particular situation that I ran into recently was an OutOfMemoryError due to the amount of data being handled for a session. The OutOfMemoryError was received by my CoroutineExceptionHandler, meaning it was logged and the WebSocket session was closed, but the application continued running. That makes me uncomfortable, because I know that an OutOfMemoryError can be thrown at any point during code execution and as a result can leave the application in an irrecoverable state.

My first question is this: why does the Kotlin API choose to pass these errors to the CoroutineExceptionHandler for me, the programmer, to handle?

And my second question, following directly from that, is: what is the appropriate way for me to handle it? I can think of at least three options:

1. Continue to do what I'm doing now, which is to close the WebSocket session where the error was raised and hope that the rest of the application can recover. As I said, that makes me uncomfortable, particularly when I read answers like this one, in response to a question about catching OutOfMemoryError in Java, which recommends strongly against trying to recover from such errors.
2. Re-throw the error, letting it propagate to the thread. That's what I would normally do in any other situation where I encounter an Error in normal (or framework) code, on the basis that it will eventually cause the JVM to crash. In my coroutine scope, though (as with multithreading in general), that's not an option. Re-throwing the exception just ends up sending it to the thread's UncaughtExceptionHandler, which doesn't do anything with it.
3. Initiate a full shutdown of the application. Stopping the application feels like the safest thing to do, but I'd like to make sure I fully understand the implications. Is there any mechanism for a coroutine to propagate a fatal error to the rest of the application, or would I need to code that capability myself? Is propagation of 'application-fatal' errors something the Kotlin coroutines API designers have considered, or might consider in a future release? How do other multithreading models typically handle these kinds of errors?

(I don't know much about coroutines, but from general Java development I think your caution is absolutely justified! It depends on the nature of the app, but I found the best solution was to shut down the app immediately. The system would then restart it, so it would continue with only a shortish delay. Otherwise, it might limp on indefinitely, potentially missing some vital processing; or suffer cascading failures; or grind to a halt and become unresponsive.) - gidds, Aug 22 '20 at 13:12
Of course, immediate shutdown isn't trivial. First, you need to make sure it happens on every thread, even system ones; you can set a default UncaughtExceptionHandler for that. In that, log the error if needed, taking great care to trap any further Errors that occur in the process. Then shut down the app immediately with Runtime.halt(). (Unlike Runtime.exit(), that doesn't run shutdown hooks and finalisers first; you probably don't want those, as they may not work properly in a low-memory environment, and either way they may take a long time to run.) - gidds, Aug 22 '20 at 13:18
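
The approach described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: `isFatal`, `installFatalErrorHandler`, `FATAL_EXIT_STATUS` and the injectable `halt` parameter are hypothetical names, and treating every VirtualMachineError as fatal is a policy choice, not something the JVM prescribes.

```kotlin
// Hypothetical exit status; any non-zero value would do.
const val FATAL_EXIT_STATUS = 70

// Assumption: JVM-level failures (OutOfMemoryError, StackOverflowError, ...) count as fatal.
fun isFatal(t: Throwable): Boolean = t is VirtualMachineError

// Installs a JVM-wide default handler. `halt` is injectable so the logic can be
// exercised without actually stopping the JVM.
fun installFatalErrorHandler(halt: (Int) -> Unit = { Runtime.getRuntime().halt(it) }) {
    val previous = Thread.getDefaultUncaughtExceptionHandler()
    Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
        if (isFatal(throwable)) {
            try {
                System.err.println("Fatal error on thread ${thread.name}: $throwable")
            } catch (ignored: Throwable) {
                // Logging may itself fail when memory is exhausted; halt regardless.
            } finally {
                // halt() skips shutdown hooks and finalizers, unlike exit().
                halt(FATAL_EXIT_STATUS)
            }
        } else if (previous != null) {
            previous.uncaughtException(thread, throwable)
        } else {
            throwable.printStackTrace()
        }
    }
}
```

Calling `installFatalErrorHandler()` once at startup would be enough; non-fatal throwables still go to whatever handler was installed before.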

score 4 · Accepted Answer · answered Aug 24 '20 at 21:48

Why does the Kotlin API choose to pass these errors to the CoroutineExceptionHandler for me, the programmer, to handle?

The Kotlin docs on exceptions state:

All exception classes in Kotlin are descendants of the class Throwable.

So it seems the Kotlin documentation uses the term exception for all kinds of Throwable, including Error.

Whether an exception in a coroutine should be propagated is actually a result of choosing the coroutine builder (cf. Exception propagation):

Coroutine builders come in two flavors: propagating exceptions automatically (launch and actor) or exposing them to users (async and produce).
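
To make the two flavors concrete, here is a minimal sketch assuming kotlinx.coroutines is on the classpath. The function name `propagationDemo` and the event strings are made up for illustration, and the OutOfMemoryError is constructed by hand rather than caused by real memory exhaustion.

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.CopyOnWriteArrayList

// Records which path each failure takes. All names here are illustrative.
fun propagationDemo(): List<String> = runBlocking {
    val events = CopyOnWriteArrayList<String>()
    val handler = CoroutineExceptionHandler { _, t ->
        events += "handler: ${t.javaClass.simpleName}"
    }
    // Standalone scope, so that failures do not cancel runBlocking itself.
    val scope = CoroutineScope(SupervisorJob() + handler)

    // launch: the failure propagates automatically and reaches the handler,
    // even though it is an Error rather than an Exception.
    scope.launch { throw OutOfMemoryError("simulated") }.join()

    // async: the failure is stored in the Deferred and rethrown by await().
    val deferred = scope.async<Unit> { throw IllegalStateException("from async") }
    try {
        deferred.await()
    } catch (e: IllegalStateException) {
        events += "await: ${e.javaClass.simpleName}"
    }
    events.toList()
}

fun main() {
    propagationDemo().forEach(::println)
}
```

The handler only sees the failure from launch; the one from async surfaces at the await() call site instead.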

If you receive unhandled exceptions at the WebSocket scope, it indicates a non-recoverable problem further down the call chain. Recoverable exceptions are expected to be handled at the closest possible invocation level. So it is quite natural that you don't know how to respond at the WebSocket scope; it indicates a problem with the code you are invoking.

The coroutine functions then choose the safe path and cancel the parent job (which includes cancelling its child jobs), as stated in Cancellation and exceptions:

If a coroutine encounters an exception other than CancellationException, it cancels its parent with that exception. This behaviour cannot be overridden and is used to provide stable coroutines hierarchies for structured concurrency.
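
A small sketch of that cancellation behaviour, assuming kotlinx.coroutines is available; `siblingSurvives` is a hypothetical helper written for this illustration. It also runs the same scenario with a SupervisorJob, which the Supervision chapter of the coroutines documentation describes as the way to keep a child's failure from cancelling its siblings.

```kotlin
import kotlinx.coroutines.*

// Returns whether a sibling coroutine is still active after another child
// of the same scope has failed. Names are illustrative.
fun siblingSurvives(parent: CompletableJob): Boolean = runBlocking {
    val handler = CoroutineExceptionHandler { _, _ -> /* ignored in this demo */ }
    val scope = CoroutineScope(parent + handler)
    val sibling = scope.launch { delay(Long.MAX_VALUE) }
    // The failing child reports its exception to the parent job.
    scope.launch { throw IllegalStateException("boom") }.join()
    val survived = sibling.isActive
    scope.cancel() // release the sibling so the demo terminates
    survived
}

fun main() {
    println("Job: ${siblingSurvives(Job())}")
    println("SupervisorJob: ${siblingSurvives(SupervisorJob())}")
}
```

With a plain Job the sibling is cancelled along with the parent; with a SupervisorJob it keeps running until the scope is cancelled explicitly.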

What is the appropriate way for me to handle it?

In any case: try to log it first (as you already do). Consider providing as much diagnostic data as feasible (including a stack trace).

Remember that the coroutines library has already cancelled jobs for you. In many cases, this would be just good enough. Don't expect the coroutines library to do more than this (not now, not in a future release). It does not have the knowledge to do better. The application server typically provides a configuration for exception handling, e.g. as in Ktor.

Beyond that, it depends, and may involve heuristics and trade-offs. Don't blindly follow "best practices". You know your application's design and requirements better than others. Some aspects to consider:
- For efficient operations, restore impacted services automatically and as quickly and seamlessly as reasonable. Sometimes the easy way (shutting down and restarting everything that might be affected) is good enough.
- Evaluate the impact of recovering from an unknown state. Is it just a minor glitch that is easily noticed, or do people's lives depend on the outcome? In case of uncaught exceptions: is the application designed so that resources are released and transactions rolled back? Can dependent systems continue unaffected?
- If you have control over functions called, you might introduce a separate exception class (hierarchy) for recoverable exceptions (which have only a transitory and non-damaging effect) and treat them differently.
- When trying to recover a partially working system, consider a staged approach and handle follow-up failures:
  - If it is sufficient to shut down your coroutines only, leave it at that. You might even keep the WebSocket session open and send a restart indication message to the client. Consider the chapter on Supervision in the Kotlin coroutines documentation.
  - If that would be unsafe (or a follow-up error occurs), consider shutting down the thread. This is not relevant when coroutines are dispatched to different threads, but it is a proper solution for systems without inter-thread coupling.
  - If that would still be unsafe (or a follow-up error occurs), shut down the entire JVM. It all may depend on the exception's underlying cause.
- If your application modifies persistent data, make sure it is crash-proof by design (e.g. via atomic transactions or other automatic recovery strategies).
- If a design goal of your entire application is to be crash-proof, consider a crash-only software design instead of (possibly complex) shutdown procedures.
- In case of an OutOfMemoryError, if the cause was a singularity (e.g. one giant allocation), recovery could proceed in stages as described above. On the other hand, if the JVM cannot even allocate tiny bits, forcibly terminating the JVM via Runtime.halt() might prevent cascading follow-up errors.
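
One possible shape for such a staged response inside a CoroutineExceptionHandler is sketched below. It is not a definitive implementation: `stagedHandler`, `closeSession` and `haltJvm` are hypothetical names, the exit status 1 is arbitrary, and halting on every VirtualMachineError is an assumed policy that you may well want to relax (for example when an OutOfMemoryError stems from a single oversized allocation).

```kotlin
import kotlinx.coroutines.CoroutineExceptionHandler

// `closeSession` stands in for whatever closes the WebSocket session;
// `haltJvm` is injectable so the policy can be tested without stopping the JVM.
fun stagedHandler(
    closeSession: () -> Unit,
    haltJvm: (Int) -> Unit = { status -> Runtime.getRuntime().halt(status) },
): CoroutineExceptionHandler = CoroutineExceptionHandler { _, throwable ->
    try {
        System.err.println("Unhandled error in session scope: $throwable")
    } catch (ignored: Throwable) {
        // Logging can fail under memory pressure; carry on with the response.
    }
    if (throwable is VirtualMachineError) {
        // Assumed policy: JVM-level errors are not worth recovering from.
        haltJvm(1)
    } else {
        try {
            // The scope's jobs are already cancelled; only the session is left to close.
            closeSession()
        } catch (followUp: Throwable) {
            // A follow-up failure escalates to shutting down the JVM.
            haltJvm(1)
        }
    }
}
```

The handler would be added to the scope's context in place of the one from the question, e.g. `CoroutineScope(SupervisorJob() + stagedHandler(closeSession = { /* close the session here */ }))`.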
