We see some strange behavior in our distributed apps. We haven't found the cause yet, but we suspect it is related to OutOfMemory errors.
We try to follow good coding practice regarding fatal errors, such as never catching every Throwable and catching NonFatal ones at most. But I realized there is something I don't quite understand about fatal errors happening inside a Future, and pretty much all of our code ends up wrapped in a Future at some point.
Here is a minimal example:
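import scala.concurrent.{Await, Future}, scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
val f = Future { /* some code that triggers an OOM error */ }
val result = Await.result(f, 1.minute)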
What happens is:
- The thread running the future's code fails. The OOM is printed to stderr, but we don't see it because the app is deployed "somewhere" and we didn't redirect stderr.
- However, the Future never completes (it only catches NonFatal). Worse, it doesn't free its resources.
- After 1 minute we get a TimeoutException with no relation to the OOM. Hopefully this releases the resources, but we have lost time and other threads might be affected. We then treat it as a future that didn't have time to finish, the same way as if some DB access didn't respond in time, i.e. we typically try again.
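The behavior above can be reproduced without really exhausting the heap by throwing an OutOfMemoryError by hand. This is only a minimal sketch, assuming the global execution context; the names `FatalInFutureDemo`, `failingFuture` and `awaitOutcome` are made up for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object FatalInFutureDemo {
  // Simulates the fatal error instead of actually running out of memory.
  def failingFuture(): Future[Int] =
    Future { throw new OutOfMemoryError("simulated OOM") }

  // Waits for the future and captures what the caller actually observes.
  // The fatal error escapes on the pool thread, so the future is never
  // completed and the caller only gets the timeout.
  def awaitOutcome(timeout: FiniteDuration): Try[Int] =
    Try(Await.result(failingFuture(), timeout))
}
```

Calling `FatalInFutureDemo.awaitOutcome(2.seconds)` gives back a failure holding a TimeoutException, while the OutOfMemoryError itself only shows up on stderr of the pool thread.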
I found a good description of the issue here: https://github.com/scala/bug/issues/9554
My question: how should we handle fatal errors happening in a Future?
- At least, the whole app should fail, as it would if a fatal error happened in the main thread. Maybe with a core dump.
- At best, we would have proper management to: log the exception, apply a suitable re-execution pattern, maybe gracefully kill other running futures/threads, ...
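To make the first option more concrete, here is the kind of thing I have in mind: a wrapper around the ExecutionContext that intercepts whatever NonFatal does not match before it is lost on the pool thread. This is a rough sketch, not a known-good solution; `FatalReportingContext` and `onFatal` are hypothetical names, and I'm assuming the fatal throwable propagates out of the Runnable that the Future submits.

```scala
import scala.concurrent.ExecutionContext
import scala.util.control.NonFatal

// Hypothetical wrapper: runs every task on `underlying`, but hands any
// throwable that NonFatal does not match to `onFatal` before rethrowing it.
final class FatalReportingContext(
    underlying: ExecutionContext,
    onFatal: Throwable => Unit
) extends ExecutionContext {

  override def execute(task: Runnable): Unit =
    underlying.execute(new Runnable {
      override def run(): Unit =
        try task.run()
        catch {
          case t: Throwable if !NonFatal(t) =>
            onFatal(t) // assumption: the handler logs and/or stops the JVM
            throw t    // keep the original behavior for the pool thread
        }
    })

  // Non-fatal failures are still reported by the wrapped context.
  override def reportFailure(cause: Throwable): Unit =
    underlying.reportFailure(cause)
}
```

Futures would then be created with something like `implicit val ec: ExecutionContext = new FatalReportingContext(ExecutionContext.global, t => { t.printStackTrace(); Runtime.getRuntime.halt(1) })`. I'm not sure how reliable this is after a real OOM, since even the handler may fail to allocate, and it does nothing for the "at best" option of completing the Future with the error.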
Note: this is a similar problem to "Exception causes Future to never complete", but the answer there is "this is intended", not how to manage it.