We had a bit of a problem. :)
We want to ensure that only N threads are doing background tasks at any time. To do this, we used a fixed thread pool executor. It seemed to be working fine.
Then we found an issue. Suppose you have a class which uses the executor to do some parallel work and then it calls some other class while in the executor thread which also does some parallel work, intending to wait on it. Here's what happens:
- Main thread calls the first level method.
- This method thinks it can parallelise into 16 tasks and splits up its work.
- 16 tasks are submitted to the executor.
- Main thread starts waiting for its tasks to complete.
- Supposing there are four threads available, the first four tasks each get picked up and run. So there are 12 tasks left on the queue.
- Now, one of these tasks calls some other method.
- This new method thinks it can parallelise into 2 tasks. Let's say it's the first step in a parallel merge sort or something along those lines.
- 2 tasks are submitted to the executor.
- This thread now starts waiting for its tasks to complete.
Uh-oh. So at this point, all four threads will now be waiting for tasks to complete but they are collaboratively blocking the executor actually running those tasks.
Solution 1 to this problem was as follows: on submitting a new task to the executor, if we are already running all our threads, and we are already running on one of the executor threads, run the task inline. This worked fine for 10 months, but now we have hit a problem with it. If the new tasks it is submitting are still relatively large, then you can get into a situation where the new task blocks the method from adding the other tasks to the queue, which would otherwise be able to be picked up by the other worker threads. So you get periods of huge delays while a thread is processing the work inline.
Is there a better solution to the core problem of executing a potentially unbounded tree of background tasks? I understand that .NET's equivalent to the executor service has some kind of in-built ability to steal from the queue which prevents the original deadlock issue from occurring, which as far as I can tell is an ideal solution. But what about over in Java land?