From the question "Will inner parallel streams be processed fully in parallel before considering parallelizing outer stream?", I understood that parallel streams perform work stealing. However, I've noticed that it often doesn't seem to happen. For example, if I have a List of, say, 100,000 elements and process it with parallelStream(), I often notice towards the end that most of my CPU cores are sitting idle in the "waiting" state. (Note: of the 100,000 elements in the list, some take a long time to process whereas others are fast, and the list is not balanced, which is why some threads may get "unlucky" and have lots to do whereas others get lucky and have little to do.)
So, my theory is that the JIT compiler does an initial division of the 100,000 elements among the 16 threads (because I have 16 cores), but then within each thread it just runs a simple sequential for-loop (as that would be the most efficient), and therefore no work stealing would ever occur (which is what I'm seeing).
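To make the scenario concrete, here is a minimal sketch of the kind of workload I mean. The class name, the cost model (the first tenth of the list is slow, the rest is fast), and the sizes are all made up for illustration; it is not my real code. It only counts how many elements each thread ends up handling.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SkewedParallelStream {

    // Hypothetical cost model: the first tenth of the list is "slow", the rest is "fast".
    static int cost(int index, int size) {
        return index < size / 10 ? 200_000 : 200;
    }

    // Simulated CPU-bound work; the iteration count stands in for an element's cost.
    static long process(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += (acc ^ i) % 7;
        }
        return acc;
    }

    // Runs the list through parallelStream() and counts the elements handled by each thread.
    static Map<String, Integer> elementsPerThread(List<Integer> costs) {
        Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();
        costs.parallelStream().forEach(c -> {
            process(c);
            counts.computeIfAbsent(Thread.currentThread().getName(), k -> new AtomicInteger())
                  .incrementAndGet();
        });
        return counts.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().get()));
    }

    public static void main(String[] args) {
        int size = 10_000;
        List<Integer> costs = IntStream.range(0, size)
                .map(i -> cost(i, size))
                .boxed()
                .collect(Collectors.toList());
        elementsPerThread(costs).forEach((thread, n) ->
                System.out.println(thread + " handled " + n + " elements"));
    }
}
```

The per-thread counts, together with how long each thread stays busy, are what I'd look at to tell whether idle threads ever pick up leftover elements.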
I think the reason "Will inner parallel streams be processed fully in parallel before considering parallelizing outer stream?" showed work stealing is that there was an OUTER loop that was streaming and an INNER loop that was streaming, so each inner loop was evaluated at run time and created new tasks that could then be assigned to "idle" threads. Thoughts? Am I doing something wrong, or is there a way to "force" a simple list.parallelStream() to use work stealing? (My current workaround is to try to balance the list using various heuristics so that each thread usually sees about the same amount of work, but that's hard to predict.)
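For reference, the balancing workaround can be sketched roughly as below. This is only one possible heuristic, not my exact code: `ListBalancer`, `balance`, the `estimatedCost` function, and the `chunks` parameter are hypothetical names, and `chunks` is just a guess at how many contiguous pieces the stream ends up splitting the list into. The idea is to sort by estimated cost and then deal the elements out so that every contiguous chunk gets a mix of expensive and cheap ones.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToLongFunction;

public class ListBalancer {

    // Reorders items so that each of the `chunks` contiguous slices of the result
    // holds every chunks-th element of the cost-sorted list (most expensive first).
    static <T> List<T> balance(List<T> items, ToLongFunction<T> estimatedCost, int chunks) {
        List<T> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingLong(estimatedCost).reversed());

        List<T> balanced = new ArrayList<>(sorted.size());
        for (int c = 0; c < chunks; c++) {
            // Slice c takes elements c, c + chunks, c + 2 * chunks, ... of the sorted list.
            for (int i = c; i < sorted.size(); i += chunks) {
                balanced.add(sorted.get(i));
            }
        }
        return balanced;
    }

    public static void main(String[] args) {
        // Here each element's value doubles as its (assumed) estimated cost.
        List<Integer> costs = List.of(1, 2, 3, 4, 5, 6, 7, 8);
        System.out.println(balance(costs, Integer::longValue, 2));
    }
}
```

The weak point is the cost estimate itself: if it is off, the slices are unbalanced again, which is exactly the part that is hard to predict.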