Why is combining Task.Run and plinq so slow?

Question

I found that combining Task.Run with plinq is extremely slow so I made a simple experiment:

int scale = 32;

Enumerable.Range( 0, scale ).AsParallel().ForAll( i => {
    Enumerable.Range( 0, scale ).AsParallel().ForAll( j =>
    {
        for ( int k = 0; k < scale; k++ ) { }
    } );
} );

plinq inside plinq works well, finished in 14 milliseconds

int scale = 32;

Task[] tasks = Enumerable.Range( 0, scale ).Select( i => Task.Run( async () =>
{
    Task[] _tasks = Enumerable.Range( 0, scale ).Select( j => Task.Run( () =>
    {
        for ( int k = 0; k < scale; k++ ) { }
    } ) ).ToArray();
    await Task.WhenAll( _tasks );
} ) ).ToArray();

await Task.WhenAll( tasks );

Task inside task also ends in 14 milliseconds, but if I replace Task.Run inside with plinq like this:

int scale = 32;

Task[] tasks = Enumerable.Range( 0, scale ).Select( i => Task.Run( () =>
{
    Enumerable.Range( 0, scale ).AsParallel().ForAll( j =>
    {
        for ( int k = 0; k < scale; k++ ) { }
    } );
} ) ).ToArray();

await Task.WhenAll( tasks );

It'll take 29 seconds to execute. Things get worse if scale variable is larger.

Can anyone explain what happened in this case?

Edit:

I made another experiment:

static async Task Main( string[] args )
{
    Stopwatch stopwatch = Stopwatch.StartNew();

    int scale = 8;

    Task[] tasks = Enumerable.Range( 0, scale ).Select( id => Run( scale, id ) ).ToArray();

    await Task.WhenAll( tasks );

    Console.WriteLine( $"ElapsedTime={stopwatch.ElapsedMilliseconds}ms" );
}

static Task Run( int scale, int id )
{
    return Task.Run( () =>
    {
        Enumerable.Range( 0, scale ).AsParallel().ForAll( j =>
        {
            for ( int k = 0; k < scale; k++ )
            {

            }

            Console.WriteLine( $"[{DateTimeOffset.Now.ToUnixTimeMilliseconds()}]Task {id} for loop {j} end" );
        } );
    } );
}

And here is part of the result:

[1557475215796]Task 0 for loop 6 end
[1557475215796]Task 0 for loop 7 end
[1557475216776]Task 4 for loop 0 end
[1557475216776]Task 4 for loop 1 end
[1557475216777]Task 4 for loop 2 end
[1557475216777]Task 4 for loop 3 end
[1557475216778]Task 4 for loop 4 end
[1557475216778]Task 4 for loop 5 end
[1557475216779]Task 4 for loop 6 end
[1557475216780]Task 4 for loop 7 end
[1557475217774]Task 5 for loop 0 end
[1557475217774]Task 5 for loop 1 end
[1557475217775]Task 5 for loop 2 end

Look into the timestamp between each tasks,you can find there is a mysterious 1000 milliseconds delay whenever it move to next task. I guess there is a mechanism in plinq or task that will pause for one second in some situation which slow down the process significantly.

Thanks to the explanation of @StephenCleary, now I understand the delay comes from the creation of thread. I tweak my experiment again and found that ForAll method will block the task until all other ForAll method in different tasks are completed.

static Task Run( int scale, int id )
{
    return Task.Run( () =>
    {
        Enumerable.Range( 0, scale ).AsParallel().ForAll( j =>
        {
            for ( int k = 0; k < scale; k++ )
            {

            }

            Console.WriteLine( $"[{DateTimeOffset.Now.ToUnixTimeMilliseconds()}]Task {id} for loop {j} end, thread count = {Process.GetCurrentProcess().Threads.Count}" );
        } );
        Console.WriteLine( $"[{DateTimeOffset.Now.ToUnixTimeMilliseconds()}]Task {id} finished" );
    } );
}

And the result is :

[1557478553656]Task 6 for loop 6 end, thread count = 19
[1557478553657]Task 6 for loop 7 end, thread count = 19
[1557478554645]Task 7 for loop 0 end, thread count = 20
[1557478554647]Task 7 for loop 1 end, thread count = 20
[1557478554649]Task 7 for loop 2 end, thread count = 20
[1557478554651]Task 7 for loop 3 end, thread count = 20
[1557478554653]Task 7 for loop 4 end, thread count = 20
[1557478554655]Task 7 for loop 5 end, thread count = 20
[1557478554657]Task 7 for loop 6 end, thread count = 20
[1557478554659]Task 7 for loop 7 end, thread count = 20
[1557478555644]Task 1 finished
[1557478555644]Task 0 finished
[1557478555644]Task 3 finished
[1557478555644]Task 2 finished
[1557478555644]Task 4 finished
[1557478555644]Task 6 finished
[1557478555644]Task 5 finished
[1557478555644]Task 7 finished

I expect that ForAll method should return immediately. Why is it block the task and the thread?

I think this will answer your question: https://stackoverflow.com/questions/19102966/parallel-foreach-vs-task-run-and-task-whenall. — vsarunov, May 10 '19 at 06:02
Difference here is due to using Async in case one set of `Task.Run`, which hives off the processing for the internal `Task.Run`, but while using `PLinq` inside `Task.Run`, it is not feasible to do the same hive off, since you are blocking each and every outer Task, though you may want to use Async wrapper over sync, but will be lot of effort, not really worth it — Mrinal Kamboj, May 10 '19 at 07:30
Neither is slow. This code is problematic though. First of all, using `Task.Run` inside `PLINQ` or `Parallel.ForEach` is pointless. All CPU cores are *already* used to process the input data. Any new task spawned using `Task.Run` will have to wait until the OS scheduler suspends the PLINQ threads and starts executing the new tasks — Panagiotis Kanavos, May 10 '19 at 08:56
In the first two snippets, the nested loops harm performance. A CPU can't run more tasks at the same time than there are cores, so extra tasks will have to wait to get scheduled. PLINQ and `Parallel.ForEach` are meant for data parallelism, where you want all cores to work in parallel to process a large amount of data. PLINQ does this by creating as many Tasks as there are cores, partitioning the input data and feeding each partition to the worker tasks. — Panagiotis Kanavos, May 10 '19 at 08:59
Finally, your `scale` is far too small. You're trying to run 32^3 additions which is just 32K iterations. What you measure in each case is the parallelization overhead of PLINQ and tasks, not actual processing times. You'd get more meaningful results if you created a *really* long sequence and tried to do something expensive, like summing square roots eg: `var total=Enumerable.Range(1,100000000).AsParallel().Select(i=>Math.Sqrt(i)).Sum();` — Panagiotis Kanavos, May 10 '19 at 09:04
In fact, if you *remove* `AsParallel()` from your original snippet, the time is just *1ms* — Panagiotis Kanavos, May 10 '19 at 09:10
@PanagiotisKanavos Thanks you for your comment. This is only an experiment code. I'm not really using nested forloop with plinq or Task in my code. The real problem I have is a plinq + task usage break some of our service and I want to know why is it happened in detail. — Leisen Chang, May 10 '19 at 09:13
@LeisenChang you have to post the *actual* code and explain the original problem then because *this* code doesn't demonstrate anything except thread and CPU starvation. Again, PLINQ is meant for *data parallelism*. If you use it for anything else, eg to spawn other tasks, you're causing problems. — Panagiotis Kanavos, May 10 '19 at 09:14
@PanagiotisKanavos We have solved the problem in our actual code. I'm more interesting in how thread and CPU starvation happen when I combine Task and plinq. — Leisen Chang, May 10 '19 at 09:16
@LeisenChang I explained why. PLINQ will create as many partitions as there are cores and process each one on a separate core. That's A Very Good Thing for data paralellism. If you try to spawn *other* threads though, they'll have to wait for their turn to run on the CPU — Panagiotis Kanavos, May 10 '19 at 09:18
@LeisenChang nesting PLINQ calls in the first snippet has this problem as well, which is why instead of 0-1ms, that snippet takes 3 ms on my machine — Panagiotis Kanavos, May 10 '19 at 09:19
@PanagiotisKanavos Do you know why all `ForAll` method return together in last case? — Leisen Chang, May 10 '19 at 09:20
@LeisenChang you misunderstood what Stephen Cleary wrote. You're still firing off tasks when the `PLINQ` calls use all cores. As for the timings, they're meaningless because the data is far too small. You're only measuring the runtime's overhead. The only thing this proves is that you shouldn't write such code — Panagiotis Kanavos, May 10 '19 at 09:24
Also, the conversion to milliseconds eliminates any differences. DateTime is counted in ticks. There are 10000 tickes per millisecond and the code you run finishes in *less* than 1 ms. You could try printing `DateTimeOffset.Now.Ticks` but again, the intervals are so small that different calls to `DateTime.Now` can easily return the *same* value — Panagiotis Kanavos, May 10 '19 at 09:28

score 3 · Accepted Answer · answered May 10 '19 at 07:25

Issue is clearly in your code, let's review various code snippets, especially the ones using Task, since the PLinq inside PLinq is straight forward that is pretty much using all possible Threads / cores to process as fast as possible, there would not be much context shifting since processing is in memory and quick. Infact PLinq itself will manage / control the number of Parallel invoke unlike Task.Run which is relatively independent.

Snippet 1

int scale = 32;

Task[] tasks = Enumerable.Range( 0, scale ).Select( i => Task.Run( async () =>
{
    Task[] _tasks = Enumerable.Range( 0, scale ).Select( j => Task.Run( () =>
    {
        for ( int k = 0; k < scale; k++ ) { }
    } ) ).ToArray();
    await Task.WhenAll( _tasks );
} ) ).ToArray();

await Task.WhenAll( tasks );

Here your complete processing is in memory and every outer Task, asynchronously schedule the internal loops, while Task itself will not block a thread and wait for internal Tasks to complete, so outer Task.Run, will be notified asynchronously when internal Task.Run are completed

Now what happens in the slower code, let's review

Snippet 2

int scale = 32;

Task[] tasks = Enumerable.Range( 0, scale ).Select( i => Task.Run( () =>
{
    Enumerable.Range( 0, scale ).AsParallel().ForAll( j =>
    {
        for ( int k = 0; k < scale; k++ ) { }
    } );
} ) ).ToArray();

await Task.WhenAll( tasks );

Here each Task.Run doesn't asynchronously hand over the request to inner PLinq call and what would happen is thread invoked by Task.Run would be blocked for inner PLinq to be completed and that is main source of issue out here, thus leading to high contention.

As explained above there's a substantial difference between how a Task.Run invoking PLinq is different from PLinq invoking PLinq, so the key lies in understanding how these different APIs work individually and what is the impact of combining them to work together as your code expects.

Thanks for your answer. I wonder why is it 29 seconds? It's 2000 times more than 14 milliseconds. Since there's only a simple for loop inside,this result sounds unreasonably high. I guess there is an unknown overhead inside plinq or Task in this case. — Leisen Chang, May 10 '19 at 07:44
Welcome @LeisenChang, this is a tricky part, since there are so many factors which may impact, like server configuration, let's assume you are using a 32 core system, in the slow one there will 32 tasks ready to hog them in the outside loop, but you can do some simple testing, try reducing the outer scale number and keep the internal same as 32, may be start with lower value like 2 then 4, 6, 8, this will help you know how and when performance starts degrading at a outer loop / Enumerable value is increased. Simply put blocking at outer level may exponentially impact the performance — Mrinal Kamboj, May 10 '19 at 07:59
I've edited my post. I found that there is a strange delay between the execution of each task. Do you know why? — Leisen Chang, May 10 '19 at 08:09
@LeisenChang: Because the thread pool has a limited thread injection rate. — Stephen Cleary, May 10 '19 at 08:43
@LeisenChang also because a CPU can't run more threads than there are cores at the same time. PLINQ/Parallel.Foreach use all available cores to process input data which means any new tasks will have to wait until the OS suspends the PLINQ threads and starts the tasks's threads — Panagiotis Kanavos, May 10 '19 at 09:08
@LeisenChang as mentioned by Stephen and Panagiotis, you are pretty much exploiting complete thread pool, which is not in your control as you try to use the Parallel APIs, especially the `Task.Run`, which is invoking a non async Task. Please try the test case that I have suggested, reduce the outer scale and see how it works then, this shall be inline with machine configuration. Also please note even the way async code is designed, if its not a genuine async and actual processing is in the memory, then Async benefits will not be there, since genuine Async processing shall be in background (IO) — Mrinal Kamboj, May 10 '19 at 09:23

Why is combining Task.Run and plinq so slow?

1 Answers1