
I have some DB operations to perform and I tried using PLINQ:

someCollection.AsParallel()
              .WithCancellation(token)
              .ForAll(element => ExecuteDbOperation(element));

And I notice it is quite slow compared to:

var tasks = someCollection.Select(element =>
                                    Task.Run(() => ExecuteDbOperation(element), token))
                          .ToList();

await Task.WhenAll(tasks);

I prefer the PLINQ syntax, but I am forced to use the second version for performance reasons.

Can someone explain the big difference in performance?

Stefano d'Antonio
  • How many elements does `someCollection` contain, and what is the average time for the `ExecuteDbOperation` operation? – Disappointed May 10 '16 at 09:02
  • I think 4-10, and I haven't stopwatch-ed ExecuteDbOperation, but it's an IO operation of course... It would also be nice to have some guidelines on thresholds for the usage of PLINQ/TPL, like: for 10-1000 elements and 1 s operations prefer PLINQ, for 1000-10000 elements and 50 ms operations prefer TPL, if anyone has benchmarked it... – Stefano d'Antonio May 10 '16 at 09:18
  • As a sidenote: be aware that using AsParallel blocks the UI thread whereas using await Task.WhenAll does not. – Peter Bons May 10 '16 at 13:52
  • Is making `ExecuteDbOperation` `async` an option for you? – svick May 10 '16 at 22:21
  • @PeterBons - what makes you say that? We don't know what this is being called from, and PLINQ shouldn't independently be blocking the UI thread. `AsParallel()` itself does no work, just creates a statement of intent. PLINQ doesn't create tasks on the current thread, it uses the 'Default' thread-pool scheduler (IIRC). Now, it _can_ take a while to run, but that's hardly "blocking" the thread... – Clockwork-Muse May 10 '16 at 22:49
  • @Clockwork-Muse I was referring to the fact that for all I know, AsParallel.ForAll() is not awaitable so the calling thread is blocked until all the work is done. Even though that work might be done on different threads. – Peter Bons May 11 '16 at 08:06
  • @svick we will be making it async at some point, but I'm still curious about this case. – Stefano d'Antonio May 11 '16 at 08:41
  • @PeterBons @Clockwork-Muse functionally there is no problem; I was curious about performance. If you want to make it async with AsParallel() you can just call await Task.Yield() before the query execution. – Stefano d'Antonio May 11 '16 at 08:45
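A minimal sketch of the Task.Yield() idea from the last comment (a hypothetical wrapper around the code from the question; whether the caller is truly unblocked depends on the current SynchronizationContext):

async Task ExecuteAllAsync()
{
    // Give control back to the caller before the blocking PLINQ query starts.
    await Task.Yield();

    someCollection.AsParallel()
                  .WithCancellation(token)
                  .ForAll(element => ExecuteDbOperation(element));
}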

3 Answers


My supposition is that this is because of the number of threads created.

In the first example, this number will be roughly equal to the number of cores in your computer. By contrast, the second example will create as many threads as someCollection has elements. For IO operations, that is generally more efficient.

The Microsoft guide "Patterns_of_Parallel_Programming_CSharp" recommends creating more threads than the default for IO operations (p. 33):

var addrs = new[] { addr1, addr2, ..., addrN };
var pings = from addr in addrs.AsParallel().WithDegreeOfParallelism(16)
            select new Ping().Send(addr);
Peter - Reinstate Monica
Disappointed
  • I was thinking about partitioning; it's a good test to specify the parallelism in the query and see how it goes, but in my case it should be only 4-10 elements running on a hyper-threaded 8-core machine, so I'm not completely sure. – Stefano d'Antonio May 10 '16 at 09:32
  • @Uno There isn't any partitioning in the second example. – Disappointed May 10 '16 at 09:52
  • The I/O thing is probably complicated. The ping example in the book works nicely because it's a distributed query (asked to many machines) with high latency. By contrast, single disk I/O may not profit much from many threads or even degrade: when reading many files, each of which is unfragmented, the I/O is maxed out with a single operation; but each thread switch may incur a head repositioning penalty. (Old school rotating disks assumed...) But similar effects may happen with a DB server which has to swap between different tables queried in parallel. – Peter - Reinstate Monica May 10 '16 at 10:22
  • @Disappointed Exactly, so the partitioning in this case might actually be bad for performance, but I pointed out that if PLINQ creates a thread for each logical processor, the number of threads is likely to be the same as in the second example. Thanks for the link, it seems interesting, I'll have a look later. – Stefano d'Antonio May 10 '16 at 13:51

Both PLINQ and Parallel.ForEach() were primarily designed to deal with CPU-bound workloads, which is why they don't work so well for your IO-bound work. For some specific IO-bound work, there is an optimal degree of parallelism, but it doesn't depend on the number of CPU cores, while the degree of parallelism in PLINQ and Parallel.ForEach() does depend on the number of CPU cores, to a greater or lesser degree.

Specifically, the way PLINQ works is to use a fixed number of Tasks, by default based on the number of CPU cores on your computer. This is meant to work well for a chain of PLINQ methods. But it seems this number is smaller than the ideal degree of parallelism for your work.

On the other hand, Parallel.ForEach() delegates the decision of how many Tasks to run to the ThreadPool. And as long as its threads are blocked, the ThreadPool slowly keeps adding more. The result is that, over time, Parallel.ForEach() might get closer to the ideal degree of parallelism.

The right solution is to figure out the right degree of parallelism for your work by measuring, and then use that.
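For instance, a rough sketch of such a measurement (the degrees tried here are arbitrary, and someCollection/ExecuteDbOperation stand in for your real code):

foreach (var dop in new[] { 2, 4, 8, 16, 32 })
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    someCollection.AsParallel()
                  .WithDegreeOfParallelism(dop)
                  .ForAll(element => ExecuteDbOperation(element));
    sw.Stop();
    Console.WriteLine($"DOP {dop}: {sw.ElapsedMilliseconds} ms");
}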

Ideally, you would make your code asynchronous and then use some approach to limit the degree of parallelism for async code.
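For example, assuming a hypothetical async version ExecuteDbOperationAsync existed, a SemaphoreSlim could cap the concurrency (a sketch; 8 is a placeholder for whatever value your measurements suggest):

var throttler = new SemaphoreSlim(8);

var tasks = someCollection.Select(async element =>
{
    // Wait for a free slot before starting this element's work.
    await throttler.WaitAsync(token);
    try
    {
        // Hypothetical async version of the operation from the question.
        await ExecuteDbOperationAsync(element, token);
    }
    finally
    {
        throttler.Release();
    }
}).ToList();

await Task.WhenAll(tasks);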

Since you said you can't do that (yet), I think a decent solution might be to avoid the ThreadPool and run your work on dedicated threads (you can create those by using Task.Factory.StartNew() with TaskCreationOptions.LongRunning).
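A sketch of that, reusing the shape of the code from the question:

var tasks = someCollection
    .Select(element => Task.Factory.StartNew(
        () => ExecuteDbOperation(element),
        token,
        TaskCreationOptions.LongRunning, // hint to give this task its own thread
        TaskScheduler.Default))
    .ToList();

await Task.WhenAll(tasks);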

If you're okay with sticking to the ThreadPool, another solution would be to use PLINQ ForAll(), but also call WithDegreeOfParallelism().
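That keeps the syntax from the question; only the explicit degree of parallelism is new (16 is just an illustrative value, use whatever you measured):

someCollection.AsParallel()
              .WithCancellation(token)
              .WithDegreeOfParallelism(16)
              .ForAll(element => ExecuteDbOperation(element));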

svick

I believe that if you have, let's say, more than 10000 elements, it will be better to use PLINQ, because it won't create a task for each element of your collection: it uses a Partitioner inside. Each task creation has some data-initialization overhead. The Partitioner will create only as many tasks as are optimal for the currently available cores, and it will re-use those tasks with new data to process. You can read more about it here: http://blogs.msdn.com/b/pfxteam/archive/2009/05/28/9648672.aspx
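If you want to hand PLINQ a partitioner explicitly, Partitioner.Create supports that. A sketch, where someList and ExecuteDbOperation are placeholders for your own IList<T> and operation:

using System.Collections.Concurrent;

// loadBalance: true hands out elements in small chunks on demand,
// which suits work items of uneven duration.
Partitioner.Create(someList, loadBalance: true)
           .AsParallel()
           .ForAll(element => ExecuteDbOperation(element));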

Jacob Sobus
  • Well, sort of. It creates the tasks regardless, but there's no guarantee that they'll actually get used/fed data. – Clockwork-Muse May 10 '16 at 11:09
  • I'm not an expert, but will this work better overall for more elements, @Clockwork-Muse? – Jacob Sobus May 10 '16 at 19:29
  • The answer for a lot of this stuff is "it depends", and requires profiling. Even if you're creating thousands of tasks, almost all runtime, GC'd languages have good performance for creating small objects (because it's a common thing). You also get into weird situations because data locality ends up mattering in strange ways when you hit the hardware level. – Clockwork-Muse May 10 '16 at 22:42
  • True, so the valid answer is: "it depends - profile and decide" ;) – Jacob Sobus May 11 '16 at 07:43