
Why are ForEach and Parallel.For much slower than a for loop for this example?

The program is doing a bitwise-AND of two arrays and counting the number of 1-bits.

// For approach
for (int i = 0; i < n; i++) {
    countOnes += Int32.PopCount(x[i] & y[i]);
}

// ForEach approach
foreach (var (a, b) in x.Zip(y)) {
    countOnesZip += Int32.PopCount(a & b);
}

// Parallel.For approach
Parallel.For(0, n, i =>
{
    Interlocked.Add(ref countOnesPar, Int32.PopCount(x[i] & y[i]));
});

I expected the foreach version to be a little slower, but I did not expect Parallel.For to be.

https://dotnetfiddle.net/aot6Z2

Results:
Processed 100,000,000 bits
Elapsed time (For):            11ms Count: 24,216,440
Elapsed time (ForEach):        96ms Count: 24,216,440
Elapsed time (Parallel.For):  107ms Count: 24,216,440

I did see this other question, but in that case the slowdown was not an order of magnitude.

EDIT: As hinted by @Blindy below, it seems Interlocked is the bottleneck. Also see this.
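
For reference, here is a minimal sketch (mine, not in the fiddle and not benchmarked) of the `Parallel.For` overload with a thread-local accumulator (`localInit`/`localFinally`), so that `Interlocked.Add` runs once per task instead of once per element:

// Sketch only: accumulate into a thread-local count, one Interlocked.Add per task.
Parallel.For(0, n,
    () => 0,                                                   // localInit: per-task count
    (i, state, local) => local + Int32.PopCount(x[i] & y[i]),  // body: no shared writes
    local => Interlocked.Add(ref countOnesPar, local));        // localFinally: single add per task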

EDIT 2: Thank you for the comments on chunkiness. Here is a version using a partitioner and a version using Tasks. The Tasks version is more in line with the performance I expected.

EDIT 3: Here is a fiddle that includes the Task/Partitioner versions:

https://dotnetfiddle.net/LzNmzL

Tasks

var split = new int[Environment.ProcessorCount];
var tasks = new Task[split.Length];
var take = (int)Math.Ceiling((double)x.Length / tasks.Length); // round up so every element is covered by some chunk

for (var i = 0; i < tasks.Length; i++)
{
    var taskNumber = i;
    tasks[i] = Task.Run(() =>
    {
        var localCount = 0; 
        for (var index = taskNumber * take; index < (taskNumber + 1) * take && index < x.Length; index++)
        {
            localCount += int.PopCount(x[index] & y[index]);
        }

        split[taskNumber] = localCount;
    });
}

await Task.WhenAll(tasks);
countOnesPar = split.Sum();

Partitioner

Partitioner<Tuple<int, int>> partitioner = Partitioner.Create(0, n);
Parallel.ForEach(partitioner, range =>
{
    int startIndex = range.Item1;
    int endIndex = range.Item2;
    int localCountOnes = 0;

    for (int i = startIndex; i < endIndex; i++)
    {
        localCountOnes += Int32.PopCount(x[i] & y[i]);
    }

    Interlocked.Add(ref countOnesPar, localCountOnes);
});

@Marc Gravell I'll mark it as the accepted answer if you re-post your comment as an answer.

vandre
    "The program is doing a bitwise-AND of two arrays" - this screams SIMD, no? i.e. `Vector` ? there are ways to switch from a array/span of int to a span of `Vector` (via `MemoryMarshal.Cast`); as for the perf: parallelism has overhead; you normally want to make it *chunky*, i.e. instead of having `N` pieces of work, have `8` or `16` or something, where each chunk knows to process `N/8` (taking into account rounding / last segment / etc) – Marc Gravell Apr 13 '23 at 16:55
  • Funny, I expected `Parallel.For` to be much slower. What possibly makes you think that synchronizing multiple threads that do a few assembly instructions each is in any way, shape or form cheap? – Blindy Apr 13 '23 at 16:56
  • @Blindy when I implemented this in Rust, the parallel version using Rayon was [faster](https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=8d661645651939a3e34bebd6ace65ade) – vandre Apr 13 '23 at 17:03
  • Try using Linq for the iterative version as well, to match your Rust code. Just because you can write incredibly bad iterative code doesn't make parallel code better for this. – Blindy Apr 13 '23 at 17:08
  • Also note that the Rust parallel version doesn't use interlocked gathering, like your C# version does, making it fundamentally broken to boot. So again, I wouldn't compare correct code with just.. that.. to draw any kind of conclusions about what should otherwise be blindingly obvious. – Blindy Apr 13 '23 at 17:10
  • @Blindy fair point on using Linq for the iterative version. As for Rust, it is my understanding that it is correct code. The Rayon library authors state that Rayon will not introduce data races. The README file has a very similar [example](https://github.com/rayon-rs/rayon#parallel-iterators-and-more) that is doing a sum of squares, and it is not annotating any sort of atomicity. – vandre Apr 13 '23 at 17:24
  • As mentioned by @Marc Gravell, limit the parallel processing, i.e. the number of threads (equal to n in your code), to N (N = number_of_cores_of_your_PC * number_of_threads_per_core). If n exceeds N, the performance will decrease because parallelism has overhead. – Graffito Apr 13 '23 at 18:12
  • also, failed interlocked operations have overhead: if you have multiple threads attempting interlocked increments *on the same value*: catastrophic retry territory; that's why if you make it chunky, you'd keep a per-thread counter locally, and then only do the interlocked add (with that value) once per thread – Marc Gravell Apr 13 '23 at 18:51
  • @Graffito "_`N = number_of_cores * number_of_threads_per_core`_" - in x64 environments the `number_of_threads_per_core` value is invariably always going to be `2` (unless the user disabled SMT/HyperThreading) - but in SMT threads aren't fungible (e.g. in Intel HT, the 2nd thread [isn't as beefy](https://stackoverflow.com/questions/57872223) as the 1st thread; [Intel said it's around 30%](https://web.archive.org/web/20121019025809/http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf)) - so _as always_: profile/benchmark first. – Dai Apr 13 '23 at 19:30
  • @Graffito Also, modern (2021+) CPUs have heterogeneous cores, such as [ARM and Apple](https://en.wikipedia.org/wiki/ARM_big.LITTLE) ([and now Intel](https://superuser.com/q/1677692/41739)) having "Efficiency cores" and "Performance cores" (remember the PS3 [Cell](https://en.wikipedia.org/wiki/Cell_(processor)) processor?) - in which case one should _absolutely not_ use `Parallel.For` or `Task.Run` without fully introspecting available threading hardware (does .NET expose info re: per-core threading characteristics?). – Dai Apr 13 '23 at 19:35
  • @Dai: I was not aware of heterogeneous cores. I have never seen anything in .NET related to threading characteristics except the number of cores. In practice, when you split the data to be processed into small sets (e.g. 200 items out of a total of 10,000), the difference in efficiency of the cores doesn't really impact global performance. – Graffito Apr 14 '23 at 09:32
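
For reference, here is a rough, untested sketch of the `MemoryMarshal.Cast` / `Vector<int>` idea from @Marc Gravell's comment above. Only the AND is vectorized; `Vector<T>` has no popcount, so `PopCount` still runs one lane at a time (a true SIMD popcount would need hardware intrinsics):

// Needs: using System.Numerics; using System.Runtime.InteropServices;
// Reinterpret the int arrays as spans of Vector<int> and AND them SIMD-wide.
var vx = MemoryMarshal.Cast<int, Vector<int>>(x.AsSpan());
var vy = MemoryMarshal.Cast<int, Vector<int>>(y.AsSpan());

int countOnesSimd = 0;
for (int v = 0; v < vx.Length; v++)
{
    Vector<int> anded = vx[v] & vy[v];
    for (int lane = 0; lane < Vector<int>.Count; lane++)
        countOnesSimd += Int32.PopCount(anded[lane]);
}

// Tail: elements that did not fill a whole Vector<int>.
for (int i = vx.Length * Vector<int>.Count; i < x.Length; i++)
    countOnesSimd += Int32.PopCount(x[i] & y[i]);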
