Why are ForEach and Parallel.For much slower than a for loop for this example?
The program is doing a bitwise-AND of two arrays and counting the number of 1-bits.
//For Approach
for (int i = 0; i < n; i++) {
countOnes += Int32.PopCount( x[i] & y[i]);
}
//Foreach approach
foreach( var (a,b) in x.Zip(y) ) {
countOnesZip += Int32.PopCount( a & b);
}
//Parallel approach
Parallel.For(0, n, i =>
{
Interlocked.Add(ref countOnesPar, Int32.PopCount(x[i] & y[i]));
});
I expected that the ForEach would be a little bit slower, but not the Parallel.For.
https://dotnetfiddle.net/aot6Z2
Results:
Processed 100,000,000 bits
Elapsed time (For): 11ms Count: 24,216,440
Elapsed time (ForEach): 96ms Count: 24,216,440
Elapsed time (Parallel.For): 107ms Count: 24,216,440
I did see this other question, but in that instance the difference was not an order of magnitude slower.
EDIT: As hinted by @Blindy below it seems Interlocked is a bottleneck. Also see this
EDIT 2: Thank you for the comments on chunkyness. Here is a version using a partitioner and a version using Tasks.. The Tasks version is more in-line with the performance I expected
EDIT 3: Here is a fiddle that includes the Task/Partitioner Versions
https://dotnetfiddle.net/LzNmzL
Tasks
var split = new int[Environment.ProcessorCount];
var tasks = new Task[split.Length];
var take = (int)Math.Round((double)x.Length / tasks.Length, MidpointRounding.AwayFromZero);
for (var i = 0; i < tasks.Length; i++)
{
var taskNumber = i;
tasks[i] = Task.Run(() =>
{
var localCount = 0;
for (var index = taskNumber * take; index < (taskNumber + 1) * take && index < x.Length; index++)
{
localCount += int.PopCount(x[index] & y[index]);
}
split[taskNumber] = localCount;
});
}
await Task.WhenAll(tasks);
countOnesPar = split.Sum();
Partitioner
Partitioner<Tuple<int, int>> partitioner = Partitioner.Create(0, n);
Parallel.ForEach(partitioner, range =>
{
int startIndex = range.Item1;
int endIndex = range.Item2;
int localCountOnes = 0;
for (int i = startIndex; i < endIndex; i++)
{
localCountOnes += Int32.PopCount(x[i] & y[i]);
}
Interlocked.Add(ref countOnesPar, localCountOnes);
});
@Marc Gravell I'll mark your comment as the correct answer if you re-post it.