
I am currently optimizing data processing logic for parallel execution. I have noticed that as the core count increases, data processing performance does not necessarily increase the way I expect it to.

Here is the test code:

Console.WriteLine($"{DateTime.Now}: Data processing start");
double lastElapsedMs = 0;
for (int i = 1; i <= Environment.ProcessorCount; i++)
{
    var watch = System.Diagnostics.Stopwatch.StartNew();
    ProccessData(i); // main processing method
    watch.Stop();
    double elapsedMs = watch.ElapsedMilliseconds;
    Console.WriteLine($"{DateTime.Now}: Core count: {i}, Elapsed: {elapsedMs}ms");
    lastElapsedMs = elapsedMs;
}
Console.WriteLine($"{DateTime.Now}: Data processing end");

and

public static void ProccessData(int coreCount)
{
    // First part is data preparation.
    // splitting 1 collection into smaller chunks, depending on core count
    ////////////////
    // combinations = collection of data
    var length = combinations.Length;
    int chuncSize = length / coreCount;
    int[][][] chunked = new int[coreCount][][];
    for (int i = 0; i < coreCount; i++)
    {
        int skip = i * chuncSize;
        int take = chuncSize;

        int diff = (length - skip) - take;
        if (diff < chuncSize)
        {
            take = take + diff;
        }

        var sub = combinations.Skip(skip).Take(take).ToArray();
        chunked[i] = sub.ToArray();
    }

    // Second part is iteration: one chunk of data is processed per core.
    ////////////////
    Parallel.For(0, coreCount, new ParallelOptions() { MaxDegreeOfParallelism = coreCount }, (chunkIndex, state) =>
    {
        var chunk = chunked[chunkIndex];
        int chunkLength = chunk.Length;

        // iterate over the data inside the chunk
        for (int idx = 0; idx < chunkLength; idx++)
        {
            // additional processing logic here for single data
        }
    });
}

The results are following:

[Results screenshot: elapsed time for each core count, actual vs. ideal]

As you can see from the results, using 2 cores instead of 1 gives an almost ideal performance increase (given that 1 core runs at 4700 MHz, while 2 cores run at 4600 MHz each).

After that, when the data was processed in parallel on 3 cores, I expected a performance increase of 33% compared to the 2-core run. The actual increase is 21.62%.

Next, as the core count increases further, the degradation of the "parallel" execution performance keeps growing.

In the end, with the 12-core results, the actual time is more than twice the ideal one (96442 ms vs 39610 ms)!

I certainly did not expect the difference to be this big. I have an Intel 8700K processor: 6 physical cores with Hyper-Threading, 12 logical threads in total. The turbo clocks are 1C 4700 MHz, 2C 4600, 3C 4500, 4C 4400, 5C 4400, 6C 4300.
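
For comparison, a frequency-adjusted "ideal" time can be estimated roughly like this (just a sketch; the helper is made up, and the turbo table is the one listed above):

    // Rough best-case estimate: scale the single-core time by the core count
    // and by the turbo frequency drop at that core count (illustrative only).
    static double IdealElapsedMs(double singleCoreMs, int activeCores, double[] turboMhz)
    {
        // turboMhz[0] = 1-core turbo, turboMhz[activeCores - 1] = turbo with that many cores busy
        double frequencyFactor = turboMhz[activeCores - 1] / turboMhz[0];
        return singleCoreMs / (activeCores * frequencyFactor);
    }

    // Usage with the 8700K turbo bins from above (physical cores only, MHz):
    // var turbo = new[] { 4700.0, 4600, 4500, 4400, 4400, 4300 };
    // double ideal2 = IdealElapsedMs(singleCoreMs, 2, turbo); // ≈ singleCoreMs / 1.96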

If it matters - I have done additional observations in Core-temp:

  • when 1-core processing was running - 1 out of 6 cores was 50% busy
  • when 2-core processing was running - 2 out of 6 cores were 50% busy
  • when 3-core processing was running - 3 out of 6 cores were 50% busy
  • when 4-core processing was running - 4 out of 6 cores were 50% busy
  • when 5-core processing was running - 5 out of 6 cores were 50% busy
  • when 6-core processing was running - 6 out of 6 cores were 50% busy
  • when 7-core processing was running - 5 out of 6 cores were 50% busy, 1 core was at 100%
  • when 8-core processing was running - 4 out of 6 cores were 50% busy, 2 cores were at 100%
  • when 9-core processing was running - 3 out of 6 cores were 50% busy, 3 cores were at 100%
  • when 10-core processing was running - 2 out of 6 cores were 50% busy, 4 cores were at 100%
  • when 11-core processing was running - 1 out of 6 cores was 50% busy, 5 cores were at 100%
  • when 12-core processing was running - all 6 cores were at 100%

I can certainly see that the end result should not be as fast as the ideal one, because the frequency per core decreases, but still: is there a good explanation for why my code performs so badly at 12 cores? Is this a general situation on every machine, or is it a limitation of my PC?

.NET Core 2 was used for the tests.

Edit: Sorry, I forgot to mention that the data chunking can be optimized, since I wrote it as a draft solution. Nevertheless, the splitting is done within a second or so, so it adds at most 1000-2000 ms to the measured execution time.

Edit 2: I have just gotten rid of all the chunking logic and removed the `MaxDegreeOfParallelism` property. The data is processed as is, in parallel. The execution time is now 94196 ms, which is basically the same time as before, excluding the chunking time. It seems .NET is smart enough to chunk the data at runtime, so the additional code is not necessary unless I want to limit the number of cores used. The fact is, however, that this did not notably increase the performance. I am leaning towards the "Amdahl's law" explanation, since nothing I have done has increased the performance beyond the margin of error.
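
For reference, the simplified version is roughly this (a sketch; `ProcessSingle` stands in for the per-item logic not shown here):

    // Let the TPL partition 'combinations' itself instead of chunking manually.
    // Uses Parallel.ForEach from System.Threading.Tasks.
    public static void ProcessDataSimple()
    {
        Parallel.ForEach(combinations, combination =>
        {
            // additional processing logic here for a single item
            ProcessSingle(combination); // placeholder for the real work
        });
    }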

Alex
  • The first part (data preparation) is the same however many cores you use (you don't parallelize it), but you include it in your measurements. So if it takes non-negligible time, it provides a lower bound on how fast the whole thing can run, even with an infinite number of cores. – Evk Apr 12 '18 at 11:00
  • 1
    You have some other stuff, noticably the for loop with two `.ToArray()` calls, outside the Parallel loop. Those aren't free. With unlimited cores you would eventually be measuring that part. Ahmdal indeed. – H H Apr 12 '18 at 11:00
  • Yes, sorry, I should have mentioned it. Data chunking takes a split second; the majority of the execution time is spent in the parallel loop. I agree that the chunking logic is not the best. Another fact that may be of importance is that there are 65 million combinations (which are split between chunks), but as I mentioned, data preparation is done within about a second, so it's 1000 ms added to the result. – Alex Apr 12 '18 at 11:04
  • And does the processing logic for a single item always take the same amount of time (it does not depend on which item it is)? – Evk Apr 12 '18 at 11:14
  • Because if not - maybe some threads finish their work considerably earlier than others, so not all of the work has been done on 12 threads. Say it started on 12 threads, 6 of them got easy work (short combinations?) and finished faster, and after that the rest of the work continued on 6 threads. – Evk Apr 12 '18 at 11:26
  • @Evk I just got rid of all the chunks to let .NET decide what data to process when. The new result is very close to my initial result, so I can conclude that in my initial version all of the data was processed during more or less the same time span on every core. Though the situation you have mentioned would be a valid scenario if I decided to push the whole data set that I have instead of just 1000 records :) Note taken. – Alex Apr 12 '18 at 12:19
  • 1
    Well, by default `Parallel.For` (and `ForEach`) will chunk data in exactly the same way you did, so the fact it produces results close to yours doesn't actually tell us much (I mean what I said above about uneven distribution of workload still might or might not apply). – Evk Apr 12 '18 at 12:23

3 Answers


Yeah, Amdahl's law. Performance speedup is never linear in the number of cores thrown at the problem.

Also reciprocals...
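
To put numbers on it: if a fraction s of the work is serial, the best possible speedup on n cores is 1 / (s + (1 - s) / n). A quick sketch of that formula (the 5% serial fraction is just an illustrative value, not measured from the question):

    // Amdahl's law: maximum speedup when a fraction 'serialFraction'
    // of the work cannot be parallelized.
    static double AmdahlSpeedup(double serialFraction, int cores)
        => 1.0 / (serialFraction + (1.0 - serialFraction) / cores);

    // With 5% serial work (illustrative):
    //   AmdahlSpeedup(0.05, 2)  ≈ 1.9x
    //   AmdahlSpeedup(0.05, 12) ≈ 7.7x   -- nowhere near 12x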

user234461

Your chunking code runs once no matter how many processors you have, yet its cost still depends on the number of processors.

Especially the Skip/Take part followed by the double ToArray() call seems very much in need of optimization. See How to copy part of an array to another array in C#? on how to copy an array without traversing the whole thing multiple times.

That should do a lot to bring your performance closer to what you'd expect. That said, the work of branching out and combining the results will always degrade the performance of parallel execution. "Maximum parallelism" is not exactly something to strive for. There is always a sweet spot where the parallelization outweighs the hit from preparing for it. You need to find that. Or let .NET take care of it by leaving out the manual override for the number of cores.
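
For example, each chunk can be copied in a single pass with Array.Copy (a sketch using the variables from your loop, not tested against your data):

    // One bulk copy per chunk instead of Skip/Take plus two ToArray calls.
    int[][] chunk = new int[take][];
    Array.Copy(combinations, skip, chunk, 0, take);
    chunked[i] = chunk;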

nvoigt
  • I have just got rid of all the chunking logic and removed the `MaxDegreeOfParallelism` property. The data is processed as is, in parallel. The execution time is now `94196ms`, which is basically the same time as before, excluding the chunking time. It seems `.net` is smart enough to chunk the data at runtime, so the additional code is not necessary unless I want to limit the number of cores used. The fact is, however, this did not notably increase the performance. I am leaning towards the "Amdahl's law" explanation. – Alex Apr 12 '18 at 12:11

As nvoigt pointed out, the chunking code is run on a single core and is slow. Look at these two lines:

    var sub = combinations.Skip(skip).Take(take).ToArray();
    chunked[i] = sub.ToArray();

Skip/Take inside a loop is a "Schlemiel the Painter" performance issue. Use a different method.

`sub` is already a perfectly good array, so why make another copy on the next line? Array allocation isn't zero-cost.

I think `ArraySegment` is a good fit for this problem instead of making array copies. At the least, you can `ToArray` an `ArraySegment` more efficiently than what you're currently doing.
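
Something along these lines (a sketch assuming `combinations` is an `int[][]` as in the question):

    // Describe each chunk as a view over the original array instead of copying it.
    var segments = new ArraySegment<int[]>[coreCount];
    for (int i = 0; i < coreCount; i++)
    {
        int skip = i * chunkSize;
        int take = (i == coreCount - 1) ? length - skip : chunkSize;
        segments[i] = new ArraySegment<int[]>(combinations, skip, take);
    }

    // Each segment can be enumerated directly inside the Parallel.For body,
    // or materialized once with segments[i].ToArray() (LINQ) if an array is needed.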

Amy B