I am currently optimizing data-processing logic for parallel execution. I have noticed that as the core count increases, data-processing performance does not necessarily increase the way I would expect.
Here is the test code:
Console.WriteLine($"{DateTime.Now}: Data processing start");
double lastElapsedMs = 0;
for (int i = 1; i <= Environment.ProcessorCount; i++)
{
    var watch = System.Diagnostics.Stopwatch.StartNew();
    ProcessData(i); // main processing method
    watch.Stop();
    double elapsedMs = watch.ElapsedMilliseconds;
    Console.WriteLine($"{DateTime.Now}: Core count: {i}, Elapsed: {elapsedMs}ms");
    lastElapsedMs = elapsedMs;
}
Console.WriteLine($"{DateTime.Now}: Data processing end");
and
public static void ProcessData(int coreCount)
{
    // First part is data preparation:
    // splitting 1 collection into smaller chunks, depending on core count.
    ////////////////
    // combinations = collection of data
    var length = combinations.Length;
    int chunkSize = length / coreCount;
    int[][][] chunked = new int[coreCount][][];
    for (int i = 0; i < coreCount; i++)
    {
        int skip = i * chunkSize;
        int take = chunkSize;
        int diff = (length - skip) - take;
        // The last chunk absorbs the remainder when length is not
        // evenly divisible by coreCount.
        if (diff < chunkSize)
        {
            take = take + diff;
        }
        // Skip/Take + ToArray already materializes a new array; no second copy needed.
        chunked[i] = combinations.Skip(skip).Take(take).ToArray();
    }
    // Second part is iteration: 1 chunk of data processed per core.
    ////////////////
    Parallel.For(0, coreCount, new ParallelOptions() { MaxDegreeOfParallelism = coreCount }, (chunkIndex, state) =>
    {
        var chunk = chunked[chunkIndex];
        int chunkLength = chunk.Length;
        // iterate data inside chunk
        for (int idx = 0; idx < chunkLength; idx++)
        {
            // additional processing logic here for single data
        }
    });
}
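For what it's worth, a range-based alternative to the manual Skip/Take chunking can lean on the TPL's built-in partitioner, which avoids materializing the intermediate chunk arrays. This is just a rough sketch, assuming combinations is the same field as above; ProcessDataRanges is an illustrative name, not part of the original code:

using System.Collections.Concurrent;
using System.Threading.Tasks;

public static void ProcessDataRanges(int coreCount)
{
    // The partitioner picks the range sizes itself;
    // MaxDegreeOfParallelism still caps the concurrency.
    var ranges = Partitioner.Create(0, combinations.Length);
    Parallel.ForEach(ranges,
        new ParallelOptions() { MaxDegreeOfParallelism = coreCount },
        range =>
        {
            // range.Item1 = inclusive start, range.Item2 = exclusive end
            for (int idx = range.Item1; idx < range.Item2; idx++)
            {
                var combination = combinations[idx];
                // additional processing logic here for single data
            }
        });
}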
The results are as follows:
As you can see from the result set, using 2 cores instead of 1 gives an almost ideal performance increase (given that 1 core runs at 4700 MHz, but 2 cores run at 4600 MHz each).
After that, when the data was supposed to be processed in parallel on 3 cores, I expected the performance to increase by 33% compared to the 2-core execution. The actual increase is 21.62%.
Next, as the core count increases, the degradation of "parallel" execution performance continues to grow.
In the end, with the 12-core results, the actual time is more than twice the ideal one (96442ms vs 39610ms)!
I certainly did not expect the difference to be this big. I have an Intel 8700K processor: 6 physical cores and 6 logical ones, 12 threads in total. In turbo mode, 1 core runs at 4700 MHz, 2C at 4600, 3C at 4500, 4C at 4400, 5C at 4400 and 6C at 4300.
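By "ideal" I mean perfect scaling corrected for the turbo frequency, roughly like the sketch below (a hypothetical helper, written only to illustrate the comparison, not part of the measurements):

// Hypothetical helper: the expected run time under perfect scaling,
// corrected for the lower turbo bin at higher core counts.
// It deliberately ignores Hyper-Threading, shared caches and memory bandwidth.
static double IdealTimeMs(double oneCoreTimeMs, int cores, double oneCoreMhz, double nCoreMhz)
{
    return oneCoreTimeMs / cores * (oneCoreMhz / nCoreMhz);
}
// e.g. for 2 cores: IdealTimeMs(t1, 2, 4700, 4600) == t1 / 2 * (4700.0 / 4600.0)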
If it matters, I have also made additional observations in Core Temp:
- when 1 core processing was running - 1 out of 6 cores was busy 50%
- when 2 core processing was running - 2 out of 6 cores were busy 50%
- when 3 core processing was running - 3 out of 6 cores were busy 50%
- when 4 core processing was running - 4 out of 6 cores were busy 50%
- when 5 core processing was running - 5 out of 6 cores were busy 50%
- when 6 core processing was running - 6 out of 6 cores were busy 50%
- when 7 core processing was running - 5 out of 6 cores were busy 50%, 1 core 100%
- when 8 core processing was running - 4 out of 6 cores were busy 50%, 2 cores 100%
- when 9 core processing was running - 3 out of 6 cores were busy 50%, 3 cores 100%
- when 10 core processing was running - 2 out of 6 cores were busy 50%, 4 cores 100%
- when 11 core processing was running - 1 out of 6 cores was busy 50%, 5 cores 100%
- when 12 core processing was running - all 6 cores at 100%
I can certainly see that the end result should not be as performant as the ideal one, because the per-core frequency decreases, but still... Is there a good explanation for why my code performs so badly at 12 cores? Is this the general situation on every machine, or is it perhaps a limitation of my PC?
.NET Core 2 was used for the tests.
Edit: Sorry, I forgot to mention that the data chunking can be optimized, since I wrote it as a draft solution. Nevertheless, the splitting finishes within about 1 second, so it adds at most 1000-2000ms to the measured execution time.
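(The preparation cost can be isolated with the same Stopwatch approach as in the test harness; a sketch, with the chunking loop elided:)

var chunkWatch = System.Diagnostics.Stopwatch.StartNew();
// ... the chunking loop from ProcessData goes here ...
chunkWatch.Stop();
Console.WriteLine($"{DateTime.Now}: Chunking took {chunkWatch.ElapsedMilliseconds}ms");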
Edit2: I have just gotten rid of all the chunking logic and removed the MaxDegreeOfParallelism property. The data is processed as is, in parallel. The execution time is now 94196ms, which is basically the same as before once the chunking time is excluded. It seems .NET is smart enough to partition the data at runtime, so the extra code is unnecessary unless I want to limit the number of cores used. The fact is, however, that this did not notably increase performance. I am leaning towards the "Amdahl's law" explanation, since nothing I have done has improved performance beyond the margin of error.
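For reference, the simplified version amounts to something like this (a sketch; without ParallelOptions, Parallel.ForEach picks the degree of parallelism and partitions the input itself):

Parallel.ForEach(combinations, combination =>
{
    // additional processing logic here for single data
});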