
For some operations, Parallel scales well with the number of CPUs, but for other operations it does not.

Consider the code below: function1 gets roughly a 10x improvement from Parallel.For, while function2 gets only about a 3x improvement. Is this due to memory allocation, or perhaps the GC?

// function1: 100M square-root calls per invocation, pure CPU, no heap allocation
void function1(int v) {
    for (int i = 0; i < 100000000; i++) {
        var q = Math.Sqrt(v);
    }
}
// function2: 10M dictionary inserts per invocation, allocation-heavy
void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>();
    for (int i = 0; i < 10000000; i++) {
        dict.Add(i, v);
    }
}
var sw = new System.Diagnostics.Stopwatch();

var iterations = 100;

sw.Restart();
for (int v = 0; v < iterations; v++) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
for (int v = 0; v < iterations; v++) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));


The output on my machine:

function1   no parallel:  2 059,4 ms
function1 with parallel:    213,7 ms
function2   no parallel: 14 192,8 ms
function2      parallel:  4 491,1 ms

Environment:
Win 11, .NET 6.0, Release build
i9 12th gen, 16 cores, 24 logical processors, 32 GB DDR5


After more testing, it seems that memory allocation does not scale well across multiple threads. For example, if I change function2 to:

void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}

The result is:

function2   no parallel:   124,0 ms
function2      parallel:   402,4 ms

Is the conclusion that memory allocation does not scale well with multiple threads?

– user989489
• I'd strongly suggest re-building your test in [BenchmarkDotNet](https://benchmarkdotnet.org/articles/overview.html). You're not taking into account things like the fact that the first invocation of a function incurs the cost of JIT compilation, as one example of things that can trip you up (a minimal skeleton is sketched just after this comment thread). – Damien_The_Unbeliever Mar 11 '22 at 07:27
• I like the Q. Just tested it on a machine with far fewer cores (2): nearly no GC pressure from the CLR, Sqrt scales as expected, only Dictionary is non-linear. – Falco Alexander Mar 11 '22 at 08:05
• Assumption: the important difference between the functions is that function1 is pure CPU, while function2 uses memory, which is a shared resource. – Hans Kesting Mar 11 '22 at 09:28
• I tried adding an initial capacity to the dictionary; that improved the ratio a little, as I guess there are fewer allocations. Like this: `void function2(int v) { Dictionary<int, int> dict = new Dictionary<int, int>(10000000); for (int i = 0; i < 10000000; i++) { dict.Add(i, v); } }` This gave a gain of 4.13x using parallel: function2 no parallel: 7 358,0ms, function2 parallel: 1 780,4ms – user989489 Mar 11 '22 at 11:20
• Besides the advice on not skewing measured results (exclude the one-off JIT-compilation cost from the Stopwatch-ed section), it is also fair to mention that speedup (sure, not here, in these ultra-trivial functions) is very often skewed by unfair (not identical) code-execution conditions. As seen in the so-called "SUPERLINEAR speedup" argument, academia and vendors often take no care that the "parallel" mode enjoys nCores-times larger cache-hit chances than a baseline serial (one-core) mode, and thus keep comparing apples to oranges. – user3666197 Mar 12 '22 at 11:31
• The (repaired, as noted above) test measurements should be run about 1000x longer than just 1E2 outer loops. Then you start to see another set of adverse effects, coming from CPU thermal management: hopping from one core to another incurs ~3x higher cross-QPI/NUMA costs plus the loss of all speedy re-use of data expensively pre-fetched into the L1/L2/L3 cache lines, and these add-on costs are paid again every time thermal hopping moves the code-execution flow to another, colder core; a bit later, with no cooler core left, the core frequency goes down. Test 1E5+ loops. – user3666197 Mar 12 '22 at 11:41
• For more details on cross-QPI / NUMA memory-I/O flows and other limits, see https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory/33065382#33065382 and i9 details here: https://ark.intel.com/content/www/us/en/ark/products/134597/intel-core-i912900-processor-30m-cache-up-to-5-10-ghz.html – user3666197 Mar 12 '22 at 11:42
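Following up on the BenchmarkDotNet suggestion in the first comment: a minimal, hypothetical skeleton (class and method names are illustrative, not from the question) could look like this. BenchmarkDotNet handles warm-up, JIT, and statistical noise for you.

using System;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ParallelBench
{
    const int Iterations = 100;

    static void Function1(int v)
    {
        for (int i = 0; i < 100_000_000; i++)
            _ = Math.Sqrt(v);   // same pure-CPU body as the question's function1
    }

    [Benchmark(Baseline = true)]
    public void Sequential()
    {
        for (int v = 0; v < Iterations; v++) Function1(v);
    }

    [Benchmark]
    public void WithParallelFor() => Parallel.For(0, Iterations, Function1);
}

// Run from a Release build, outside the debugger:
// BenchmarkRunner.Run<ParallelBench>();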

3 Answers


tl;dr: Heap allocation contention.

Your first function is embarrassingly parallel: each thread can do its computation with almost no interaction with other threads, so it scales up nicely to multiple threads. huseyin tugrul buyukisik correctly pointed out that your first computation makes use of the non-shared, per-thread processor registers.

Your second function, when it preallocates the dictionary, is somewhat less embarrassingly parallel. Each thread's computation is independent of the others' except for the fact that they each use your machine's RAM subsystem. So you see some thread-to-thread contention at the hardware level as thread-level cached data is written to and read from the machine-level RAM.

Your second function, the one that does not preallocate memory, is not embarrassingly parallel. Why not? Each .Add() operation must allocate some data in the shared heap. That can't be done entirely in parallel, because all threads share the same heap: the allocations must be synchronized. The .NET runtime does a good job of parallelizing heap operations as much as possible, but it does not avoid at least some blocking of thread B when thread A allocates heap data. So the threads slow each other down.

Separate processes rather than separate threads are a good way to scale up workloads like your non-preallocating second function. Each process has its own heap.
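One mitigation worth testing before reaching for separate processes (a sketch, not something verified in this answer): .NET's server GC gives each logical processor its own heap segment and allocation context, which typically reduces exactly this kind of allocator contention.

// Enable server GC in the .csproj:
//   <PropertyGroup>
//     <ServerGarbageCollection>true</ServerGarbageCollection>
//   </PropertyGroup>
// or via the environment variable DOTNET_gcServer=1.
// This line only verifies which GC mode the process actually got:
using System;
using System.Runtime;

Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");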

– O. Jones
• With due respect, Sir, the first function even produces a constant result, so a smart enough compiler could cache it or, having detected that the loop has no side effects (except consuming energy to produce some heat), shortcut the whole loop away. Last but not least, escaping heap management by moving from threads to processes is not self-salvaging: performance will still headbang into the (never mentioned here) principal glass ceiling, memory I/O, the whole i9 CPU having just 2 (!) memory channels. – user3666197 Mar 12 '22 at 11:09
• For details about memory-I/O flow limits, see https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory/33065382#33065382 and i9 details here: https://ark.intel.com/content/www/us/en/ark/products/134597/intel-core-i912900-processor-30m-cache-up-to-5-10-ghz.html – user3666197 Mar 12 '22 at 11:12

The first function works in registers: more cores means more registers.

The second function works on memory: more cores only means more L1 cache, while RAM stays shared. A 10-million-element dataset certainly comes from RAM, as even L3 is not big enough. This assumes the JIT optimizes the allocations into reused buffers; if it doesn't, there is allocation overhead too. So you should reuse the dictionary on each new iteration instead of recreating it.

Also, you are saving data under an incremental integer index, so a simple array could work here, again reused between iterations. It would have a smaller memory footprint than a dictionary.
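A minimal sketch of both suggestions combined (the names are illustrative): since the keys are just 0..N-1, a flat array can replace the dictionary, and a ThreadLocal<int[]> gives each worker one buffer that is reused across iterations instead of being re-allocated.

using System.Threading;
using System.Threading.Tasks;

// One reusable 10M-element buffer per worker thread.
var buffer = new ThreadLocal<int[]>(() => new int[10_000_000]);

void Function2Array(int v)
{
    int[] local = buffer.Value!;    // allocated once per thread, then reused
    for (int i = 0; i < local.Length; i++)
        local[i] = v;               // the array index doubles as the "key"
}

Parallel.For(0, 100, Function2Array);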

– huseyin tugrul buyukisik

Parallel programming is not that simple. Using Parallel.For() or Parallel.ForEach() doesn't automatically make your program parallel. Parallel programming is not about calling some higher-level function (in any programming language) to make your code parallel; it is about preparing your code to be parallel.

Actually, you are not parallelizing anything at all in either function1 or function2. Going back to the foundations, the two basic types of parallelism are:

By task, where you split a complex task into smaller subtasks, each subtask to be processed at the same time on different cores, CPUs, or nodes (in a computer cluster)

By data, where you split a large data set into several smaller slices, each slice to be processed at the same time on different cores, CPUs, or nodes

Data parallelism is much trickier to achieve and does not always provide a real performance gain.

function1 is not really parallel; it's just a heavy piece of computation running concurrently (your cores are just disputing who will finish the 100M-iteration loop first). Using Parallel.For() you are just spawning this heavy function 100 times among your threads. A single for loop with Task.Run() inside would have nearly the same result.

If you run this on only one thread/core it will obviously take some time, and running on all your cores will be faster. No big mystery here, although it is concurrent code, not actually parallel code. Besides, invoking these tasks 100 times when you don't have that many CPU cores (or cluster nodes) makes no big difference; parallel/concurrent code is limited by the actual CPU cores in the machine (as we'll see in an example below).

Now about function2 and its interaction with the memory heap. Yes, in every modern language with a built-in GC, heap allocation is CPU-expensive. Garbage collection is one of the most expensive operations in a complex algorithm; in non-optimized code it can sometimes represent over 90% of CPU time.
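To make that claim concrete, here is one quick (illustrative; the numbers will vary by machine) way to see how hard the question's inner loop hits the allocator, using GC.GetTotalAllocatedBytes and GC.CollectionCount:

using System;
using System.Collections.Generic;

long bytesBefore = GC.GetTotalAllocatedBytes(precise: true);
int gen2Before = GC.CollectionCount(2);

var dict = new Dictionary<int, int>();
for (int i = 0; i < 10_000_000; i++)
    dict.Add(i, i);   // growth triggers repeated re-allocation of large internal arrays

long allocatedMb = (GC.GetTotalAllocatedBytes(precise: true) - bytesBefore) / (1024 * 1024);
Console.WriteLine($"Allocated: ~{allocatedMb} MB, Gen2 collections: {GC.CollectionCount(2) - gen2Before}");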

Let's analyze your function2

  1. It declares a new Dictionary in the function scope
  2. It populates this Dictionary with 10M items
  3. Outside that scope, you call function2 inside a Parallel.For with 100 iterations
  4. So 100 different scopes populate 100 different Dictionaries with 10M entries each
  5. There's no interaction between any of these scopes

As said before, this is not parallel programming, it's concurrent programming: you have 100 separate data chunks of 10M entries each, in scopes that don't interact with each other.

But there's a second factor too. Your function2 performs write operations (meaning you are adding, updating, or deleting something in a collection). That's fine if it's just a bunch of random data and you can accept some loss and inconsistency. But if you're handling real data and cannot allow any kind of loss or inconsistency, bad news: there is no true parallelism for writing to the same memory address (object reference). You will need a synchronization context, and that makes things much slower; those synchronized operations will always be concurrent, because while one thread is writing to the memory reference, the other threads must wait. In fact, using several threads to write data might make your code slower instead of faster, especially if the parallel operations are not CPU-bound.
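A common way to sidestep most of that write synchronization (a sketch, not part of the benchmark below): give each thread its own local dictionary via the localInit/localFinally overload of Parallel.For, and take the lock only once per thread when merging:

using System.Collections.Generic;
using System.Threading.Tasks;

var merged = new Dictionary<int, int>();
var gate = new object();

Parallel.For(0, 10_000_000,
    // localInit: one private dictionary per worker thread
    () => new Dictionary<int, int>(),
    // body: write with no locking at all
    (i, _, local) => { local[i] = i; return local; },
    // localFinally: synchronize once per thread to merge the partial results
    local => { lock (gate) foreach (var kv in local) merged[kv.Key] = kv.Value; });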

To get real gains from data parallelism, you must perform heavy computation on the partitioned data.

Let's check the code below, based on your methodology but with some changes:

using System.Diagnostics;

var operationSamples = 256;
var datasetSize = 100_000_000;
var computationDelay = 50;
var cpuCores = Environment.ProcessorCount;
Dictionary<int, int> datasetWithLoss = new(datasetSize);
Dictionary<int, int> dataset = new(datasetSize);
double result = 0;
Stopwatch sw = new();

ThreadPool.SetMinThreads(1, 1);

int HeavyComputation(int delay)
{
    int iterations = 0;
    var end = DateTime.Now + TimeSpan.FromMilliseconds(delay);
    while (DateTime.Now < end)
        iterations++;

    return iterations;
}

double SequentialMeanHeavyComputation(int maxMilliseconds, int samples = 64)
{
    double sum = 0;
    for (int i = 0; i < samples; i++)
        sum += HeavyComputation(maxMilliseconds);
    return sum / samples;
}

double ParallelMeanHeavyComputation(int maxMilliseconds, int samples = 64, int threads = 4)
{
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");

    var _lockKey = new object();
    double sum = 0;
    List<Action> tasks = new();

    for (int i = 0; i < samples; i++)
        tasks.Add(new Action(() =>
        {
            var result = HeavyComputation(maxMilliseconds);
            lock (_lockKey)
                sum += result;
        }));

    Parallel.Invoke(new ParallelOptions { MaxDegreeOfParallelism = threads }, tasks.ToArray());

    return sum / samples;
}

void SequentialDatasetPopulation(int size)
{
    for (int i = 0; i < size; i++)
        dataset.TryAdd(i, Guid.NewGuid().GetHashCode());
}

void ParallelDatasetPopulation(int size, int threads)
{
    var _lock = new object();
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");

    Parallel.For(0, size, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
    {
        var value = Guid.NewGuid().GetHashCode();

        lock (_lock)
            dataset.Add(i, value);
    });
}

double SequentialReadOnlyDataset()
{
    foreach (var x in dataset)
    {
        HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
    }

    return 0;
}

double ParallelReadOnlyDataset()
{
    Parallel.ForEach(dataset, x =>
    {
        HeavyComputation((int)Math.Tan(Math.Cbrt(Math.Log(Math.Log(x.Value)))) / 10);
    });
    return 0;
}

void ParallelDatasetWithLoss(int size, int threads)
{
    ThreadPool.SetMaxThreads(threads, threads);
    ThreadPool.GetAvailableThreads(out int workerThreads, out _);
    Console.WriteLine($"Available Threads: {workerThreads}");

    Parallel.For(0, size, new ParallelOptions { MaxDegreeOfParallelism = threads }, (i) =>
    {
        int value = Guid.NewGuid().GetHashCode();
        datasetWithLoss.Add(i, value);
    });
}

sw.Restart();
result = SequentialMeanHeavyComputation(computationDelay, operationSamples);
sw.Stop();
Console.WriteLine($"{nameof(SequentialMeanHeavyComputation)} sequential tasks: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: cpuCores);
sw.Stop();

Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (CPU threads match count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 100);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Higher thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
result = ParallelMeanHeavyComputation(computationDelay, operationSamples, threads: 4);
sw.Stop();
Console.WriteLine($"{nameof(ParallelMeanHeavyComputation)} parallel tasks (Lower thread count): {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
SequentialDatasetPopulation(datasetSize);
sw.Stop();
Console.WriteLine($"{nameof(SequentialDatasetPopulation)} sequential data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

dataset.Clear();

sw.Restart();
ParallelDatasetPopulation(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetPopulation)} parallel data population: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
ParallelDatasetWithLoss(datasetSize, cpuCores);
sw.Stop();
Console.WriteLine($"{nameof(ParalellDatasetWithLoss)} parallel data with loss: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");
Console.WriteLine($"Lossless dataset count: {dataset.Count}");
Console.WriteLine($"Dataset with loss: {datasetWithLoss.Count}\n");

datasetWithLoss.Clear();

sw.Restart();
SequentialReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(SequentialReadOnlyDataset)} sequential reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

sw.Restart();
ParallelReadOnlyDataset();
sw.Stop();
Console.WriteLine($"{nameof(ParallelReadOnlyDataset)} parallel reading operations: {sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms\n")}");

Console.Read();

Output:

SequentialMeanHeavyComputation sequential tasks: 12 800,7ms

Available Threads: 15
ParallelMeanHeavyComputation parallel tasks (CPU threads match count):  860,3ms

Available Threads: 99
ParallelMeanHeavyComputation parallel tasks (Higher thread count):  805,0ms

Available Threads: 3
ParallelMeanHeavyComputation parallel tasks (Lower thread count): 3 200,4ms

SequentialDatasetPopulation sequential data population: 9 072,4ms

Available Threads: 15
ParallelDatasetPopulation parallel data population: 23 420,0ms

Available Threads: 15
ParallelDatasetWithLoss parallel data with loss: 6 788,3ms

Lossless dataset count: 100000000
Dataset with loss: 77057456

SequentialReadOnlyDataset sequential reading operations: 20 371,0ms

ParallelReadOnlyDataset parallel reading operations: 3 020,6ms

[CPU usage chart: Red 25%, Orange 56%, Green 75%, Blue 100%]

With task parallelism we achieved roughly a 15x speedup using 100% of the CPU threads (in this example; it's not always like that).

In read-only data parallelism with some computation we achieved nearly a 6.7x speedup at 56% CPU usage (with less computation per item the difference would be smaller).

But trying to implement "real" data parallelism for writes made performance more than twice as slow, and the CPU couldn't use its full potential, sitting at only 25% usage due to the synchronization context.

Conclusions: using Parallel.For does not guarantee that your code will actually run in parallel, nor that it will run faster. It requires prior code/data preparation, deep analysis, benchmarks, and tuning.

Check also this Microsoft documentation about the villains of parallel code:

https://learn.microsoft.com/pt-br/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism

– Douglas Ferreira
• I don't like that you are trying to give new meaning to established concepts like "parallel" and "concurrent". Even if you are right and everyone else uses these terms incorrectly, correcting our terminology has nothing to do with the question asked here. Hence my downvote. – Theodor Zoulias Mar 13 '22 at 05:54
• The main point of the question is why the Parallel.For method is supposedly slow in such a scenario, as if the .NET implementation were the root cause, when actually it isn't. Other answers pointed out that the operations are bound by heap memory, but I showed that even with fewer heap allocation/free operations, the synchronization context might be the biggest villain in this approach to "parallelism". I wasn't trying to give anything a new meaning, just arguing that the API was not being used appropriately to reach its purpose effectively at its full potential. – Douglas Ferreira Mar 13 '22 at 06:11