
I have the following simple function:

#include <iostream>
#include <vector>
#include <cmath>

const int numRows = 10000;
const int numCols = 1000;

void myFunc(int tau, const std::vector<std::vector<double>>& X, const std::vector<std::vector<double>>& P, std::vector<double>& PAutocorrelation) {
    for (int t = 0; t < numRows - tau; ++t) {
        for (int i = 0; i < numCols; ++i) {
            for (int j = 0; j < numCols; ++j) {
                int diff = static_cast<int>(std::abs(X[t][i] - X[t + tau][j]));
                PAutocorrelation[diff] += P[t][i] * P[t + tau][j];
            }
        }
    }
}

The function is supposed to perform three nested loops over the dimensions 10000, 1000 and 1000. If I call this function once, it works fine and the runtime is acceptable. But I would like to call this function 64 times in parallel, as I have 64 CPU cores on my node. So I wrote the following code to do this (a continuation of the above):

int main() {
    std::vector<std::vector<double>> X(numRows, std::vector<double>(numCols, 1.0));
    std::vector<std::vector<double>> P(numRows, std::vector<double>(numCols, 1.0));
    std::vector<std::vector<double>> PAutocorrelations(63, std::vector<double>(11, 0.0));

    #pragma omp parallel for
    for (int tau = 1; tau < 64; ++tau) {
        myFunc(tau, X, P, PAutocorrelations[tau - 1]);
    }

    return 0;
}

The problem is that when I look at the CPU load in htop during execution, I can see that at the beginning all 64 CPUs are loaded to 100% in green, but after that (and for most of the time) only one CPU is loaded.

I think this is a memory problem, but I have a hard time understanding this, as the code does not involve copying any variables. So once X and P are allocated, I do not see the need or use for more memory. Could you indicate where the problem might be?

I have opened another question about basically the same problem in Python here: Overhead in embarrassingly parallel workload

Update

Below I show the result of perf stat -d ./main, which I am still trying to interpret. I see that backend cycles idle has the largest share (why?). When I compare perf record -e cache-misses ./main for a single run of the function and for 64 parallel runs, I see 400 cache-misses events in the single run and 3 million cache-misses events for the parallel runs.

[screenshot: output of perf stat -d ./main]

And below is the output of perf report after perf record -e cache-misses ./main. Here I can see 24% of the cache misses labeled as std::allocator<double>, which I am also not sure how to interpret.

[screenshot: output of perf report]

  • What do you mean by memory leak? There is no manual memory allocation in this piece of code (std::vector really should just clean up after itself). – Pepijn Kramer Aug 31 '23 at 09:40
  • Have you considered that your approach just totally kills caching? Sometimes you are just better off running one dataset at a time (instead of running 64 in parallel) and vectorizing or parallelizing that one computation. And even then, that calculation should be optimized to process blocks of data that are "close" in memory. – Pepijn Kramer Aug 31 '23 at 09:49
  • Besides the horrible performance because of the very inefficient code, it is easy to see that low `tau` values take longer to compute because they have more iterations in the outer loop. The default `schedule(static)` for OpenMP means that all low `tau` values are assigned to the first thread. That will then take longer than the others. Try `omp parallel for schedule(dynamic)` and see if it helps – Homer512 Aug 31 '23 at 10:04
  • Regarding general performance, "vector of vectors" is the absolute worst format for storing a matrix. There are [matrix libraries](https://eigen.tuxfamily.org/index.php?title=Main_Page) and [custom choices](https://stackoverflow.com/questions/2076624/c-matrix-class/2076668#2076668). Also relevant: [What is a "cache-friendly" code?](https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code/16699282#16699282). Also, use [FFTs for autocorrelation](https://dsp.stackexchange.com/questions/1919/efficiently-calculating-autocorrelation-using-ffts) – Homer512 Aug 31 '23 at 10:13
  • Thanks for the suggestions. I prefer understanding the problem here before going to approximate methods like FFTs. – YoussefMabrouk Aug 31 '23 at 10:20
  • The problem is neither the cache nor `std::vector` here. I explained why in my answer. The scheduling does not seem to be much of an issue on my machine. Besides, `schedule(dynamic)` cannot be faster because there are fewer iterations (63) than available threads (64). – Jérôme Richard Aug 31 '23 at 12:23
  • To put it shortly, store-forwarding and NUMA effects are very likely the key issues here. – Jérôme Richard Aug 31 '23 at 12:34
  • Thanks for the helpful comments. I have updated the question with more details, feel free to take a look. – YoussefMabrouk Aug 31 '23 at 16:28
  • I think 3e6 events is the number of samples gathered by perf while 18e6 is the number of misses (to be checked). That being said, both 3e6 and 18e6 events seem very small for a program running 41 seconds. On x86-64 CPUs, it means a miss throughput of `18e6*64/1e9/41=0.028 GiB/s`. The RAM of a 64-core CPU can typically reach >100 GiB/s. So this is <1% overhead. I expect this to be due to RFOs on such an architecture. – Jérôme Richard Aug 31 '23 at 16:37
  • The calls to `operator[]` are a bit scary. Did you enable optimizations? I expect GCC to optimize out such a function. What is your command line to build the program? – Jérôme Richard Aug 31 '23 at 16:38
  • I did not use optimizations, I build with `g++ -std=c++11 -fopenmp -o func func.cpp ` – YoussefMabrouk Aug 31 '23 at 16:44
  • Well... Optimizations are disabled by default... Enabling them is critical in benchmarks and in production usage. You need to add at least the `-O2` flag. `-O3` may be faster. Using `-march=native` can significantly speed up the program at the expense of making it incompatible with other machines (i.e. the binary should only be run on the target machine). If you care about compatibility with other machines, you can enable `-mavx -mfma` instead: it should run on >90% of machines (the most recent ones) while being nearly as fast as `-march=native` in your case. – Jérôme Richard Aug 31 '23 at 17:17
  • This smells an awful lot like https://stackoverflow.com/questions/15665433/c-mysteriously-huge-speedup-from-keeping-one-operand-in-a-register/15665995#15665995 in terms of optimization – UpAndAdam Aug 31 '23 at 19:41

1 Answer


I can see that at the beginning all 64 CPUs are loaded to 100% in green, but after that (and for most of the time) only one CPU is loaded.

This is because of several combined effects.

The first is a slight workload imbalance. Indeed, the first thread has more work than the last because of the outer loop of myFunc: for tau=1, the loop runs over numRows - tau = 10000 - 1 = 9999 values of t, while for tau=63 it runs over only 10000 - 63 = 9937. This is not a big difference, but it is sufficient for some threads to finish about a second before the others after a few minutes of computation.
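
For illustration, one can check this imbalance directly by timing each iteration. A minimal sketch, assuming the code from the question and that <omp.h> and <cstdio> are included:

#pragma omp parallel for
for (int tau = 1; tau < 64; ++tau) {
    const double start = omp_get_wtime();               // per-iteration wall-clock timing
    myFunc(tau, X, P, PAutocorrelations[tau - 1]);
    std::printf("tau=%2d: %.2f s on thread %d\n",
                tau, omp_get_wtime() - start, omp_get_thread_num());
}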

The second is false sharing. Indeed, while each thread writes to its own PAutocorrelations[tau - 1], the heap buffers of different vectors can end up on the same cache line: each 11-item vector takes 88 bytes, so it spans two 64-byte cache lines, and another vector can be allocated on the same cache line (allocations are generally made with an alignment of only 16 bytes). In practice, this effect may or may not happen depending on the target platform (it depends on the standard library).
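
If one wants to rule this effect out, a hypothetical option (the PaddedHistogram type and the results variable are only for illustration, and C++17 is assumed so that std::vector honors the over-alignment) is to pad each per-tau histogram to whole cache lines:

#include <array>

// Each histogram is aligned to 64 bytes and padded to 128 bytes (2 full cache
// lines), so two threads never write to the same cache line.
struct alignas(64) PaddedHistogram {
    std::array<double, 16> bins{};
};

std::vector<PaddedHistogram> results(63);   // one padded histogram per tau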

The third is the OS, the memory and the processor itself. The performance of the memory sub-system is quite unstable, and this cannot be avoided as long as you access memory. Here, the computation is mostly compute-bound (at least clearly on my machine), so the impact of memory is not significant [1]. However, the OS can interrupt your threads for a small fraction of time and re-schedule them (causing some cache misses due to the context switches). This is another source of performance instability between threads. Last but not least, there is a data-dependent read/store (on PAutocorrelation[diff]) which can make the computation time quite different between threads. This is, I think, the major issue of this code.

All this timing instability can cause some threads to finish before others. When a thread has completed its work, it waits for the others. This (required) final synchronization is typically the red part you can see in htop. Since there are 63 iterations to compute, only 63 threads can run (out of 64), and threads cannot steal the work of other threads in the current code.

One can try to load balance the work using tasks, but I expect the overhead of tasks to be higher than the workload imbalance.
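
For illustration only, here is a sketch of what that could look like: myFuncBlock is an assumed variant of myFunc restricted to t in [tStart, tEnd), the block size of 2048 is arbitrary, and <algorithm> is needed for std::min.

#pragma omp parallel
#pragma omp single
for (int tau = 1; tau < 64; ++tau) {
    for (int tStart = 0; tStart < numRows - tau; tStart += 2048) {
        int tEnd = std::min(tStart + 2048, numRows - tau);
        #pragma omp task firstprivate(tau, tStart, tEnd)
        {
            std::vector<double> local(11, 0.0);              // task-local histogram
            myFuncBlock(tau, tStart, tEnd, X, P, local);     // assumed helper: myFunc over [tStart, tEnd)
            #pragma omp critical
            for (std::size_t k = 0; k < local.size(); ++k)   // merge once per task
                PAutocorrelations[tau - 1][k] += local[k];
        }
    }
}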

I think this is a memory problem, but I have a hard time understanding this, as the code does not involve copying any variables. So once X and P are allocated, I do not see the need or use for more memory. Could you indicate where the problem might be?

I see no memory issue in the code. I also see no memory issue when I run your program on my (Linux) machine. Variables are not expected to be copied since they are shared by default in OpenMP parallel for directives.
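
To make this explicit (and have the compiler check it), one can spell out the data-sharing clauses; this merely restates the defaults:

// No copy of X or P is made: the threads only use references to the shared vectors.
#pragma omp parallel for default(none) shared(X, P, PAutocorrelations)
for (int tau = 1; tau < 64; ++tau) {
    myFunc(tau, X, P, PAutocorrelations[tau - 1]);
}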


Notes about memory

[1] Since your CPU has 64 cores, it might be subject to NUMA effects, especially on multi-socket nodes or with AMD CPUs. On my PC, I do not have (much of a) NUMA effect, which would explain why I cannot observe a big work imbalance. In this case, the memory needs to be initialized in parallel so that it is spread over the NUMA nodes. AFAIK, this is not (easily) possible with std::vector. An alternative solution is to use numactl to automatically interleave pages over memory banks. Such an automatic configuration is sub-optimal, but often better than not paying attention to the distribution of pages over NUMA nodes. This is, I think, the most probable factor impacting the load balancing of this code on server CPUs.
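
As an illustration of the first-touch idea (not directly applicable to std::vector, which value-initializes its storage serially), here is a hypothetical sketch with a flat buffer; the Xflat name is only for illustration and <memory> is assumed to be included:

// Allocate an uninitialized flat buffer and write it from a parallel loop, so
// each page is first touched (and thus placed) by some worker thread,
// spreading the data over the NUMA nodes.
std::unique_ptr<double[]> Xflat(new double[std::size_t(numRows) * numCols]);

#pragma omp parallel for schedule(static)
for (int t = 0; t < numRows; ++t)
    for (int i = 0; i < numCols; ++i)
        Xflat[std::size_t(t) * numCols + i] = 1.0;    // same values as in the question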


How to make the program faster

The comments provide several good guidelines to follow in order to make the program better. However, they miss the biggest performance issue in this specific code. Indeed:

  • cache matters, but the current code is compute-bound, so one should optimize the computation so as to make it less compute-bound, not less memory-bound.
  • vector<vector<double>> is generally inefficient, but compilers can generate pretty good code here because the accesses are contiguous and the last dimension is big (i.e. 1000). This means the overhead of vector<vector<double>> is negligible.

The biggest issue is the data-dependent read/store. Indeed, writing to a cache line takes a few cycles, and reading it also takes a few cycles. If you write to an array and then read it right back, you pay a significant L1-cache latency. Modern (x86-64) processors are very cleverly designed and can optimize this using a store-forwarding strategy: the processor keeps the stored value in a kind of temporary register, so a load needing that value does not have to wait for it to be stored and read again. However, this processor optimization is far from perfect: there is generally still an additional latency to pay due to store-forwarding. In fact, this is a big issue in this code because the store-forwarding is on the critical path of a very long chain of memory accesses. To make the program faster, we need to break the dependency chain. One way to do that is to use loop unrolling with independent temporary arrays.

Here is the optimized version:

static inline void computeCell(const std::vector<std::vector<double>>& X, const std::vector<std::vector<double>>& P, int tau, int t, int i, int j, std::vector<double>& arr)
{
    const int diff = static_cast<int>(std::abs(X[t][i] - X[t + tau][j]));
    arr[diff] += P[t][i] * P[t + tau][j];
}

void myFunc(int tau, const std::vector<std::vector<double>>& X, const std::vector<std::vector<double>>& P, std::vector<double>& PAutocorrelation)
{
    std::vector<double> tmp1(PAutocorrelation.size());
    std::vector<double> tmp2(PAutocorrelation.size());
    std::vector<double> tmp3(PAutocorrelation.size());
    std::vector<double> tmp4(PAutocorrelation.size());

    for (int t = 0; t < numRows - tau; ++t) {
        for (int i = 0; i < numCols; ++i) {
            for (int j = 0; j < numCols/4*4; j+=4) {
                computeCell(X, P, tau, t, i, j+0, tmp1);
                computeCell(X, P, tau, t, i, j+1, tmp2);
                computeCell(X, P, tau, t, i, j+2, tmp3);
                computeCell(X, P, tau, t, i, j+3, tmp4);
            }
            for (int j = numCols/4*4; j < numCols; ++j) {
                computeCell(X, P, tau, t, i, j, tmp1);
            }
        }
    }
    for (std::size_t i = 0; i < PAutocorrelation.size(); ++i)
        PAutocorrelation[i] += tmp1[i] + tmp2[i] + tmp3[i] + tmp4[i];
}

It is 2.5 times faster with both GCC and Clang on my machine, both in sequential and in parallel. This is because different cache lines are read/written and each array can be read/stored independently. The program can be made about 20% faster by using fixed-size, stack-allocated tmpXX arrays. This also has another benefit: stack-allocated arrays are guaranteed not to cause false sharing in this case. Thus, it can increase the scalability a bit. On my machine (i5-9600KF with 6 cores), the program is relatively fast and scales very well.
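
For reference, a sketch of that variant; the bin count of 11 is taken from the allocation in main and is assumed to bound diff:

void myFunc(int tau, const std::vector<std::vector<double>>& X,
            const std::vector<std::vector<double>>& P,
            std::vector<double>& PAutocorrelation)
{
    constexpr int bins = 11;  // assumed upper bound on diff, matching main()
    double tmp1[bins] = {}, tmp2[bins] = {}, tmp3[bins] = {}, tmp4[bins] = {};

    for (int t = 0; t < numRows - tau; ++t) {
        for (int i = 0; i < numCols; ++i) {
            const double xi = X[t][i];
            const double pi = P[t][i];
            const std::vector<double>& Xrow = X[t + tau];
            const std::vector<double>& Prow = P[t + tau];
            for (int j = 0; j < numCols / 4 * 4; j += 4) {
                tmp1[static_cast<int>(std::abs(xi - Xrow[j + 0]))] += pi * Prow[j + 0];
                tmp2[static_cast<int>(std::abs(xi - Xrow[j + 1]))] += pi * Prow[j + 1];
                tmp3[static_cast<int>(std::abs(xi - Xrow[j + 2]))] += pi * Prow[j + 2];
                tmp4[static_cast<int>(std::abs(xi - Xrow[j + 3]))] += pi * Prow[j + 3];
            }
            for (int j = numCols / 4 * 4; j < numCols; ++j)
                tmp1[static_cast<int>(std::abs(xi - Xrow[j]))] += pi * Prow[j];
        }
    }

    for (int k = 0; k < bins; ++k)
        PAutocorrelation[k] += tmp1[k] + tmp2[k] + tmp3[k] + tmp4[k];
}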

The key to better performance would be to remove this data-dependent read/store, but this does not seem possible here (it could be if diff were smaller).

PS: I used the following compilation flags: -O3 -fopenmp -mavx2 -mfma.
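
With GCC and the file name used in the comments above, that would be, for instance:

g++ -O3 -fopenmp -mavx2 -mfma -o func func.cpp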

Jérôme Richard
  • Thanks for the helpful answer. Unfortunately, the 2.5 factor is not reproducible on my machine. Nevertheless, thanks for the explanations, and I will still try to consider your suggestions. – YoussefMabrouk Aug 31 '23 at 14:52
  • If you enable optimizations, I think you should see a speed-up (unrolling only makes sense with optimizations). I never tested this strategy on AMD CPUs, but I think it should be worth it. – Jérôme Richard Aug 31 '23 at 17:20
  • With `-O3 -fopenmp -march=native`, the 2.5 factor for a single run of the function is reproducible on my machine. Most importantly, the decrease of CPU load during the parallel runs is not observed anymore in `htop`. The runtime of the parallel runs is now not more than 1.5 times that of a single run, and this factor seems to be stable when the array dimensions are increased. Without the loop unrolling approach suggested here, the overhead from parallel runs is close to 10 for the largest arrays I tested. So a speedup of a factor of 10 was gained. Thanks – YoussefMabrouk Sep 01 '23 at 07:10
  • But a follow-up question: why did you choose to fill 4 arrays? Where does the number 4 come from? – YoussefMabrouk Sep 01 '23 at 07:10
  • The theoretical explanation is a bit complex. In practice, you can just test multiple unrolling factors and choose the best one, which tends to be rather small (<=8). This is honestly what I did here. The factor can change from one architecture to another, but 4 tends to be good on most modern mainstream processors. – Jérôme Richard Sep 02 '23 at 08:10
  • Regarding the theoretical explanation, the idea is that each instruction has a given latency (architecture dependent) and throughput, and the goal is to overlap the latency with computation so as to mitigate its impact. The optimal factor should be latency/throughput, but it is hard to compute, mainly because it depends on the memory hierarchy; it requires a very deep understanding of the target architecture, and some information is sometimes not even provided by the CPU vendor (the Apple M1 is a good example of that)... CPUs with higher ILP tend to benefit from higher unrolling factors. – Jérôme Richard Sep 02 '23 at 08:14