
Since MPI-3 comes with functionality for shared-memory parallelism, and it seems to be a perfect match for my application, I'm seriously considering rewriting my hybrid OpenMP-MPI code into a pure MPI implementation.
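For context, the MPI-3 feature I mean is the shared-memory window API, used roughly like this (untested sketch, error handling omitted):

#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // group the ranks that live on the same node
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

    // each rank contributes a slice; the whole window is directly addressable
    // by every rank on the node
    const MPI_Aint local_count = 1000000;
    double* local_ptr = nullptr;
    MPI_Win win;
    MPI_Win_allocate_shared(local_count * sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node_comm, &local_ptr, &win);

    // ... query the neighbours' slices with MPI_Win_shared_query and compute ...

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
}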

In order to drive the last nail into the coffin, I decided to run a small program to test the latency of the OpenMP fork/join mechanism. Here's the code (written for the Intel compiler):

#include <cmath>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <vector>

#include <omp.h>

// note: timer::spot_timer is a small custom timer class (not shown here);
// duration_in_us() returns a timestamp in microseconds

void action1(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
    }
}

void action2(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
    }
}

void action3(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * t2.data()[index];
    }
}

void action4(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sqrt(t2.data()[index]);
    }
}

void action5(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = t2.data()[index] * 2.0;
    }
}

void all_actions(std::vector<double>& t1, std::vector<double>& t2)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t2.data()[index]) * std::cos(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * std::sin(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * t2.data()[index];
        t1.data()[index] = std::sqrt(t2.data()[index]);
        t1.data()[index] = t2.data()[index] * 2.0;
    }
}


int main()
{
    // decide the process parameters
    const auto n = std::size_t{8000000};
    const auto test_count = std::size_t{500};
    
    // garbage data...
    auto t1 = std::vector<double>(n);
    auto t2 = std::vector<double>(n);
    
    //+/////////////////
    // perform actions one after the other
    //+/////////////////
    
    const auto sp = timer::spot_timer();
    const auto dur1 = sp.duration_in_us();
    for (auto index = std::size_t{}; index < test_count; ++index)
    {
        #pragma noinline
        action1(t1, t2);
        #pragma noinline
        action2(t1, t2);
        #pragma noinline
        action3(t1, t2);
        #pragma noinline
        action4(t1, t2);
        #pragma noinline
        action5(t1, t2);
    }
    const auto dur2 = sp.duration_in_us();
    
    //+/////////////////
    // perform all actions at once
    //+/////////////////
    const auto dur3 = sp.duration_in_us();
    for (auto index = std::size_t{}; index < test_count; ++index)
    {
        #pragma noinline
        all_actions(t1, t2);
    }
    const auto dur4 = sp.duration_in_us();
    
    const auto a = dur2 - dur1;
    const auto b = dur4 - dur3;
    if (a < b)
    {
        throw std::logic_error("negative_latency_error");
    }
    const auto fork_join_latency = (a - b) / (test_count * 4);
    
    // report
    std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: " << fork_join_latency << " us" << std::endl;
    
    return 0;
}

As you can see, the idea is to perform a set of actions separately (each within its own OpenMP parallel loop) and compute the average duration, then to perform all of these actions together (within a single OpenMP parallel loop) and compute the average duration of that. This gives a linear system of two equations in two unknowns, one of which is the latency of the fork/join mechanism, and solving it yields the value.
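In other words, the model I'm assuming is the following, with W the total loop work per pass and L the fork/join latency (my notation):

a ≈ test_count * (W + 5*L)    // five parallel regions per pass
b ≈ test_count * (W + 1*L)    // one parallel region per pass
L ≈ (a - b) / (4 * test_count)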

Questions:

  1. Am I overlooking something?
  2. Currently, I am using "-O0" to prevent the smarty-pants compiler from doing its funny business. Which compiler optimizations should I use, and would these also have an effect on the latency itself?
  3. On my Coffee Lake processor with 6 cores, I measured a latency of ~850 us. Does this sound about right?

Edit 3

  1. I've included a warm-up calculation at the beginning, following @paleonix's suggestion,

  2. I've reduced the number of actions for simplicity, and

  3. I've switched to `omp_get_wtime` to make the timing universally understandable.

I am now running the following code with the flag -O3:

#include <cmath>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

#include <omp.h>

void action1(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::sin(t1.data()[index]);
    }
}

void action2(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::cos(t1.data()[index]);
    }
}

void action3(std::vector<double>& t1)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        t1.data()[index] = std::atan(t1.data()[index]);
    }
}

void all_actions(std::vector<double>& t1, std::vector<double>& t2, std::vector<double>& t3)
{
    #pragma omp parallel for schedule(static) num_threads(std::thread::hardware_concurrency())
    for (auto index = std::size_t{}; index < t1.size(); ++index)
    {
        #pragma optimize("", off)
        t1.data()[index] = std::sin(t1.data()[index]);
        t2.data()[index] = std::cos(t2.data()[index]);
        t3.data()[index] = std::atan(t3.data()[index]);
        #pragma optimize("", on)
    }
}


int main()
{
    // decide the process parameters
    const auto n = std::size_t{1500000}; // 12 MB (way too big for any cache)
    const auto experiment_count = std::size_t{1000};
    
    // garbage data...
    auto t1 = std::vector<double>(n);
    auto t2 = std::vector<double>(n);
    auto t3 = std::vector<double>(n);
    auto t4 = std::vector<double>(n);
    auto t5 = std::vector<double>(n);
    auto t6 = std::vector<double>(n);
    auto t7 = std::vector<double>(n);
    auto t8 = std::vector<double>(n);
    auto t9 = std::vector<double>(n);
    
    //+/////////////////
    // warm-up, initialization of threads etc.
    //+/////////////////
    for (auto index = std::size_t{}; index < experiment_count / 10; ++index)
    {
        all_actions(t1, t2, t3);
    }
    
    //+/////////////////
    // perform actions (part A)
    //+/////////////////
    
    const auto dur1 = omp_get_wtime();
    for (auto index = std::size_t{}; index < experiment_count; ++index)
    {
        action1(t4);
        action2(t5);
        action3(t6);
    }
    const auto dur2 = omp_get_wtime();
    
    //+/////////////////
    // perform all actions at once (part B)
    //+/////////////////

    const auto dur3 = omp_get_wtime();
    #pragma nofusion
    for (auto index = std::size_t{}; index < experiment_count; ++index)
    {
        all_actions(t7, t8, t9);
    }
    const auto dur4 = omp_get_wtime();
    
    const auto a = dur2 - dur1;
    const auto b = dur4 - dur3;
    const auto fork_join_latency = (a - b) / (experiment_count * 2);
    
    // report
    std::cout << "Ran the program with " << omp_get_max_threads() << ", the calculated fork/join latency is: "
        << fork_join_latency * 1E+6 << " us" << std::endl;
    
    return 0;
}

With this, the measured latency is now 115 us. What's puzzling me now is that this value changes when the actions are changed. According to my logic, since I'm doing the same actions in both parts A and B, there should actually be no change. Why is this happening?

Nitin Malapally
  • Depending on `n` your measurement might be dominated by cache effects. It might be more fair to have one input and one output buffer for each operation. Then `all_actions` won't be optimized away that easily, too. Naturally you will still have to e.g. write the output to a file or sth like that. – paleonix Feb 11 '22 at 10:43
  • By not letting the compiler inline, you might also measure function call overhead, although this is probably a rather small factor overall. – paleonix Feb 11 '22 at 10:49
  • It might be even easier to do reductions on all the calculations. The output will be small (could just write it to `std::cout`) and you depend less on cache effects/memory bandwidth. – paleonix Feb 11 '22 at 10:52
  • @rustyx If I were to do that, I'd have to deal with loop-unrolling, which would even change the number of iterations handed out to each thread in the different sections, wouldn't it? Then the work done in part A (separate OpenMP loops) might be different from the work done in part B (single OpenMP loop), which must be constant for this measurement to work. – Nitin Malapally Feb 11 '22 at 10:52
  • @paleonix 1. Thanks for the tip about the cache effects, I had not considered that at all! Yes, each action requires a separate buffer. 2. True, but allowing inlining might let other in-loop optimizations take place... – Nitin Malapally Feb 11 '22 at 10:55
  • @MotiveHunter Without optimizations your results will be useless (or do you plan to use your application without optimizations as well?). You could manually turn off loop unrolling, if you are concerned with that. – paleonix Feb 11 '22 at 10:56
  • On your hardware, with a decent OpenMP runtime (IOMP is fine) and optimization enabled, the overhead of using an OpenMP parallel for should be clearly less than 10 us. I personally expect something like 0.5~1 us. With 850 us, you are definitely measuring the wrong thing, or threads are somehow recreated every time (it should not be the case with IOMP unless the number of threads changes during the benchmark). By the way, I do not know what `timer::spot_timer` is, but you should be careful about how the time is measured. Consider using OpenMP's `omp_get_wtime`. – Jérôme Richard Feb 11 '22 at 11:48
  • By calling e.g. `all_actions` once before starting the measurements, you could make sure that you are not measuring thread creation in the first measurement, while the threads already exist in a threadpool for the second measurement (implementation dependent). – paleonix Feb 11 '22 at 12:08
  • I was inspired to play around a bit. [This](https://godbolt.org/z/MjfeW7Ga6) is probably not ideal either, but maybe you find it interesting. You probably shouldn't trust the results in Compiler Explorer, but run the code in a controlled environment where you know exactly how many cores are at your disposal. The idea was to do sth in between the fork and the join that can't be optimized away and takes a somewhat predictable amount of time. – paleonix Feb 11 '22 at 14:14
  • @paleonix Very elegant solution, hadn't thought in this direction. Yes, it tallies, we're talking about ~150 - 300 us depending on the number of threads. – Nitin Malapally Feb 11 '22 at 14:57
  • @MotiveHunter What implementation/compiler/compiler version are you using? – paleonix Feb 11 '22 at 15:11
  • @paleonix Intel 20.2.5 – Nitin Malapally Feb 11 '22 at 15:13

3 Answers


Here is my attempt at measuring fork-join overhead:

#include <iostream>
#include <string>

#include <omp.h>

constexpr int n_warmup = 10'000;
constexpr int n_measurement = 100'000;
constexpr int n_spins = 1'000;

void spin() {
    volatile bool flag = false;
    for (int i = 0; i < n_spins; ++i) {
        if (flag) {
            break;
        }
    }
}

void bench_fork_join(int num_threads) {
    omp_set_num_threads(num_threads);

    // create threads, warmup
    for (int i = 0; i < n_warmup; ++i) {
        #pragma omp parallel
        spin();
    }

    double const start = omp_get_wtime();
    for (int i = 0; i < n_measurement; ++i) {
        #pragma omp parallel
        spin();
    }
    double const stop = omp_get_wtime();
    double const ptime = (stop - start) * 1e6 / n_measurement;

    // warmup
    for (int i = 0; i < n_warmup; ++i) {
        spin();
    }
    double const sstart = omp_get_wtime();
    for (int i = 0; i < n_measurement; ++i) {
        spin();
    }
    double const sstop = omp_get_wtime();
    double const stime = (sstop - sstart) * 1e6 / n_measurement;

    std::cout << ptime << " us\t- " << stime << " us\t= " << ptime - stime << " us\n";
}

int main(int argc, char **argv) {
    auto const params = argc - 1;
    std::cout << "parallel\t- sequential\t= overhead\n";

    for (int j = 0; j < params; ++j) {
        auto num_threads = std::stoi(argv[1 + j]);
        std::cout << "---------------- num_threads = " << num_threads << " ----------------\n";
        bench_fork_join(num_threads);
    }

    return 0;
}

You can call it with multiple different numbers of threads, which should not be higher than the number of cores on your machine to give reasonable results. On my machine with 6 cores, compiling with gcc 11.2, I get

$ g++ -fopenmp -O3 -DNDEBUG -o bench-omp-fork-join bench-omp-fork-join.cpp
$ ./bench-omp-fork-join 6 4 2 1
parallel        - sequential    = overhead
---------------- num_threads = 6 ----------------
1.51439 us      - 0.273195 us   = 1.24119 us
---------------- num_threads = 4 ----------------
1.24683 us      - 0.276122 us   = 0.970708 us
---------------- num_threads = 2 ----------------
1.10637 us      - 0.270865 us   = 0.835501 us
---------------- num_threads = 1 ----------------
0.708679 us     - 0.269508 us   = 0.439171 us

In each line the first number is the average (over 100'000 iterations) with threads and the second number is the average without threads. The last number is the difference between the first two and should be an upper bound on the fork-join overhead.

Make sure that the numbers in the middle column (no threads) are approximately the same in every row, as they should be independent of the number of threads. If they aren't, make sure there is nothing else running on the computer and/or increase the number of measurements and/or warmup runs.

In regard to exchanging OpenMP for MPI, keep in mind that MPI is still multiprocessing and not multithreading. You might pay a lot of memory overhead because processes tend to be much bigger than threads.

EDIT:

Revised benchmark to use spinning on a volatile flag instead of sleeping (Thanks @Jérôme Richard). As Jérôme Richard mentioned in his answer, the measured overhead grows with n_spins. Setting n_spins below 1000 didn't significantly change the measurement for me, so that is where I measured. As one can see above, the measured overhead is way lower than what the earlier version of the benchmark measured.

The inaccuracy of sleeping is a problem especially because one will always measure the thread that sleeps the longest, and therefore gets a bias toward longer times, even if the sleep times themselves were distributed symmetrically around the requested time.

paleonix
  • Sleep functions are generally not very accurate when the waiting time is small. This is especially true on Windows where the granularity is about a few ms. Even on Linux, the granularity is hardly less than a few dozen us. Sleeping also causes expensive context switches and changes the behavior of the thread execution. This includes P-state (ie. lower core frequency) and C-state switches (higher wake-up latency). For more information, please read [this post](https://stackoverflow.com/questions/1133857/how-accurate-is-pythons-time-sleep). A busy-waiting loop may be better for such a benchmark. – Jérôme Richard Feb 11 '22 at 17:08
  • *"In regard to exchanging OpenMP for MPI, keep in mind that MPI is still multiprocessing and not multithreading. You might pay a lot of memory overhead because processes tend to be much bigger than threads."* The possibility I'm considering is using the shared memory parallelization feature-set of the MPI 3.0+ standards. Using this one can create a window of shared memory for each process, which means although each process has its own stack, it can effectively manage only its own data. See [here](https://pages.tacc.utexas.edu/~eijkhout/pcse/html/mpi-shared.html) for more info. – Nitin Malapally Feb 13 '22 at 14:49

TL;DR: cores do not operate at exactly the same speed because of dynamic frequency scaling, and there is a lot of noise that can impact the execution, resulting in expensive synchronizations. Your benchmark mostly measures this synchronization overhead. Using a single parallel section should solve this problem.


The benchmark is rather flawed. This code does not actually measure the "latency" of OpenMP fork/join sections. It measures a combination of many overheads, including:

load balancing and synchronizations: the split loops perform more frequent synchronizations (5 times more) than the big merged one. Synchronizations are expensive, not because of the communication overhead, but because of the actual synchronization between cores that are inherently not synchronized. Indeed, a slight work imbalance between the threads results in the other threads waiting for the completion of the slowest one. You might think this should not happen because of static scheduling, but context switches and dynamic frequency scaling cause some threads to be slower than others. Context switches are especially important if threads are not bound to cores or if other programs are scheduled by the OS during the computation. Dynamic frequency scaling (eg. Intel Turbo Boost) causes some cores (or groups of cores) to run faster depending on the workload, the temperature of each core and of the overall package, the number of active cores, the estimated power consumption, etc. The higher the number of cores, the higher the synchronization overhead. Note that this overhead depends on the time taken by the loops. For more information, please read the analysis below.

Performance of loop splitting: merging the 5 loops into a single one impacts the generated assembly code (since fewer instructions are needed) and also impacts loads/stores in the cache (since the memory access pattern is a bit different). Not to mention it could theoretically impact vectorization, although ICC does not vectorize this specific code. That being said, this does not appear to be the main issue on my machine, since I am not able to reproduce the problem with Clang by running the program sequentially, while I can with many threads.

To solve this problem you can use a single parallel section. The omp for loops must use the nowait clause so as not to introduce synchronizations. Alternatively, task-based constructs like taskloop with the nogroup clause can help to achieve the same thing. In both cases you should be careful about dependencies, since multiple for-loops/taskloops can run in parallel. This is fine in your current code.
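For example, here is a sketch of what the merged version could look like with a single parallel section and nowait loops (based on the three actions of your edited code; not benchmarked):

#include <cmath>
#include <cstddef>
#include <vector>

void all_actions_single_region(std::vector<double>& t1, std::vector<double>& t2, std::vector<double>& t3)
{
    #pragma omp parallel   // only one fork/join for the three loops
    {
        // nowait removes the implicit barrier at the end of each work-shared loop;
        // this is safe here because the loops touch independent buffers
        #pragma omp for schedule(static) nowait
        for (std::size_t i = 0; i < t1.size(); ++i)
            t1[i] = std::sin(t1[i]);

        #pragma omp for schedule(static) nowait
        for (std::size_t i = 0; i < t2.size(); ++i)
            t2[i] = std::cos(t2[i]);

        #pragma omp for schedule(static) nowait
        for (std::size_t i = 0; i < t3.size(); ++i)
            t3[i] = std::atan(t3[i]);
    } // single implicit barrier at the end of the parallel region
}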


Analysis

Analyzing the effect of short synchronizations caused by execution noise (context switches, frequency scaling, cache effects, OS interruptions, etc.) is pretty hard since it is likely never the same thread that is the slowest during synchronizations in your case (the work between threads is quite balanced, but their speed is not perfectly equal).

That being said, if this hypothesis is true, fork_join_latency should depend on n. Thus, increasing n should also increase fork_join_latency. Here is what I get with Clang 13 + IOMP (using -fopenmp -O3) on my 6-core i5-9600KF processor:

n=   80'000    fork_join_latency<0.000001
n=  800'000    fork_join_latency=0.000036
n= 8'000'000   fork_join_latency=0.000288
n=80'000'000   fork_join_latency=0.003236

Note that the fork_join_latency timings are not very stable in practice, but the behavior is pretty visible: the measured overhead depends on n.

A better solution is to measure the synchronization time by timing the loop in each thread and accumulating the difference between the minimum and maximum time. Here is an example of code:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

#include <omp.h>

double totalSyncTime = 0.0;

void action1(std::vector<double>& t1)
{
    constexpr int threadCount = 6;
    double timePerThread[threadCount] = {0};

    #pragma omp parallel
    {
        const double start = omp_get_wtime();
        #pragma omp for nowait schedule(static) //num_threads(std::thread::hardware_concurrency())
        #pragma nounroll
        for (auto index = std::size_t{}; index < t1.size(); ++index)
        {
            t1.data()[index] = std::sin(t1.data()[index]);
        }
        const double stop = omp_get_wtime();
        const double threadLoopTime = (stop - start);
        timePerThread[omp_get_thread_num()] = threadLoopTime;
    }

    const double mini = *std::min_element(timePerThread, timePerThread+threadCount);
    const double maxi = *std::max_element(timePerThread, timePerThread+threadCount);
    const double syncTime = maxi - mini;
    totalSyncTime += syncTime;
}

You can then divide totalSyncTime the same way you did for fork_join_latency and print the result. I get 0.000284 with fork_join_latency=0.000398 (with n=8'000'000), which almost proves that a major part of the overhead is due to synchronizations, and more specifically due to slightly different thread execution speeds. Note that this overhead does not include the implicit barrier at the end of the OpenMP parallel section.

Jérôme Richard
  • *"You might think this should not happen because of static scheduling, but context switches and dynamic frequency scaling cause some threads to be slower than others."* I've disabled dynamic frequency scaling on my test machine using kernel tools. Also, context switches can happen in the fused loop as well, so isn't it white noise while averaged out over many experiments? – Nitin Malapally Feb 11 '22 at 17:19
  • *"...also impacts load/store in the cache (since the memory access pattern is a bit different)..."* That's true, but this shouldn't happen with unique buffers for each run and streaming access. – Nitin Malapally Feb 11 '22 at 17:22
  • It's a strong argument that my benchmark program was actually measuring the sync. overheads. Although, as you mentioned in your answer, I assumed this would not happen if I switched off dynamic frequency scaling (using 'userspace' governor and setting specific clock frequency) and used static scheduling. As for context switches, please see one of the above comments. – Nitin Malapally Feb 11 '22 at 17:29
  • Ok for the frequency scaling (though I advise you to check that it truly is the case, since the last time I tried that, it was not fully working). As for context switches, as well as many other sources of overhead, they indeed occur in both cases, but over-synchronization causes waiting times that do not appear in the merged loop. These waiting times cannot be naively averaged out as noise since they generally do not occur on the same threads. – Jérôme Richard Feb 11 '22 at 17:31
  • For the caches, it is a bit complex to explain as it is related to very low-level behavior of Intel/AMD x86 processors. Such effects often happen, but a deeper analysis on my machine shows that they do not appear to significantly impact the results here. So let's skip this part. For more information about this, you can search for [cache thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)) and hardware prefetching (in [Intel manuals](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)). – Jérôme Richard Feb 11 '22 at 17:38
  • Note that there are many other sources of noise I skipped in my answer. For example, one main possible source is the memory throughput, which may not be perfectly balanced between threads. I also saw that the numbers of page faults, branch mispredictions, cache misses and context switches are not the same on all cores of my machine during the benchmark. Further analysis can be made to find the exact origin, but I think it is better to deal with such noise rather than trying to remove it. The sources of noise have been increasing over the last decades and that is likely not going to stop soon. – Jérôme Richard Feb 11 '22 at 17:47
  • Yes, the setting of a particular CPU frequency for all cores works quite well on my system using *cpupower* on Ubuntu 20.04 with *acpi-freq* CPU driver. – Nitin Malapally Feb 13 '22 at 14:40
  • I ran your code on my computer with frequency scaling enabled and got an average value of 0.9 us (averaged over 100 samples, each with its own buffer), and 0.2 us with frequency scaling turned off. – Nitin Malapally Feb 13 '22 at 14:52
  • Ok so compared to the previous timing. This means that the overhead is certainly located at the beginning/ending of the OpenMP section. Unless you use a custom/modified IOMP, it should not create new threads so this should not be the cause. The thing is that IOMP is designed so the overhead in this case should be clearly much less than 100 us on a PC... Do you use a passive wait policy? Can you try with `OMP_WAIT_POLICY=ACTIVE`. Alternatively, C-states can play a role but I cannot reproduce this on my machine (and AFAIK they cannot be disabled)... What precise processor do you use by the way? – Jérôme Richard Feb 13 '22 at 15:55
  • I'm using an unmodified IOMP. I can try out what you said and get back to you. I have the Intel Core i5 8500 processor. – Nitin Malapally Feb 13 '22 at 18:33
  • With OMP_WAIT_POLICY=ACTIVE, I had an average measurement of 0.2 us and with PASSIVE, 0.9 us. – Nitin Malapally Feb 14 '22 at 07:38

See my answer to a related question: https://stackoverflow.com/a/71812329/2044454

TL;DR: I split 10k parallel loops into x repetitions outside the parallel region and 10k/x inside it. The conclusion is that the cost of starting a parallel region is basically zip.
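Roughly, the structure was the following (simplified sketch, not the exact code from the linked answer):

#include <cstddef>
#include <cstdio>
#include <vector>

#include <omp.h>

void run(int x, std::vector<double>& v)
{
    const int total = 10000;                    // total number of work-shared loops
    const double t0 = omp_get_wtime();
    for (int outer = 0; outer < x; ++outer)     // x parallel regions...
    {
        #pragma omp parallel
        {
            for (int inner = 0; inner < total / x; ++inner)  // ...with total/x loops each
            {
                #pragma omp for
                for (std::size_t i = 0; i < v.size(); ++i)
                    v[i] += 1.0;
            }
        }
    }
    const double t1 = omp_get_wtime();
    std::printf("x = %5d regions: %f s\n", x, t1 - t0);
}

If the total time hardly changes with x, the cost of creating the parallel regions is negligible.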

Victor Eijkhout