
I am trying to parallelize a code on a many-core system. While investigating scaling bottlenecks, I ended up removing everything down to a (nearly) empty for-loop, and finding that the scaling is still only 75% at 28 cores. The example below cannot incur any false sharing, heap contention, or memory bandwidth issues. I see similar or worse effects on a number of machines running Linux or Mac, with physical core counts from 8 up to 56, all with the processors otherwise idling.

The plot shows a test on a dedicated HPC Linux node. It is a "weak scaling" test: the work load is proportional to the number of workers, and the vertical axis shows the rate-of-work done by all the threads combined, scaled to the ideal maximum for the hardware. Each thread runs 1 billion iterations of an empty for-loop. There is one trial for each thread count between 1 and 28. The run time is about 2 seconds per thread, so overhead from thread creation is not a factor.

Could this be the OS getting in our way? Or power consumption maybe? Can anybody produce an example of a calculation (however trivial, weak or strong) that exhibits 100% scaling on a high-core count machine?

[Plot: weak scaling is 75% at 28 cores]

Below is the C++ code to reproduce:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    auto work = [] ()
    {
        auto x = 0.0;

        for (auto i = 0; i < 1000000000; ++i)
        {
            // NOTE: behavior is similar whether or not work is
            // performed here (although if no work is done, you
            // cannot use an optimized build).

            x += std::exp(std::sin(x) + std::cos(x));
        }
        std::printf("-> %lf\n", x); // make sure the result is used
    };

    for (auto num_threads = 1; num_threads < 40; ++num_threads)
    {
        auto handles = std::vector<std::thread>();

        for (auto i = 0; i < num_threads; ++i)
        {
            handles.push_back(std::thread(work));
        }
        auto t0 = std::chrono::high_resolution_clock::now();

        for (auto &handle : handles)
        {
            handle.join();
        }
        auto t1 = std::chrono::high_resolution_clock::now();
        auto delta = std::chrono::duration<double, std::milli>(t1 - t0);

        std::printf("%d %0.2lf\n", num_threads, delta.count());
    }
    return 0;
}

To run the example, make sure to compile with optimizations: g++ -O3 -std=c++17 weak_scaling.cpp. Here is Python code to reproduce the plot (it assumes you pipe the program output to perf.dat).

import numpy as np
import matplotlib.pyplot as plt

threads, time = np.loadtxt("perf.dat").T
a = time[0] / 28
plt.axvline(28, c='k', lw=4, alpha=0.2, label='Physical cores (28)')
plt.plot(threads, a * threads / time, 'o', mfc='none')
plt.plot(threads, a * threads / time[0], label='Ideal scaling')

plt.legend()
plt.ylim(0.0, 1.)
plt.xlabel('Number of threads')
plt.ylabel('Rate of work (relative to ideal)')
plt.grid(alpha=0.5)
plt.title('Trivial weak scaling on Intel Xeon E5-2680v4')
plt.show()

Update -- here is the same scaling on a 56-core node, and that node's architecture:

Update -- there are concerns in the comments that the build was unoptimized. The result is very similar if work is done in the loop, the result is not discarded, and -O3 is used.

[Plots: scaling on the 56-core node, and that node's architecture]

Jonathan Zrake
  • Interesting. How do you run the program? Additionally, do you use a "default" system configuration? (i.e., did you change the configuration of the governor, hyper-threading, scheduling algorithm, frequency limits, etc.) – Jérôme Richard Apr 20 '21 at 19:21
  • No, I've run tests on about a half-dozen machines, all in their default configurations. I didn't include thread-pinning in the example (to keep it simple), but core affinity did not change the result. – Jonathan Zrake Apr 20 '21 at 19:25
  • Testing performance of a program compiled without optimizations is probably not useful, because when optimizations are disabled, the program is deliberately built in such a way as to be easier for a debugger (or human) to understand at the machine-code/assembly level, rather than to be fast/efficient. As such, its performance doesn't tell us much (if anything) about "real-world conditions" where optimizations are always enabled. – Jeremy Friesner Apr 20 '21 at 19:32
  • This could be related to power consumption and the thermal environment. With a few cores running all out and others idle, the processor has extra power and thermal capacity available and can run faster than its rated speed (Turbo Boost). With all cores running all out, it will slow down to (probably) the rated speed, although if it gets too hot it will slow even more. – 1201ProgramAlarm Apr 20 '21 at 19:41
  • Run `watch -n.5 'grep "^cpu MHz" /proc/cpuinfo'` to see how the CPU frequency changes as the test progresses. – rustyx Apr 20 '21 at 19:46
  • I advise you to use the Linux perf tool to analyse the frequency and possibly other factors (with different numbers of threads). You can use `perf stat` to start with. Many HPC machines do not provide enough access to hardware counters, so you might have some issues using perf. However, many HPC vendors provide alternative tools, like Intel VTune on Intel-based machines, and this tool can also analyse hardware counters like perf does (but in a different way which does not require such high privileges). – Jérôme Richard Apr 20 '21 at 19:48
  • This is possibly a duplicate of [this question](https://stackoverflow.com/q/50924929/1329652). I advise you to verify that your `work` function actually behaves as expected **in the single threaded case first**. I believe that you ran headfirst into the fallout of untested assumptions. Make sure that the simplest things work first. You need to prove to yourself that its execution time scales linearly with amount of work. Only then try adding threads to the whole mix. Otherwise it's all smoke and mirrors and a waste of time. And use optimized builds. Always. – Kuba hasn't forgotten Monica Apr 21 '21 at 16:42
  • @Kubahasn'tforgottenMonica -- it's not a duplicate; before posting I checked that the single-threaded execution was linear in the number of iterations. – Jonathan Zrake Apr 22 '21 at 14:36
  • @JonathanZrake the number you plot as "Physical cores" actually corresponds to the logical cores. The Xeon E5-2680 has 14 physical cores and 28 logical cores. Pairs of logical cores share many of the microarchitectural resources of a single physical core (e.g. execution units). That should explain the scaling after 14 threads. Then, if you ran a debug build, memory bandwidth could be another problem: a debug build issues tons of load/store requests to the stack, a lot more than a release build, so it should be better in release. – stepan Apr 24 '21 at 18:29
  • @stepan The figure refers to the [Xeon 6238R](https://ark.intel.com/content/www/fr/fr/ark/products/199345/intel-xeon-gold-6238r-processor-38-5m-cache-2-20-ghz.html) with 28 cores and the code refers to the [Xeon E5-2680v4](https://ark.intel.com/content/www/fr/fr/ark/products/91754/intel-xeon-processor-e5-2680-v4-35m-cache-2-40-ghz.html) with 14 cores indeed. I think the figure is consistent with the title as the Xeon 6238R has 56 hardware threads and the scaling starts to drop at 56 threads. Data should be in the L1 cache (at least the L2) not shared between cores so it should not be the issue – Jérôme Richard Apr 24 '21 at 19:48
  • @JérômeRichard right, my bad about the memory bandwidth - indeed should only use L1 in this case. – stepan Apr 24 '21 at 20:17

1 Answer


The test is meaningless because you don't run an optimized build and don't provide real work.

How can we know this? Because any recent gcc version will remove the useless for-loop, unless you disable optimization. So either you're compiling with optimizations disabled, or the for loop is simply absent.

When I added some real work to your work function and ran an optimized build, the scaling was exactly as expected once the work took longer than about 10 seconds. Below about 100 ms of work, operating-system overheads make the results noisy to the point of meaninglessness (on my particular platform).

Perhaps you're missing the fact that the for loop was optimized away, and are benchmarking thread creation and destruction, not any work done. Or you're benchmarking code built without optimizations. Do some real work. Compute something like a series expansion and print the result out at the end of each thread. You'll get to see the scaling as expected. And look at actual assembly output to make sure that the compiler doesn't statically convert the loop into a constant result. Modern compilers easily recognize e.g. summing arithmetic or geometric series based on constant input, and obligingly replace the computation with the final result.

Do not benchmark anything on unoptimized builds. It's mostly pointless, because you're actively disabling the performance benefits that compiler optimizations provide. And don't benchmark code that doesn't actually do anything: make sure you know for certain that the loop executes as many times as you think, and that it does computational work on each pass.

Kuba hasn't forgotten Monica
  • The work function takes 2 seconds, and scales linearly with the number of iterations. I hope you're not suggesting that thread creation takes half a second? An optimized build will slightly increase the _rate of work per core_ (if the result is not discarded). However, it should not affect the _scaling_. – Jonathan Zrake Apr 22 '21 at 14:17
  • Can you show an example where work is done in the loop, and you attain 100% scaling? With optimizations and work done (see updates above) I still get 80% on a 40-core node and 70% on a 56-core node. I still think it's thermal environment. – Jonathan Zrake Apr 22 '21 at 14:54