I apologize for the lack of a minimal reproducible example for the problem that follows. I will continue investigating, and if I isolate it to a small section of code, I will add this information to the question. What I'm really looking for are new avenues of investigation, rather than a precise answer.

Context

I wrote some C code which, when compiled under gcc with -O3, shows considerable variation in execution time from one run to the next on the exact same hardware, upwards of 15%. I am not aware of any possible source of variability across runs, as it's the same binary running on the same hardware using the same "input" (see a few paragraphs below).

The same effect occurs on two different boards. I am not comparing results across boards. The issue is upwards of 15% runtime variability of the exact same code, compiled to the exact same binary by the exact same compiler and optimization options, running on the same hardware, with the same input. I am merely showing that the issue, whatever it is, is not restricted to a single piece of hardware. In fact, pretend I only had one of the two boards below -- I would still be asking exactly the same question.

The boards are:

  • Nvidia Jetson Nano, 4 x Cortex-A57 @ 1.479 GHz, gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
  • Raspberry Pi 4 model B, 4 x Cortex-A72 @ 1.5 GHz, gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0

I use Google Benchmark to measure timings -- note, however, that the variability does not go away when running the code standalone (without Google Benchmark), inside a loop, and timing with the time shell command.

The code is fully CPU-bound (a cryptographic algorithm). The code is fully deterministic: a pseudorandom number generator is used, but it's always initialized to the same seed. The working set probably fits in the L1 D-cache, and certainly fits in the L2 cache. Absolutely no I/O is performed at all by the code.

Timings are very repeatable, on both boards, when compiling the code without optimization. Timings are also very repeatable with clang with or without optimization. Clang versions are:

  • clang version 10.0.0-4ubuntu1~18.04.2 on the Jetson Nano
  • Ubuntu clang version 12.0.0-3ubuntu1~21.04.1 on the Raspberry Pi

Thus, the variability only happens with gcc at -O3, and it happens on two different boards with two different versions of gcc.

For reference, this is a log of 30 repetitions of the benchmark, on the Jetson Nano board, for the version compiled with gcc and -O3 optimizations:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
XXXXXXXXXXXX               433 us          432 us         1628
XXXXXXXXXXXX               431 us          430 us         1628
XXXXXXXXXXXX               418 us          417 us         1628
XXXXXXXXXXXX               414 us          413 us         1628
XXXXXXXXXXXX               414 us          414 us         1628
XXXXXXXXXXXX               414 us          413 us         1628
XXXXXXXXXXXX               389 us          388 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               367 us          366 us         1628
XXXXXXXXXXXX               368 us          367 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               367 us          366 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          365 us         1628
XXXXXXXXXXXX               366 us          366 us         1628
XXXXXXXXXXXX_mean          378 us          377 us           30
XXXXXXXXXXXX_median        366 us          366 us           30
XXXXXXXXXXXX_stddev       22.4 us         22.4 us           30

Note that there are runs where all timings are close to 430 us, others where all timings are close to 365 us, and still others that start slower and finish faster, as shown. It's also typical to get values close to 415 us, 400 us and 380 us. The behavior on the Raspberry Pi 4 board is quite similar.

EDIT: as requested in the comments section, here is a run with a fixed 10,000 iterations:

-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
XXXXXXXXXXXX/iterations:10000               421 us          421 us        10000
XXXXXXXXXXXX/iterations:10000               421 us          421 us        10000
XXXXXXXXXXXX/iterations:10000               421 us          421 us        10000
XXXXXXXXXXXX/iterations:10000               421 us          421 us        10000
XXXXXXXXXXXX/iterations:10000               420 us          420 us        10000
XXXXXXXXXXXX/iterations:10000               420 us          420 us        10000
XXXXXXXXXXXX/iterations:10000               398 us          398 us        10000
XXXXXXXXXXXX/iterations:10000               372 us          372 us        10000
XXXXXXXXXXXX/iterations:10000               372 us          372 us        10000
XXXXXXXXXXXX/iterations:10000               372 us          372 us        10000
XXXXXXXXXXXX/iterations:10000_mean          404 us          404 us           10
XXXXXXXXXXXX/iterations:10000_median        420 us          420 us           10
XXXXXXXXXXXX/iterations:10000_stddev       23.0 us         23.0 us           10

END EDIT

For reference, here is a run on the same board, with clang and -O3 optimizations:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          256 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX               257 us          256 us         2726
XXXXXXXXXXXX               257 us          257 us         2726
XXXXXXXXXXXX_mean          257 us          257 us           30
XXXXXXXXXXXX_median        257 us          257 us           30
XXXXXXXXXXXX_stddev      0.092 us        0.063 us           30

After dozens of runs, the results are always very similar, varying no more than 1 us or so.

Factors investigated

  1. The results above just scream dynamic voltage/frequency scaling, turbo boost, throttling or similar effects. This was the first avenue of investigation. On the Jetson Nano, I use the nvpmodel utility to select the MAXN (10 W) setting, as well as the jetson_clocks script to force the maximum possible CPU clock. I also use cpufreq-set from the cpufrequtils package to override the scaling governor to performance. I monitor the CPU clocks, temperatures and fan speeds using the jtop utility from the jetson-stats package. I've never seen any possible indication of throttling. If there are more reliable methods of reading throttling status and CPU frequencies, I'd be glad to know.

EDIT: I think throttling can pretty much be ruled out. I installed a kernel module to enable reading from the pmccntr_el0 cycle counter from userland (ensuring minimal overhead) and timed the cycle counts for each execution of the function (i.e. each of the thousands of iterations run by Google Benchmark). I then ran it twice on the Raspberry Pi.

In the first run, Google Benchmark measured 372 microseconds, from which one would expect 372e-6 * 1.5e9 = 558,000 cycles, and indeed virtually all cycle counter measurements hover around this value (the very first run, as one would expect, takes much longer, closer to 800,000 cycles, but Google Benchmark does a "dry run" first before actually measuring, so this isn't even taken into account).

In the second run, Google Benchmark measured 421 us, from which one would expect 421e-6 * 1.5e9 = 631,500 cycles, and again almost all cycle counter measurements hover around this value.

More precisely, the mean for the first run is 558,420 cycles and the standard deviation is 9,536 cycles. The mean for the second run is 631,180 cycles and the standard deviation is 4,431 cycles. Even if the largest standard deviation (from the first run) is considered, this is almost 8 standard deviations of difference between the two means, so it's definitely statistically significant.

END EDIT

  2. A related issue is throttling due to undervoltage, i.e. underpowered PSUs. I've had this issue in the past with the Jetson Nano, and bought a genuine Samsung 25 W USB-C travel charger, capable of 5 V x 3 A output (model EP-TA800XB), which meets the Jetson Nano's specs. This charger comes with a short USB-C to USB-C cable that presumably doesn't drop voltage too much. This was months ago, and ever since replacing the PSU, I've never seen the throttling warning again. However, since I'm running the Jetson Nano headless now, I have no idea how to monitor this from the command line -- I'd appreciate it if anyone knows how to do so.

    As for the Raspberry Pi, while experimenting, I did find out that initially I was using an underpowered PSU, which was confirmed by the vcgencmd command using the get_throttled argument. I then switched to a MacBook Pro's 85 W USB-C power supply, and monitored the throttling status, which didn't happen again: the result of running vcgencmd get_throttled is zero, regardless of which loads I throw at it.

    So it looks like throttling due to undervoltage can also be ruled out.

  3. I tried to run perf and collect both a sampled time profile and various PMU events, to try to find a correlation between different runtimes and one of these events. There were no significant correlations. The number of instructions run is always the same (within a fraction of a percent). Cache and branch prediction misses vary a little, but there is no correlation, such as the execution time increasing with more cache/branch prediction misses.

  4. I tried pinning the process to one core using taskset, but the variability remains.

  5. I tried running the process with nice -n -20 (as root), but the variability remains.
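
The pinning, priority and PMU experiments above boil down to invocations like the following (a sketch; ./bench stands in for the actual benchmark binary, and the core number is arbitrary):

```shell
# Pin to a single core and raise scheduling priority (nice -20 needs root)
sudo nice -n -20 taskset -c 2 ./bench

# Collect PMU counters to compare across fast and slow runs
perf stat -e instructions,cycles,cache-misses,branch-misses ./bench
```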

Any ideas of what else I could try?

EDIT: A minimal reproducible example

I finally managed to create a minimal reproducible example. And I apologize for not including a crucial piece of information in the question regarding shared libraries, which will become clear shortly. On to the MRE.

main.c:

int f();

int main() {
    for (int i = 0; i < 100000000; i++) {
        volatile int temp = f();
        (void) temp;
    }
    
    return 0;
}

f.c:

int f() {
    return 1337;
}

Makefile:

all: main-shared main-static

clean:
    rm -f *.so *.o main-shared main-static

main-shared: libf.so main.c
    gcc -o main-shared main.c libf.so -Wl,-rpath,`pwd`

main-static: f.o main.c
    gcc -o main-static main.c f.o

libf.so: f.c
    gcc -O3 -fPIC -rdynamic -shared f.c -o libf.so

f.o: f.c
    gcc -O3 -fPIC f.c -c -o f.o

Here is a typical run output:

swineone@jetson-nano:~/repro$ make
gcc -O3 -fPIC -rdynamic -shared f.c -o libf.so
gcc -o main-shared main.c libf.so -Wl,-rpath,`pwd`
gcc -O3 -fPIC f.c -c -o f.o
gcc -o main-static main.c f.o
swineone@jetson-nano:~/repro$ time ./main-shared

real    0m0,883s
user    0m0,880s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-shared

real    0m0,621s
user    0m0,612s
sys 0m0,004s
swineone@jetson-nano:~/repro$ time ./main-static

real    0m0,478s
user    0m0,476s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-static

real    0m0,523s
user    0m0,520s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-static

real    0m0,527s
user    0m0,520s
sys 0m0,004s
swineone@jetson-nano:~/repro$ time ./main-static

real    0m0,477s
user    0m0,472s
sys 0m0,004s
swineone@jetson-nano:~/repro$ time ./main-static

real    0m0,504s
user    0m0,500s
sys 0m0,000s

Indeed, I ran time ./main-static 100 times and the minimum/maximum execution times were 477/530 ms. This is more spread than I'd like, but may have to do with the short execution time along with the inevitable other processes that are running. But this is still considerably better than the 621/883 ms split of the shared library version.

Thinking this may have to do with the short runtimes, I increased the number of iterations of the main loop tenfold. The problem still doesn't go away:

swineone@jetson-nano:~/repro$ time ./main-shared 

real    0m6,099s
user    0m6,088s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-shared 

real    0m8,692s
user    0m8,672s
sys 0m0,000s

I also ran time ./main-static 10 times. The minimum/maximum run times were 4.754/5.332 s. So, results appear to be independent of the number of iterations.

Thus, it appears that the issue has to do with shared libraries, but it's still unclear to me exactly why this is happening. Hopefully this MRE will help someone figure out what is going on.

EDIT: no idea why, but setting the environment variable LD_BIND_NOW seems to remove the variability on the Raspberry Pi 4, but not on the Jetson Nano. This can also be achieved at link time by passing the -Wl,-znow flag to gcc when linking.
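
For reference, the two ways of forcing eager binding mentioned above look like this (a sketch; the paths and target names follow the Makefile shown earlier):

```shell
# At run time: resolve all PLT entries at startup instead of lazily
LD_BIND_NOW=1 ./main-shared

# At link time: bake eager binding into the binary
gcc -o main-shared main.c libf.so -Wl,-rpath,`pwd` -Wl,-znow
```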

swineone

1 Answer


Your operating system can change the clock speed. Initially, the processor runs at a low clock speed to keep power consumption low. Then, when the CPU is heavily used, the OS decides that running fast is more important than power consumption and increases the clock speed.

gnasher729