I apologize for the lack of a minimal reproducible example for the problem that follows. I will continue investigating, and if I isolate it to a small section of code, I will add this information to the question. What I'm really looking for are new avenues of investigation, rather than a precise answer.
Context
I wrote some C code which, when compiled under gcc with -O3, shows considerable variation in execution time from one run to the next on the exact same hardware, upwards of 15%. I am not aware of any possible source of variability across runs: it's the same binary running on the same hardware using the same "input" (see a few paragraphs below).
The same effect occurs on two different boards. I am not comparing results across boards. The issue is upwards of 15% runtime variability of the exact same code, compiled to the exact same binary by the exact same compiler and optimization options, running on the same hardware, with the same input. I am merely showing that the issue, whatever it is, is not restricted to a single piece of hardware. In fact, pretend I only had one of the two boards below -- I would still be asking exactly the same question.
The boards are:
- Nvidia Jetson Nano, 4 x Cortex-A57 @ 1.479 GHz, gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0
- Raspberry Pi 4 model B, 4 x Cortex-A72 @ 1.5 GHz, gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0
I use Google Benchmark to measure timings -- note, however, that the variability does not go away when running the code standalone (without Google Benchmark), inside a loop, and timing it with the time shell command.
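For concreteness, the standalone harness amounts to something like the following sketch, timed externally with the time command (crypto_op is a hypothetical stand-in for the actual function under test, and the iteration count is illustrative):

/* Standalone harness: run the function under test in a loop and time
   the whole binary externally, e.g. `time ./standalone`. */
extern void crypto_op(void);  /* hypothetical stand-in */

int main(void) {
    for (int i = 0; i < 100000; i++)
        crypto_op();
    return 0;
}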
The code is fully CPU-bound (a cryptographic algorithm). The code is fully deterministic: a pseudorandom number generator is used, but it is always initialized with the same seed. The working set probably fits in the L1 D-cache, and certainly fits in the L2 cache. No I/O whatsoever is performed by the code.
Timings are very repeatable, on both boards, when compiling the code without optimization. Timings are also very repeatable with clang, with or without optimization. The clang versions are:
- clang version 10.0.0-4ubuntu1~18.04.2 on the Jetson Nano
- Ubuntu clang version 12.0.0-3ubuntu1~21.04.1 on the Raspberry Pi
Thus, the variability only happens with gcc with -O3 optimizations, on two different boards, and with two different versions of gcc.
For reference, this is a log of 30 repetitions of the benchmark, on the Jetson Nano board, for the version compiled with gcc and -O3 optimizations:
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
XXXXXXXXXXXX 433 us 432 us 1628
XXXXXXXXXXXX 431 us 430 us 1628
XXXXXXXXXXXX 418 us 417 us 1628
XXXXXXXXXXXX 414 us 413 us 1628
XXXXXXXXXXXX 414 us 414 us 1628
XXXXXXXXXXXX 414 us 413 us 1628
XXXXXXXXXXXX 389 us 388 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 367 us 366 us 1628
XXXXXXXXXXXX 368 us 367 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 367 us 366 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 365 us 1628
XXXXXXXXXXXX 366 us 366 us 1628
XXXXXXXXXXXX_mean 378 us 377 us 30
XXXXXXXXXXXX_median 366 us 366 us 30
XXXXXXXXXXXX_stddev 22.4 us 22.4 us 30
Note that there are runs where all timings are close to 430 us, whereas in the next run all timings are close to 365 us, as well as runs that start slower and finish faster, as shown above. It's also typical to get values close to 415 us, 400 us and 380 us. The behavior on the Raspberry Pi 4 board is quite similar.
EDIT: as requested in the comments section, here is a run with a fixed 10,000 iterations:
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------
XXXXXXXXXXXX/iterations:10000 421 us 421 us 10000
XXXXXXXXXXXX/iterations:10000 421 us 421 us 10000
XXXXXXXXXXXX/iterations:10000 421 us 421 us 10000
XXXXXXXXXXXX/iterations:10000 421 us 421 us 10000
XXXXXXXXXXXX/iterations:10000 420 us 420 us 10000
XXXXXXXXXXXX/iterations:10000 420 us 420 us 10000
XXXXXXXXXXXX/iterations:10000 398 us 398 us 10000
XXXXXXXXXXXX/iterations:10000 372 us 372 us 10000
XXXXXXXXXXXX/iterations:10000 372 us 372 us 10000
XXXXXXXXXXXX/iterations:10000 372 us 372 us 10000
XXXXXXXXXXXX/iterations:10000_mean 404 us 404 us 10
XXXXXXXXXXXX/iterations:10000_median 420 us 420 us 10
XXXXXXXXXXXX/iterations:10000_stddev 23.0 us 23.0 us 10
END EDIT
For reference, here is a run on the same board, with clang and -O3 optimizations:
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 256 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX 257 us 256 us 2726
XXXXXXXXXXXX 257 us 257 us 2726
XXXXXXXXXXXX_mean 257 us 257 us 30
XXXXXXXXXXXX_median 257 us 257 us 30
XXXXXXXXXXXX_stddev 0.092 us 0.063 us 30
After dozens of runs, the results are always very similar, varying no more than 1 us or so.
Factors investigated
- The results above just scream of dynamic voltage/frequency scaling, turbo boost, throttling or similar effects. This was the first avenue of investigation. On the Jetson Nano, I use the nvpmodel utility to select the MAXN (10 W) setting, as well as the jetson_clocks script to force the maximum possible CPU clock. I also use cpufreq-set from the cpufrequtils package to override the scaling governor to performance. I monitor the CPU clocks, temperatures and fan speeds using the jtop utility from the jetson-stats package. I've never seen any indication of throttling. If there are more reliable methods of reading throttling status and CPU frequencies, I'd be glad to know; see the sketch after this item.
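One additional cross-check that doesn't depend on jtop is to sample the frequency that cpufreq reports through sysfs while the benchmark runs. A minimal sketch, assuming the standard cpufreq sysfs layout (note that scaling_cur_freq is the kernel's view of the frequency; cpuinfo_cur_freq, where available, queries the hardware but usually requires root):

#include <stdio.h>
#include <unistd.h>

/* Poll cpu0's current frequency once per second; any change during a
   benchmark run would point at DVFS or thermal throttling. */
int main(void) {
    for (;;) {
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
        if (!f) { perror("fopen"); return 1; }
        long khz;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu0: %ld kHz\n", khz);
        fclose(f);
        sleep(1);
    }
}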
EDIT: I think throttling can pretty much be ruled out. I installed a kernel module to enable reading the pmccntr_el0 cycle counter from userland (ensuring minimal overhead) and timed the cycle counts for each execution of the function (i.e. each of the thousands of iterations run by Google Benchmark). I then ran it twice on the Raspberry Pi.
In the first run, Google Benchmark measured 372 microseconds, which at the nominal 1.5 GHz clock corresponds to 372e-6 * 1.5e9 = 558,000 cycles, and indeed virtually all cycle counter measurements hover around this value (the very first iteration, as one would expect, takes much longer, closer to 800,000 cycles, but Google Benchmark does a "dry run" before actually measuring, so this isn't even taken into account).
In the second run, Google Benchmark measured 421 us, which corresponds to 421e-6 * 1.5e9 = 631,500 cycles, and again almost all cycle counter measurements hover around this value.
More precisely, the mean for the first run is 558,420 cycles and the standard deviation is 9,536 cycles. The mean for the second run is 631,180 cycles and the standard deviation is 4,431 cycles. Even if the largest standard deviation (from the first run) is considered, this is almost 8 standard deviations of difference between the two means, so it's definitely statistically significant.
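For reference, with userspace access enabled by the kernel module, each measurement boils down to something like the following sketch (read_cycles and run_iteration are my names; run_iteration stands in for one execution of the benchmarked function):

#include <stdint.h>

/* Read the AArch64 PMU cycle counter. This instruction traps unless
   a kernel module has enabled userspace access via PMUSERENR_EL0. */
static inline uint64_t read_cycles(void) {
    uint64_t v;
    asm volatile("mrs %0, pmccntr_el0" : "=r"(v));
    return v;
}

/* Usage sketch:
       uint64_t c0 = read_cycles();
       run_iteration();          // one execution of the function
       uint64_t c1 = read_cycles();
       // record c1 - c0
*/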
END EDIT
A related issue is throttling due to undervoltage, i.e. underpowered PSUs. I've had this issue in the past with the Jetson Nano, and bought a genuine Samsung 25 W USB-C travel charger, capable of 5 V x 3 A output (model EP-TA800XB), which meets the Jetson Nano's specs. This charger comes with a short USB-C to USB-C cable that presumably doesn't drop the voltage too much. This was months ago, and ever since replacing the PSU, I've never seen the throttling warning again. However, since I'm now running the Jetson Nano headless, I have no idea how to monitor this from the command line -- I would appreciate it if anyone knows how to do so.
As for the Raspberry Pi, while experimenting, I did find out that initially I was using an underpowered PSU, which was confirmed by the vcgencmd command with the get_throttled argument. I then switched to a MacBook Pro's 85 W USB-C power supply and monitored the throttling status, which didn't happen again: the result of running vcgencmd get_throttled is zero, regardless of which loads I throw at it. So it looks like throttling due to undervoltage can also be ruled out.
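For reference, here is a small C helper to interpret that bitmask; the bit meanings follow my reading of the Raspberry Pi documentation, so treat them as an assumption rather than gospel:

#include <stdio.h>
#include <stdlib.h>

/* Decode the value printed by `vcgencmd get_throttled`.
   Bit layout per the Raspberry Pi docs, as I understand them. */
int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <hex value>\n", argv[0]);
        return 1;
    }
    unsigned long v = strtoul(argv[1], NULL, 0);
    printf("under-voltage now:      %lu\n", (v >> 0) & 1);
    printf("freq capped now:        %lu\n", (v >> 1) & 1);
    printf("throttled now:          %lu\n", (v >> 2) & 1);
    printf("under-voltage occurred: %lu\n", (v >> 16) & 1);
    printf("freq capping occurred:  %lu\n", (v >> 17) & 1);
    printf("throttling occurred:    %lu\n", (v >> 18) & 1);
    return 0;
}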
- I tried running perf to collect both a sampled time profile as well as various PMU events, trying to find a correlation between the different runtimes and one of these events. There were no significant correlations. The number of instructions executed is always the same (within a fraction of a percent). Cache and branch prediction misses vary a little, but there is no correlation, such as the execution time increasing with more cache/branch prediction misses.
- I tried pinning the process to one core using taskset, but the variability remains (see the in-process pinning sketch after this list).
- I tried running the process with nice -n -20 (as root), but the variability remains.
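For completeness, pinning can also be done from inside the process itself rather than through taskset, which rules out any difference in how the wrapper launches the binary. A minimal sketch (pin_to_core0 is my name for it):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* In-process equivalent of `taskset -c 0`: restrict the calling
   process to CPU 0. */
static int pin_to_core0(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Calling this at the top of main(), before the benchmark loop, should have the same effect as launching under taskset.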
Any ideas of what else I could try?
EDIT: A minimal reproducible example
I finally managed to create a minimal reproducible example. I apologize for not including a crucial piece of information in the question regarding shared libraries; it will become clear shortly. On to the MRE.
main.c:
int f();

int main() {
    /* Call f() many times; storing the result to a volatile ensures
       it is not discarded by the optimizer. */
    for (int i = 0; i < 100000000; i++) {
        volatile int temp = f();
        (void) temp;
    }
    return 0;
}
f.c:
int f() {
    return 1337;
}
Makefile:
all: main-shared main-static
clean:
rm -f *.so *.o main-shared main-static
main-shared: libf.so main.c
gcc -o main-shared main.c libf.so -Wl,-rpath,`pwd`
main-static: f.o main.c
gcc -o main-static main.c f.o
libf.so: f.c
gcc -O3 -fPIC -rdynamic -shared f.c -o libf.so
f.o: f.c
gcc -O3 -fPIC f.c -c -o f.o
Here is a typical run output:
swineone@jetson-nano:~/repro$ make
gcc -O3 -fPIC -rdynamic -shared f.c -o libf.so
gcc -o main-shared main.c libf.so -Wl,-rpath,`pwd`
gcc -O3 -fPIC f.c -c -o f.o
gcc -o main-static main.c f.o
swineone@jetson-nano:~/repro$ time ./main-shared
real 0m0,883s
user 0m0,880s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-shared
real 0m0,621s
user 0m0,612s
sys 0m0,004s
swineone@jetson-nano:~/repro$ time ./main-static
real 0m0,478s
user 0m0,476s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-static
real 0m0,523s
user 0m0,520s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-static
real 0m0,527s
user 0m0,520s
sys 0m0,004s
swineone@jetson-nano:~/repro$ time ./main-static
real 0m0,477s
user 0m0,472s
sys 0m0,004s
swineone@jetson-nano:~/repro$ time ./main-static
real 0m0,504s
user 0m0,500s
sys 0m0,000s
Indeed, I ran time ./main-static 100 times and the minimum/maximum execution times were 477/530 ms. This is more variation than I'd like, but it may have to do with the short execution time, along with the inevitable other processes that are running. Still, it is considerably better than the 621/883 ms split of the shared library version.
Thinking this may have to do with the short runtimes, I increased the number of iterations of the main loop tenfold. The problem still doesn't go away:
swineone@jetson-nano:~/repro$ time ./main-shared
real 0m6,099s
user 0m6,088s
sys 0m0,000s
swineone@jetson-nano:~/repro$ time ./main-shared
real 0m8,692s
user 0m8,672s
sys 0m0,000s
I also ran time ./main-static 10 times. The minimum/maximum run times were 4.754/5.332 s. So the results appear to be independent of the number of iterations.
Thus, it appears that the issue has to do with shared libraries, but it's still unclear to me exactly why this is happening. Hopefully this MRE will help someone figure out what is going on.
EDIT: no idea why, but setting the environment variable LD_BIND_NOW seems to remove the variability on the Raspberry Pi 4, but not on the Jetson Nano. The same effect can be achieved at link time by passing the -Wl,-znow flag to gcc when linking.
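For concreteness, this is the kind of change I mean -- a hypothetical extra target for the Makefile above (main-shared-now is my name, not part of the original MRE):

# Same as main-shared, but with eager (non-lazy) symbol binding.
main-shared-now: libf.so main.c
	gcc -o main-shared-now main.c libf.so -Wl,-rpath,`pwd` -Wl,-znow

Running this binary, or running the original main-shared with LD_BIND_NOW set in the environment, forces all dynamic symbol relocations to be resolved at load time instead of lazily on first call through the PLT.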