The program
I have a C++ program that looks something like the following:
```cpp
<load data from disk, etc.>

// Get some buffers aligned to 4 KiB
double* const x_a = static_cast<double*>(std::aligned_alloc(......));
double* const p   = static_cast<double*>(std::aligned_alloc(......));
double* const m   = static_cast<double*>(std::aligned_alloc(......));

double sum = 0.0;
const auto timerstart = std::chrono::steady_clock::now();

for (uint32_t i = 0; i < reps; i++) {
    uint32_t pos = 0;
    double factor;
    if ((i % 2) == 0) factor = 1.0; else factor = -1.0;

    for (uint32_t j = 0; j < xyzvec.size(); j++) {
        pos = j * basis::ndist; // ndist is a compile-time constant == 36
        for (uint32_t k = 0; k < basis::ndist; k++) x_a[k] = distvec[k + pos];
        sum += factor * basis::energy(x_a, &coeff[0], p, m);
    }
}

const auto timerstop = std::chrono::steady_clock::now();

<free memory, print stats, etc.>
```
where `reps` is a single-digit number, `xyzvec` has ~15k elements, and a single call to `basis::energy(...)` takes about 100 µs to return. The `energy` function is huge in terms of code size (~5 MiB of source code that looks something like this; it comes from a code generator).
Edit: The `m` array is somewhat large, ~270 KiB for this test case.
Edit 2: Source code of the two functions responsible for ~90% of execution time
All of the pointers entering `energy` are `__restrict__`-qualified and declared aligned via `__assume_aligned(...)`, and the object files are built with `-Ofast -march=haswell` so the compiler can optimize and vectorize at will. Profiling suggests the function is currently frontend-bound (L1i cache misses, and fetch/decode stalls). `energy` does no dynamic memory allocation or I/O; it mostly reads and writes `x_a` (which it treats as const), `p`, and `m`, all of which are aligned to 4 KiB page boundaries. Its execution time ought to be fairly consistent.
The strange timing behaviour
Running the program many times and looking at the time elapsed between the timer start/stop calls above, I have found it to have a strange bimodal distribution.
- Calls to `energy` are either "fast" or "slow": fast ones take ~91 µs, slow ones ~106 µs, on an Intel Skylake-X 7820X.
- All calls to `energy` in a given process are either fast or slow; the metaphorical coin is flipped once, when the process starts.
- The outcome is not quite random, and can be heavily biased towards the "fast" case by purging all kernel caches via `echo 3 | sudo tee /proc/sys/vm/drop_caches` immediately before execution.
- The effect may be CPU-dependent. Running the same executable on a Ryzen 1700X yields both faster and much more consistent execution; the "slow" runs either don't happen or are far less prominent. Both machines run the same OS (Ubuntu 20.04 LTS, kernel 5.11.0-41-generic, `mitigations=off`).
What could be the cause?
- Data alignment? (Dubious: the intensively used arrays are aligned.)
- Code alignment? (Maybe, but I have tried printing the function pointer of `energy`; no correlation with speed.)
- Cache aliasing?
- The JCC erratum?
- Interrupts, scheduler activity?
- Some cores turbo-boosting higher? (Probably not: I tried launching it bound to a core with `taskset`, trying all cores one by one, and could not find one that was always "fast".)
- ???
Edit
- Zero-filling `x_a`, `p`, and `m` before first use appears to make no difference to the timing pattern.
- Replacing the `(i % 2)` branch with `factor *= -1.0` appears to make no difference to the timing pattern.