The program

I have a C++ program that looks something like the following:

    <load data from disk, etc.>

    // Get some buffers aligned to 4 KiB
    double* const x_a = static_cast<double*>(std::aligned_alloc(......));
    double* const p   = static_cast<double*>(std::aligned_alloc(......));
    double* const m   = static_cast<double*>(std::aligned_alloc(......));

    double sum = 0.0;
    const auto timerstart = std::chrono::steady_clock::now();
    for (uint32_t i = 0; i < reps; i++) {
        uint32_t pos = 0;
        double factor;
        if ((i % 2) == 0) factor = 1.0; else factor = -1.0;
        for (uint32_t j = 0; j < xyzvec.size(); j++) {
            pos = j * basis::ndist; // ndist is a compile-time constant == 36
            for (uint32_t k = 0; k < basis::ndist; k++) x_a[k] = distvec[k + pos];
            sum += factor * basis::energy(x_a, &coeff[0], p, m);
        }
    }
    const auto timerstop = std::chrono::steady_clock::now();

    <free memory, print stats, etc.>
where `reps` is a single-digit number, `xyzvec` has ~15k elements, and a single call to `basis::energy(...)` takes about 100 µs to return. The `energy` function is huge in terms of code size (~5 MiB of source code that looks something like this; it comes from a code generator).

Edit: The `m` array is somewhat large, ~270 KiB for this test case.

Edit 2: Source code of the two functions responsible for ~90% of execution time

All of the pointers entering `energy` are `__restrict__`-qualified and declared aligned via `__assume_aligned(...)`, and the object files are compiled with `-Ofast -march=haswell` to let the compiler optimize and vectorize at will. Profiling suggests the function is currently frontend-bound (L1i cache misses, and fetch/decode stalls).
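For illustration, the setup described above might look something like the sketch below (hypothetical signature, not the actual generated code; `__builtin_assume_aligned` is the GCC/Clang spelling of the alignment hint):

    namespace basis {
    double energy(const double* __restrict__ x_a, const double* __restrict__ coeff,
                  double* __restrict__ p, double* __restrict__ m) {
        // Promise 4 KiB alignment to the optimizer (ICC spells this __assume_aligned).
        const double* xa = static_cast<const double*>(__builtin_assume_aligned(x_a, 4096));
        double*       pp = static_cast<double*>(__builtin_assume_aligned(p, 4096));
        double*       mm = static_cast<double*>(__builtin_assume_aligned(m, 4096));
        // ... megabytes of machine-generated, branchless arithmetic on xa, coeff, pp, mm ...
        return 0.0; // placeholder for the accumulated result
    }
    } // namespace basis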

`energy` does no dynamic memory allocation or I/O, and mostly reads/writes `x_a`, `p` and `m` (`x_a` is read-only), all of which are aligned to 4 KiB page boundaries. Its execution time ought to be pretty consistent.

The strange timing behaviour

Running the program many times and looking at the time elapsed between the timer start/stop calls above, I have found that it follows a strange bimodal distribution.

  1. Calls to `energy` are either "fast" or "slow": fast ones take ~91 µs, slow ones take ~106 µs on an Intel Core i7-7820X (Skylake-X). (A per-call measurement sketch follows this list.)
  2. All calls to `energy` in a given process are either fast or slow; the metaphorical coin is flipped once, when the process starts.
  3. The outcome is not quite random, and can be heavily biased towards the "fast" case by purging all kernel caches via `echo 3 | sudo tee /proc/sys/vm/drop_caches` immediately before execution.
  4. The random effect may be CPU-dependent. Running the same executable on a Ryzen 1700X yields both faster and much more consistent execution; the "slow" runs either don't happen or their prominence is much reduced. Both machines are running the same OS (Ubuntu 20.04 LTS, 5.11.0-41-generic kernel, mitigations=off).
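A minimal sketch of how the per-call split in point 1 can be made visible (hypothetical harness, not the actual program; `call` stands for one invocation of `basis::energy`):

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Time each call individually, so a whole process run can be classified
    // as "fast" or "slow" from the median per-call latency.
    template <class F>
    void report_per_call(F&& call, std::size_t n) {
        std::vector<double> us(n);
        for (std::size_t i = 0; i < n; i++) {
            const auto t0 = std::chrono::steady_clock::now();
            call();
            const auto t1 = std::chrono::steady_clock::now();
            us[i] = std::chrono::duration<double, std::micro>(t1 - t0).count();
        }
        std::sort(us.begin(), us.end());
        std::printf("min %.1f, median %.1f, max %.1f (all in us)\n",
                    us.front(), us[n / 2], us.back());
    }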

What could be the cause?

  1. Data alignment (dubious, the intensively used arrays are aligned)
  2. Code alignment (maybe, but I have tried printing the function pointer of `energy`; no correlation with speed. See the snippet after this list.)
  3. Cache aliasing?
  4. JCC erratum?
  5. Interrupts, scheduler activity?
  6. Some cores turbo-boosting higher? (probably not; tried launching it bound to a core with `taskset`, cycling through all cores one by one, and could not find one that was always "fast")
  7. ???
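The check from point 2 amounts to a one-liner like the following (casting a function pointer to `void*` is only conditionally supported by standard C++, but it is fine as a diagnostic on this platform):

    #include <cstdio>

    // Diagnostic only: print where the hot function landed in this process image.
    std::printf("energy at %p\n", reinterpret_cast<void*>(&basis::energy));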

Edit

  1. Zero-filling `x_a`, `p` and `m` before first use appears to make no difference to the timing pattern.
  2. Replacing the `(i % 2)` branch with `factor *= -1.0` appears to make no difference to the timing pattern.
  • Off-topic: Place the `size()` result into a `const` temporary variable. The compiler may or may not perform this, depending on the optimization level and the compiler's analysis ability. Hopefully the compiler will place the temporary variable into a register. – Thomas Matthews Dec 18 '21 at 01:17
  • Off-topic: You may want to replace `pos = j*basis::ndist` with `pos += basis::ndist;` The goal is to replace the multiplication with repeated additions. IMHO, processors still prefer addition to multiplication. – Thomas Matthews Dec 18 '21 at 01:19
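For illustration, that strength-reduction rewrite could look like this (the increment moves to the end of the body so the first iteration still sees `pos == 0`):

    uint32_t pos = 0;
    for (uint32_t j = 0; j < xyzvec.size(); j++) {
        for (uint32_t k = 0; k < basis::ndist; k++) x_a[k] = distvec[k + pos];
        sum += factor * basis::energy(x_a, &coeff[0], p, m);
        pos += basis::ndist; // addition replaces the j * ndist multiplication
    }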
  • Your timings will depend on whether arrays are in the data cache or not. If the processor needs to reload the data cache, you will get a slower execution time. If you can, place all single variables into a single struct, then use a `vector` of that struct. This will help keep useful data in the cache (or hopefully on the same cache line). – Thomas Matthews Dec 18 '21 at 01:21
  • Replace `(i % 2)` with `factor = (-1) * factor;`. Processors really dislike branches. At best, the processor has to invoke branch decision logic and jump within the instruction cache. At worst, the processor invokes branch decision logic, then reloads the instruction cache. Review your code to see if you can replace branch statements by multiplication or Boolean algebra. – Thomas Matthews Dec 18 '21 at 01:25
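Applied to the outer loop, that branchless variant might read (the sign starts flipped so iteration 0 gets +1.0):

    double factor = -1.0;
    for (uint32_t i = 0; i < reps; i++) {
        factor = -factor; // +1.0 on even i, -1.0 on odd i, no branch
        // ... inner loops unchanged ...
    }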
  • @ThomasMatthews Most of the action is happening between m and p, which are freshly allocated. The timings are consistent within any given process, it does not matter how many times I call energy. The randomness comes into play whenever I relaunch the executable. – uLoop Dec 18 '21 at 01:27
  • Looks like you are at the mercy of the OS. Which OS are you using? Is it a real-time OS (with accuracy on deadlines)? – Thomas Matthews Dec 18 '21 at 01:29
  • @ThomasMatthews As I have stated in the post, Ubuntu 20.04. I am not looking for absolute consistency, but for the reason why on *some* launches of the program, all calls to `energy` are doomed to execute 10-15% slower for no apparent reason. – uLoop Dec 18 '21 at 01:33
  • Capsule summary of this question: "calls to some large, complicated, function (whose code is not shown) have undesired performance characteristic, why?" I'd say the chances of someone hitting on answer, here, are slim to none. The answer must lie somewhere in that function, and bear no relevance to any code in whatever calls it. – Sam Varshavchik Dec 18 '21 at 01:48
  • @SamVarshavchik I have shown a small excerpt of the function, there really is not much to see there, except megabytes of repetitive machine generated expressions, but I am adding a paste of the two functions which are responsible for ~90% of the execution time of `energy`. – uLoop Dec 18 '21 at 02:05
  • It's impossible to say why you're seeing variability of speed. Your OS is not real-time (so any other process, or the OS itself, could do things outside your control). You're using dynamic memory allocation/release, which increases the number of factors (e.g. memory fragmentation) that could unpredictably hurt performance. You are calling a large body of code with unspecified performance characteristics or (worse) unspecified interactions with other code or resources. In short, you're trying to retrofit predictable timing into code with timing characteristics that you don't understand. – Peter Dec 18 '21 at 02:19
  • @Peter There are no syscalls or other library calls in `energy` or any of its descendants in the call tree. I have timed batches of 15000 calls of `energy` and 100 000+, the result is the same, either they are fast or all slow. If the OS was randomly screwing around with my process occasionally, I would have seen variance within the same process, but I have not. – uLoop Dec 18 '21 at 02:28
  • I really don't like speculating without looking at the data, and that data you do not have. I would suggest placing a few performance counters (cycles, cache misses, branch misses, etc.) before and after the `energy()` call with something like this: https://github.com/HFTrader/tiny-cpp-perf-stats/blob/master/Snapshot.h Check if there is a correlation between time and cache/branch misses, then go exploring other counters. – Dec 18 '21 at 03:06
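A self-contained version of that suggestion on Linux, talking to `perf_event_open(2)` directly instead of the linked header (sketch only; error handling omitted):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Open one hardware counter for the calling thread, initially disabled.
    static int open_counter(uint64_t config) {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    }

    // Around the hot call:
    //     int fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    //     ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    //     ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    //     sum += factor * basis::energy(x_a, &coeff[0], p, m);
    //     ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    //     uint64_t misses = 0;
    //     read(fd, &misses, sizeof(misses));
    //     std::printf("cache misses: %llu\n", (unsigned long long)misses);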
  • @Jellyboy Fair enough, I will probably gather actual numbers in a few days and update the question. – uLoop Dec 18 '21 at 05:03
  • "could not find one that was always "fast"" Turbo-boost do not boost a specific core. It boost blocks of core using a dynamic strategy. If a thread use only one code, the turbo can boost the target core whatever its number. The turbo is lower with more core working so to mitigate both power and heating (the algorithm is complex and only partially published). Besides this, using low-level profiler like `perf` or VTune should help to find the problem. If this does not help, then please provide a *minimal working* reproducible example. – Jérôme Richard Dec 18 '21 at 14:01
  • There are many other things that could explain this variation, like page faults, frequency scaling due to wide SIMD instructions (typically with AVX-512), branch prediction (as pointed out in other comments), or the scheduling of cache-line reads/writes to RAM. The instruction scheduler and instruction decoding can also have an impact, although they should likely not cause a 10% variation on their own. Moreover, note that over-aligning data is not always better, due to cache thrashing (cache associativity is limited). – Jérôme Richard Dec 18 '21 at 14:10
  • @JérômeRichard Intel Skylake-X CPUs have "Turbo Boost Max 3.0", which means there are 2 designated cores (the ones with the highest silicon quality) which can boost 200 MHz higher than the rest. The program was compiled to only use up to AVX2, has low FLOPs/byte, and the vast majority of time (according to `perf`) is spent in completely branchless functions. – uLoop Dec 18 '21 at 15:26
  • It's hard to say what could be causing the speed changes. I wouldn't be able to resist using [*this technique*](https://stackoverflow.com/a/378024/23771) to get the performance down to optimal, before asking your question. – Mike Dunlavey Dec 18 '21 at 15:51
  • @MikeDunlavey That would not be very informative in this case, sadly. I already know from profiling where most of the time is spent...it is nearly uniformly distributed over megabytes of machine generated code that has no loops, branches or library/system calls. – uLoop Dec 18 '21 at 16:35
  • For the frequency scaling, this is a [bit more complex](https://stackoverflow.com/a/56861355/12939557), but based on what you said I do not expect this to be the issue. I just saw the generated code meanwhile, and I do not see how the processor can execute this quickly. The decoding and instruction cache are *very likely* the issue here (saturated) given the HUGE code. I just took 70 simple lines and built them with GCC and `-O3 -mavx2`; the result takes 1 KiB for only 120 instructions. It is worse with `-O2`. The performance of decoding and the instruction cache is often not very stable. – Jérôme Richard Dec 18 '21 at 16:58
  • @JérômeRichard Yes indeed, the code size is huge and L1i misses are an issue according to VTune. I am also trying to rewrite parts of the code generator to produce loops when it can, but I need to measure whether that actually makes `energy` faster or slower, which is how I got frustrated with this variance in execution time. – uLoop Dec 18 '21 at 17:10
  • The immediate constants are not free. Reading the indices from memory and using loops can be better. Perhaps you have to write your own optimizer. Or (locally) prepare the data before calculation. – Sebastian Dec 18 '21 at 17:32
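A table-driven shape for what that last comment describes, as a rough sketch (the `Term` layout and `terms` table are hypothetical; the real generator would emit them offline):

    // Instead of one giant unrolled expression per term, emit index/coefficient
    // tables once and evaluate them with a single, icache-friendly loop.
    struct Term { uint16_t a, b; double c; }; // hypothetical term descriptor
    double acc = 0.0;
    for (const Term& t : terms)               // 'terms' produced by the generator
        acc += t.c * x_a[t.a] * x_a[t.b];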
