
I am simply experimenting with the [[unlikely/likely]] branch hints available when compiling with /std:c++latest.

To do this I am using a modified version of the code provided in Appendix A (at the bottom) of http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0479r0.html:

#include <array>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

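// Saturate i into [0, 0xFFFF]; the attribute under test annotates the first branch.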
std::uint16_t clamp(int i) {
    if (i < 0) [[unlikely]] {
        return 0;
    }
    else if (i > 0xFFFF) {
        return 0xFFFFu;
    }

    return i;
}

int main() {
    std::mt19937 gen(42);
    std::bernoulli_distribution d(.001), up_down(.5);

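    // Fixed seed, so every run sees the same data: ~0.1% of elements are
    // outliers (-1 or 0xFFFFF) that take a clamp branch; the rest are 1.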
    std::vector<int> data;
    for (std::size_t i = 0; i != 1000000; ++i) {
        if (d(gen)) {
            data.push_back(up_down(gen) ? -1 : 0xFFFFF);
        }
        else {
            data.push_back(1);
        }
    }

    auto Prev = std::chrono::high_resolution_clock::now();

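    // Hot loop: clamp all 1,000,000 elements 10,000 times; using result in the
    // return value keeps the work from being optimized away.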
    std::uint32_t result = 0;
    for (int i = 0; i != 10000; ++i) {
        for (auto val : data) {
            result += clamp(val);
        }
    }

    auto After = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> Elapsed = After - Prev;
    std::chrono::milliseconds Time = std::chrono::duration_cast<std::chrono::milliseconds>(Elapsed);
    std::cout << (Time.count()) << '\n';

    return result > 5 ? 1 : 2;
}

Note: all compilation was done in x64 Release mode with /O2 on the latest version of Visual Studio 2019 (16.10.3), running on an Intel(R) Core(TM) i7-7700HQ 2.8GHz CPU.

I then tested how using [[unlikely]], [[likely]], and no attribute would affect the time taken.
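For clarity, the only difference between the three builds was the attribute on the first branch of clamp; the [[likely]] build, for example, just swaps the attribute (and the no-attribute build drops it entirely):

std::uint16_t clamp(int i) {
    if (i < 0) [[likely]] {
        return 0;
    }
    else if (i > 0xFFFF) {
        return 0xFFFFu;
    }

    return i;
}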

Testing [[unlikely]] first, I found that on my system it took ~6750ms to run. I then tested [[likely]], which also took ~6750ms. At first I thought this was probably just because the attributes are hints and the compiler is free to ignore them. I then tested no attribute (simply not using [[unlikely/likely]]) and the code ran in ~9000ms. But if the compiler had been ignoring the attributes, the no-attribute build should give the same time of ~6750ms, right?

I then went back to test [[likely]] and found that it now ran in ~9000ms??? At this point I was really confused, so I went back to check [[unlikely]] and found that it ran in ~6750ms. Checking [[likely]] again, it now ran at ~6750ms.

This was super strange, and after some digging through the generated assembly I found that recompiling the exact same code would sometimes produce 6000+ lines of assembly and sometimes under 2000??? After some more googling I found a peculiar comment:

"Why godbolt generate different asm output than my actual asm code in Visual Studio?"

And now I am even more confused: why would MSVC generate different code?

Going back to my experiment, I found these results based on compilation order:

  • No attribute => [[unlikely]] => [[likely]]

    ~9000ms => ~6750ms => ~6750ms, and subsequent [[likely]] compilations continue at ~6750ms. Switching back to [[unlikely]] gets the same result, whilst switching back to no attribute gets ~9000ms.

  • No attribute => [[likely]] => [[unlikely]]

    ~9000ms => ~9000ms => ~9000ms, and subsequent [[unlikely]] compilations get the same ~9000ms result. The same holds when switching to [[likely]]; the behaviour shown in the first sequence only appears when compiling after a no-attribute compilation.

To make sure this wasn't some random quirk, I ran these tests multiple times without recompiling at each step (simply by running the generated .exe from File Explorer) and got consistent results. I even saved the .exe from each different sequence of compilations and ran them again, getting the same bizarre results.

I have yet to compare the .exe bytes, but I want to avoid doing so as I am not very well versed in that area ^-^
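(If I do get around to it, even a crude byte-for-byte check would tell whether two saved builds are identical. A minimal sketch, with placeholder file names for two of the saved executables:)

#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    // Read both saved executables fully into memory.
    std::ifstream a("clamp_fast.exe", std::ios::binary);
    std::ifstream b("clamp_slow.exe", std::ios::binary);
    std::vector<char> bytesA((std::istreambuf_iterator<char>(a)), std::istreambuf_iterator<char>());
    std::vector<char> bytesB((std::istreambuf_iterator<char>(b)), std::istreambuf_iterator<char>());

    // Byte-for-byte comparison; differing lengths also count as different.
    std::cout << (bytesA == bytesB ? "identical\n" : "different\n");
}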

Am I missing something, or is this a bug? It all seems so strange how inconsistent everything is.

  • I should probably mention that I do not know if this is specific to my system or not and have not been able to test this on a different PC. – name Jul 06 '21 at 19:59
  • I compiled it with `g++`, and the attribute was ignored as well (Note: I didn't benchmark it, but went through the decompiled assembly). I thought that it was optimised away, so I ran it with `-O0`, but no change. – Lala5th Jul 07 '21 at 18:41
  • @Lala5th I'm fairly sure this is because `g++` and `msvc` have different syntax for the `[[unlikely/likely]]` tags (not sure though, simply saw differences online), so if you placed it in the same position as I have then it may simply be ignored. I think you have to place it at the end (on the last line of the if statement) rather than in front for `g++`. My main problem, though, is the inconsistent results depending on compilation order, which only seems to be a problem in `msvc`. – name Jul 08 '21 at 17:58
  • @name [the attributes `[[unlikely/likely]]` are standard](https://en.cppreference.com/w/cpp/language/attributes/likely) so obviously the syntax must be the same and both attributes must behave the same in all standard-compliant compilers – phuclv Jul 18 '21 at 00:50
  • @name: the [likely-unlikely tag wiki](https://stackoverflow.com/tags/likely-unlikely/info) and popup info previously suggested some wrong ideas about hinting runtime branch prediction. They don't, but might affect the compiler's choice to use a branch or go branchless. (which can make a difference, like in [gcc optimization flag -O3 makes code slower than -O2](https://stackoverflow.com/q/28875325)) – Peter Cordes Jul 18 '21 at 01:56
  • Your code doesn't compile, it's missing some headers like `#include <vector>`. – Peter Cordes Jul 18 '21 at 01:59
  • Looks like MSVC uses a similar strategy for the inner loop either way, using `test`/`jns` or `js` for the `<0` clamping and `cmp`/`cmovg` for the other condition. I did "reveal linked code" on the loop, then moused over the clamp function's statements to highlight the relevant lines. But the one difference is using `js` (to a block at the bottom that does movzx and jumps back) else fall into cmp/cmov, vs. having it all contiguous with a `jns` over a movzx/jmp. https://godbolt.org/z/GKsxvqocn It could optimize away a later zero-extension, e.g. by using 32-bit operand-size `cmov`, though :/ – Peter Cordes Jul 18 '21 at 02:10
  • @PeterCordes Thanks, I've included the missing header now. I'm kind of lost by your other comment though (I'm not very well versed in assembly). My main problem was simply just the results being inconsistent with subsequent compilations. – name Jul 18 '21 at 02:55
  • 1
    How repeatable is this? If you run the same executable repeatedly, does it ever switch from being slow to fast or vice versa? Or if you run different ones in sequence, does it always have them at the same speeds for each one? It might be some quirk of ASLR in loading the program or how it allocates memory during startup, leading to different prefetching or branch prediction or something. Probably not CPU-frequency effects since the total time is long enough that it should hide the early start-up time, although if your laptop heats up enough that could affect max turbo. – Peter Cordes Jul 18 '21 at 03:56
  • Can you repro this bi-modal performance effect with multiple copies of the *same* executable? (Which Windows might cache differently, in case that matters). Perhaps only some builds are susceptible to the effect (for example only the [[likely]] one), if there's one that *always* runs at 9s instead of 6.75. Or it could be that the effect is totally unrelated to the code differences, and instead just a matter of branch predictors learning from one run and that affecting another (if it happens to run on the same core). But they should re-learn in milliseconds. – Peter Cordes Jul 18 '21 at 03:57
  • Also, unless you run the next one *right away* after, e.g. from a script, Windows will probably put the CPU core into a deep sleep if it stays idle for a few hundred ms or whatever the threshold is, and that powers down the per-core caches including branch prediction. – Peter Cordes Jul 18 '21 at 04:01
  • @PeterCordes `If you run the same executable repeatedly, does it ever switch from being slow to fast or vice versa?` No, running the same executable repeatedly yields consistent results. `Or if you run different ones in sequence, does it always have them at the same speeds for each one?` Yes, for example if I run a `likely` version compiled from the 6750ms group it will continue to run at 6750ms even after running a `likely` version compiled from the 9000ms group. The executables seem to individually have consistent results. – name Jul 18 '21 at 05:27
  • @PeterCordes As to why two executables compiled from the same code gives different results boggles my mind haha. – name Jul 18 '21 at 05:27
  • 1
    @PeterCordes `Can you repro this bi-modal performance effect with multiple copies of the same executable? (Which Windows might cache differently, in case that matters)` Each copy of a given executable yields the same result as the original, so I don't think this is the case. As for branch predictors learning from one run and affecting the other, I doubt this is the case as each executable is individually consistent and if I run the executables in different orders (even upon system restart) results remain consistent. – name Jul 18 '21 at 05:31
  • Are the .exe files bytewise-identical (like same MD5 sum)? If not, it could be that the *compiler* is randomizing something that affects relative alignment of different pieces of code, which in turn affects branch prediction (whether two branches alias the same predictor entries or not). – Peter Cordes Jul 18 '21 at 05:34
  • The other plausible mechanism (for identical files) is for mapping the pages from disk cache into process address space to matter somehow, so two copies of the same file bytes could possibly perform differently if something about disk cache pages for one of them is different. This would produce persistent / consistent behaviour within one set of runs (while the pages stay hot in RAM), but could change after reading some large files or suspend/resume or whatever to force Windows to re-read them from disk. – Peter Cordes Jul 18 '21 at 05:35
  • Ok, your last comment mostly rules out the identical-file hypothesis. It was a bit of a stretch, and might not be a real effect ever. (And I can't think of a specific mechanism that would cause such effects.) – Peter Cordes Jul 18 '21 at 05:37
  • @PeterCordes `Or it could be that the effect is totally unrelated to the code differences` On a super surface level comparison the bytes of 2 executables compiled with the same code (just from a different compilation sequence) are very different, but I'll have to read up on how windows packs its executables to be sure. – name Jul 18 '21 at 05:39
  • To create this effect (of branch aliasing or not), whatever MSVC is randomizing is presumably variable-length, changing the offsets (and run-time addresses) of everything later in the file. So you'd expect most of the bytes to match *at some small offset* (except for ones that depend on relative or absolute address within the file), but a byte-for-byte compare that only looks at the same positions would find almost every byte different. Comparing text disassembly of the hot loop would be most relevant, to see changes in address alignment. – Peter Cordes Jul 18 '21 at 05:41
  • Oh, could be something unrelated to branch prediction, like one of the new performance potholes Intel has introduced into your CPU with microcode updates, which compilers don't avoid by default (yet). [Code alignment dramatically affects performance](https://stackoverflow.com/q/67442222) - try with MSVC's `/QIntel-jcc-erratum` option to actively avoid that problem; if that makes executables that are always-fast, then that's it. – Peter Cordes Jul 18 '21 at 05:45
  • @PeterCordes I don't think a bytewise-identical hash would identify if 2 executables are the same since I think Windows packs its executables in some sort of way resulting in them being different anyway. However I think the compiler is definitely doing something, but I don't think it's completely random since the sequence described in the question always results in those given results. – name Jul 18 '21 at 05:49
  • @PeterCordes I'll try /QIntel-jcc-erratum now, but on a side note compiling with /Gy- seems to make `[[likely/unlikely]]` work properly?? As in they now give results as expected and are consistent regardless of compilation sequence. – name Jul 18 '21 at 05:51
  • @PeterCordes Compiling with /QIntel-jcc-erratum seems to make `[[likely/unlikely]]` irrelevant and the code always runs in 6750ms regardless of the tags. – name Jul 18 '21 at 05:53
  • Ok, that confirms that theory, then. We already knew that either (or no) attribute *could* lead to fast code, and there were only two speeds. So neither branch layout to minimize front-end bottlenecks mattered, as long as the code can run from the uop cache. (Allowing generally high uop throughput, although still potentially not as good as the back-end). – Peter Cordes Jul 18 '21 at 06:02
  • @PeterCordes I see, so if I am understanding this correctly this was not a problem with branch prediction but with how the program was cached? (I'm not too sure what you mean by uop cache, with a quick Google seems like some microarchitecture shenanigans). – name Jul 18 '21 at 06:06
  • @PeterCordes I assume /Gy also caused similar problems which was why /Gy- seemed to fix them. I should note that with /Gy- the `unlikely` tag resulted in 6750ms and `likely` with 9000ms (Note this was consistent to the tags now rather than compilation sequence) unlike the consistent 6750ms which /QIntel-jcc-erratum gave. – name Jul 18 '21 at 06:09
  • And BTW, better compilers than MSVC can auto-vectorize that code, especially if SSE4.1 or AVX2 packed min/max of 32-bit integers are available. e.g. clang `-O3 -march=skylake` https://godbolt.org/z/6n5cP3YWr should be running at about 8 `int` elements per clock cycle (or bottlenecked on mem bandwidth), branchlessly so not data-dependent at all, with no chance of branch misprediction. – Peter Cordes Jul 18 '21 at 06:09
  • Yes, exactly, nothing to do with branch prediction, just microarchitectural shenanigans. The uop cache caches decoded instructions. x86 machine code is hard (and energy-intensive) to decode, so CPUs try to avoid it in tight loops. And switching back and forth between uop cache and legacy decode introduces extra problems. [32-byte aligned routine does not fit the uops cache](https://stackoverflow.com/a/61016915) links to a Phoronix article about the problem. `/Gy-` presumably just changed function layout in such a way that no case of alignment put a branch from the hot loop on a 32-byte boundary. – Peter Cordes Jul 18 '21 at 06:13
  • @PeterCordes I see, thanks a lot for helping to solve this mystery! – name Jul 18 '21 at 06:13
  • IDK whether to post an answer. I don't know why or how MSVC randomizes code alignment on different builds, and that + the JCC erratum is the answer. It's not an exact duplicate of *just* those JCC erratum questions, though. Or is it? [Code alignment dramatically affects performance](https://stackoverflow.com/q/67442222) does talk about Visual Studio making loops with different alignment. Yeah, I think it's a duplicate. – Peter Cordes Jul 18 '21 at 06:16
  • @PeterCordes It might be a duplicate; I simply lacked the knowledge beforehand to tell whether this was a problem with code alignment (I didn't even realise this could be a problem!). If you don't post an answer I might answer my own question summarising what was discussed in these comments. – name Jul 18 '21 at 06:19
  • Oh yeah, that doesn't reflect badly on you, of course you wouldn't have found that duplicate. I was just thinking out loud about whether to close as a duplicate *now that we have identified that was the problem* (and checking with you about whether there were still any unanswered parts of the question that would merit a separate answer). I didn't change my upvote after deciding it was a duplicate, because this question could possibly be a useful signpost to point future searchers at the right answer. – Peter Cordes Jul 18 '21 at 06:26
  • Speaking of signposts, I edited the question title with the factors that are truly indicative of this problem. [[likely]]/[[unlikely]] isn't actually relevant, except that someone playing around with them is likely to have code with branches inside a key loop, other than just the loop branch. – Peter Cordes Jul 18 '21 at 06:30
  • @PeterCordes Yup sounds good! I appreciate the help :D Thanks for taking the time to clear up this mess haha. – name Jul 18 '21 at 06:36
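As a follow-up to the auto-vectorization comment above: a branchless formulation of clamp (a sketch using std::clamp, not what MSVC emitted in these builds) avoids the data-dependent branches entirely, which is what lets compilers emit conditional-move or packed min/max instructions:

#include <algorithm>
#include <cstdint>

// Same behaviour as the branchy clamp above, but expressed as min/max so
// there is no branch whose layout or prediction can affect timing.
std::uint16_t clamp_branchless(int i) {
    return static_cast<std::uint16_t>(std::clamp(i, 0, 0xFFFF));
}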

0 Answers