
I am seeing some very weird compiler behavior: G++ pulls computations into a hot loop, severely reducing the performance of the resulting code. What is going on here?

Consider this function:

#include <cstdint>

constexpr bool noLambda = true;

void funnyEval(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int32_t iter, int32_t limit, int32_t* writer, const int32_t* dictPtr2) {
    // Computation X1
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    // Computation X2
    const uint16_t* data = reinterpret_cast<const uint16_t*>(columnData + dataOffset);
    // 1. The less broken solution without lambda
    if (noLambda) {
        for (; iter != limit; ++iter) {
            int32_t t = dictPtr[data[iter]];
            *writer = t;
            writer++;
        }
    }
    // 2. The totally broken solution with lambda
    else {
        auto loop = [=](auto body) mutable { for (; iter != limit; ++iter) { body(iter); } };
        loop([=](unsigned index) mutable {
            int32_t t = dictPtr[data[index]];
            *writer = t;
            writer++;
        });
    }
}

The problem here is that G++ somehow loves to pull computations X1 and X2 into the hot main loop, reducing the performance. Here are the details:

The function simply iterates over an array `data`, looks up a value in a dictionary `dictPtr`, and writes it to a target memory location `writer`. `data` and `dictPtr` are computed at the beginning of the function. The function has two flavors of the loop: one with a lambda, one without.

(Note that this function is just a minimal working example of much more complicated code. So please refrain from commenting that the lambda is unnecessary here. I am aware of this fact and in the original code it is necessary, unfortunately.)

The problem, when compiling with the latest g++ (tried 8.1 and 7.2; the same problem shows up with older g++ versions, as you can see in the godbolt links provided) at a high optimization level (-O3 -std=c++14), is the following:

Solution 2 (noLambda=false) generates very bad code for the loop, even worse than the "naive" solution, because the compiler decides it is a good idea to pull Computations X1 and X2, which belong outside the super-hot main loop, into that loop, making it around 25% slower on my CPU.

https://godbolt.org/g/MzbxPN

.L3:
        movl    %ecx, %eax          # unnecessary extra work
        addl    $1, %ecx
        addq    $4, %r9             # separate loop counter (pointer increment)
        leaq    (%rdi,%rax,2), %rax # array indexing with an LEA
        movzwl  (%rax,%rsi), %eax   # rax+rsi is Computation X2, pulled into the loop!
        leaq    (%rdi,%rax,4), %rax # rax+rdx is Computation X1, pulled into the loop!
        movl    (%rax,%rdx), %eax   
        movl    %eax, -4(%r9)
        cmpl    %ecx, %r8d
        jne     .L3

When using a plain for loop (noLambda=true), the code is better: X2 is no longer pulled into the loop, but X1 still is:

https://godbolt.org/g/eVG75m

.L3:
        movzwl  (%rsi,%rax,2), %ecx
        leaq    (%rdi,%rcx,4), %rcx
        movl    (%rcx,%rdx), %ecx # This is Computation X1, pulled into the loop!
        movl    %ecx, (%r9,%rax,4)
        addq    $1, %rax
        cmpq    %rax, %r8
        jne     .L3

You can verify that this really is X1 in the loop by replacing `dictPtr` (Computation X1) in the loop with `dictPtr2` (a parameter); the instruction vanishes:

https://godbolt.org/g/nZ7TjJ

.L3:
        movzwl  (%rdi,%rax,2), %ecx
        movl    (%r10,%rcx,4), %ecx
        movl    %ecx, (%r9,%rax,4)
        addq    $1, %rax
        cmpq    %rax, %rdx
        jne     .L3

This is finally the loop as I want to have it: a simple loop that loads the values and stores the results without pulling random computations into it.

So, what is going on here? It is seldom a good idea to pull a computation into a hot loop, but G++ seems to think otherwise here, and it is costing me real performance. The lambda exacerbates the whole situation: it leads G++ to pull even more computations into the loop.

What makes this problem so severe is that this is really trivial C++ code without fancy features. If I cannot rely on my compiler to produce perfect code for such a trivial example, I will need to check the assembly of all hot loops in my code to make sure everything is as fast as it could be. It also means that a huge number of programs are probably affected by this.

gexicide
  • Notice that the [gcc-help](https://gcc.gnu.org/ml/gcc-help/) mailing list might be a better place to ask – Basile Starynkevitch May 28 '18 at 17:10
  • I can't speak to your code in particular, but seemingly redundant reloading can happen because the compiler tries to guard against aliasing. – melpomene May 28 '18 at 17:14
  • I am more surprised at lack of vectorization. I guess there isn't a vectorized "load data from N 16 bit offsets of another pointer" and unpacking costs more than the load. – Yakk - Adam Nevraumont May 28 '18 at 17:17
  • @melpomene `-fstrict-aliasing` doesn't seem to make a difference. However, `columnData` is a character pointer, so it could alias the writer output block? – Yakk - Adam Nevraumont May 28 '18 at 17:19
  • [Clang does what I expect](https://godbolt.org/g/JvXioB) with the one that looks a bit more standardized (half-open ranges, no reinterpreting bits next to business logic, etc). It unrolls the loop and picks which unrolling after some initial work. Not vectorized, just flattened. – Yakk - Adam Nevraumont May 28 '18 at 17:33
  • @melpomene: Aliasing shouldn't be the problem here. If two pointers can alias, then the compiler cannot *elide* a load in a loop. However, there is no load in a loop to be elided here. There are just stack variables and arithmetic performed before the loop (neither X1 nor X2 contains a load, and even if they did, they would still be in front of the loop, not in the loop). – gexicide May 28 '18 at 17:45
  • @Yakk-AdamNevraumont: You are right, clang seems to do fine here. Note also that the code as posted here is not the original code but a minimal working example. In our code, the reinterpretation is also not done in this function but wrapped in an accessor method. However, the compiler inlines this accessor, effectively leading to this code and this problem. So even in more "standardized" code, this can happen due to inlining. – gexicide May 28 '18 at 17:46
  • "making it around 25% slower on my CPU" How did you measure it? By analyzing the assembly code, it seems to me that it should not be slower. – Hadi Brais May 28 '18 at 17:59
  • @HadiBrais Can you explain your analysis? Doing more work in the loop body sure seems to me like it should be slower. – hvd May 28 '18 at 18:05
  • @hvd Sure. I think I can write an answer. But I just want to know the basis for this statement "making it around 25% slower on my CPU". In fact, what the compiler has done is a bit better than what OP wants. – Hadi Brais May 28 '18 at 18:09
  • "I will need to check the assembly of all hot loops in my code to make sure everything is as fast as it could be" -- yes. That is what programming is: write code, check asm, repeat as necessary. – harold May 28 '18 at 21:50
  • GCC is often dumb and puts an extra register-copy or something inside a loop so it will have the state it wants on loop exit. I've seen that before, but usually only with registers, not with memory. – Peter Cordes May 28 '18 at 22:54
  • @Yakk-AdamNevraumont: The OP only compiled with `-mavx` and the default `-mtune=generic`, so gather instructions weren't available, and weren't worth using. With `-march=skylake`, a `vpmovzxwd (%rdi), %ymm0` SIMD 16->32 zero-extending load to generate a vector of 32-bit indices for an AVX2 gather (`vpgatherdd`) would have been good, so that's a missed-optimization. Doing a single 256-bit store instead of 8 32-bit stores would free up pipeline throughput to run the gather uops faster. Using `__builtin_assume_aligned` to tell the compiler that the indices and dict are aligned doesn't help :/ – Peter Cordes May 28 '18 at 23:04
  • as well as doing `columnData += dictOffset` before the loop, clang 6 also unrolls the loop 4 times whereas gcc does not (IDK if that is actually a help) – M.M May 29 '18 at 00:14
  • @M.M: gcc `-O3` doesn't enable `-funroll-loops`, but `-fprofile-use` does (because with PGO it knows which loops are hot and worth spending code-size on.) It should help a lot, especially if the dict stays hot in cache. Haswell and later can do 2 loads + 1 store per clock, which costs 3 uops, but even the optimized loop with `inc` and `cmp/jcc` is a total of 5 uops, so it bottlenecks on the front-end issue throughput of 4 fused-domain uops per clock. – Peter Cordes May 29 '18 at 00:20
  • Oops, missed the fact that the store is using an indexed addressing mode, so it can't use the AGU on port 7. So the loop can only sustain 2 memory ops per clock, not 3. It does stay micro-fused on Haswell and later, though. (SnB delaminates it into 2 uops.) – Peter Cordes May 29 '18 at 00:33
  • @HadiBrais: The 25% come from a benchmark that simulates a real workload on our system. The benchmark needs around 40 seconds with the last loop in my example and over 50 with the first loop. Why should a loop with two extra instructions that even go to memory not be slower than one without them? – gexicide May 29 '18 at 06:48
  • @gexicide: `leaq (%rdi,%rax,2), %rax` is an ALU instruction, not a load. The extra instructions are all ALU. But yes, the bloated versions of the loop will easily bottleneck on front-end throughput on most CPUs, if the dictionary is mostly hot in cache. IDK why @HadiBrais thinks the extra instructions wouldn't matter. What HW are you testing on? On Haswell/Skylake, your tightest loop should run at one per 1.5 cycles (6 uops, and saturating ports 2 and 3), while the worst one is 9 uops (one per ~2.25 cycles, but see [this Q&A](https://stackoverflow.com/q/39311872/)). – Peter Cordes May 29 '18 at 07:02
  • With AVX2 on Skylake, you could vectorize this to run at nearly 2 dictionary lookups per cycle (barring cache misses), better than twice as fast as the scalar loop. (And faster than you could go even with an unrolled + well-optimized scalar loop.) Broadwell gathers take more uops, but might still be worth it. Haswell, possibly worth it but IDK. See http://agner.org/optimize/ for uop counts / throughputs for vpgatherdd. I started typing an answer, but are you interested in an x86 SIMD vectorization, or is your real code too much more complex to just load a vector of indices? – Peter Cordes May 29 '18 at 07:05
  • Anyway, clearly you're hitting a serious missed-optimization bug with gcc. – Peter Cordes May 29 '18 at 07:07
  • @PeterCordes: What worries me so much is that this does not even seem to be a missed optimization to me, it is rather a deoptimization introduced by the compiler. It could even lead to observably different behavior if `writer` and `columnData` aliased (they don't, but the compiler can't know this). We don't compile for AVX2 yet, unfortunately. – gexicide May 29 '18 at 09:35
  • missed-optimization is the gcc bugzilla keyword (https://gcc.gnu.org/bugzilla/) for making sub-optimal code. But it's definitely not a correctness problem. It's "just" re-doing computations using function-arg values in registers. It still does exactly what the source says regardless of overlap. I'm not sure what you're misunderstanding about the asm, but there are still only 3 loads. LEA is an ALU instruction. (And BTW, if this is perf-critical, you might consider runtime dispatching to select an AVX2 version of the function on CPUs where that's faster: probably BDW and newer). – Peter Cordes May 29 '18 at 09:46
  • @PeterCordes Good points. I think that the bottleneck is the dependency chain that includes three memory accesses, not the computation of X1 and X2. The version with `noLambda=false` starts with three simple instructions that can be dispatched in the same cycle, then we hit a 5-instruction dependency chain. The loop control instructions will be in the way. The indexed store cannot be dispatched to p7. In the `noLambda=true` version, the extra LEA has only a single cycle of latency. The compiler could've eliminated it by putting `add %rdi, %rdx` before the loop... – Hadi Brais May 29 '18 at 16:13
  • ...but I doubt that it has any major impact on performance. The dependency chain is still there. In the third version, there are fewer instructions, but again, there is a dep chain of three memory instructions that can be dispatched to only two ports per cycle. I don't see how there can be a 25% penalty between the first and third versions. We need to know the microarch. – Hadi Brais May 29 '18 at 16:13
  • @HadiBrais: The dep chain isn't loop-carried. The only dependent load is the table lookup. All relevant x86 CPUs do out-of-order execution. And the loop buffer in HSW/SKL (but maybe not SnB) can efficiently handle loops that aren't a multiple of 4 uops. [Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/q/39311872). The bottlenecks are just total-throughput things like the front-end and load/store ports. The fastest version can saturate p2/p3 with ideal front-end behaviour, but the rest just bottleneck on the front-end. – Peter Cordes May 29 '18 at 23:47
  • @HadiBrais: Some cache misses could explain only seeing a 25% speedup. Regardless, it's obviously bad code-gen by gcc to put more uops in the loop, making it harder for OoO exec to keep more cache-miss loads in flight to max out memory parallelism for the dict lookups. – Peter Cordes May 29 '18 at 23:49
  • @PeterCordes Yes. All of the versions exhibit the same exact memory access pattern, but the impact of cache latency may vary depending on the number of outstanding loads. This might be a bigger issue for the version with `noLambda=false`, but I don't think this would be that big of an issue for the version with `noLambda=true` (the OP did not measure the perf of this version). Using `-funroll-loops` will probably improve the perf of all the versions, and make the perf of the 2nd and 3rd versions very close. Unrolling removes loop control insts out of the way, reducing the impact of X1/X2. – Hadi Brais May 30 '18 at 00:15
  • @gexicide Can you compare the perf between the second and third version? Also, try with `-funroll-loops`. – Hadi Brais May 30 '18 at 02:11
  • BTW, I just saw this same bug in a loop with compile-time-constant trip count, in [selectively xor-ing elements of a list with AVX2 instructions](https://stackoverflow.com/q/50583718). Pulling an `lea` end-pointer into the cleanup loop. So it's not just that gcc thought the loop might often run 0 times or something. – Peter Cordes May 30 '18 at 13:27
  • Thanks for all the tips on how I can get the loop even faster! I appreciate them and will try them out for sure. However, the core of this question is not how to get this particular example as fast as possible. Instead, I am rather astonished that the compiler can produce such deoptimizations in such a simple snippet of code. My main problem now is: I have somehow lost faith in my compiler, so I want to understand how this can happen so I can regain it. – gexicide May 30 '18 at 21:11
  • Don't have faith in gcc. It's a very complex piece of machinery that often produces good results, but rarely produces *optimal* results. This case is one of the most obvious and simple wrong choices I've seen its optimizer make (and is pretty disappointing), but IDK what kind of answer this question can have other than "it's a bug; report it". – Peter Cordes May 30 '18 at 23:45
  • But many loops bottleneck on memory, not uop throughput. Modern CPUs are designed to chew through the wasted instructions that compilers, especially JIT compilers, generate. This is why missed optimizations like this usually don't have a big impact at the macro scale, and why cases where they do matter (e.g. video encoders or matrix multiplies) often still use blocks of hand-written asm. (I'll turn this into an answer at some point if I have time.) – Peter Cordes May 30 '18 at 23:48
  • @gexicide I agree that it is a de-optimization, but it would not break if `writer` and `columnData` pointed to overlapping space. The code still writes to the correct memory locations, it just recalculates those locations repeatedly (and the wasted calculation only depends on parameter values) – M.M May 31 '18 at 04:12
  • @PeterCordes that would make a good answer in lieu of a specific explanation from someone who worked on gcc's optimizer – M.M May 31 '18 at 04:13
  • This does not solve your problem, but your code is essentially a `std::transform`: `std::transform(data+iter, data+limit, writer, [=](const uint16_t& t) {return dictPtr[t];});`. In fact you get almost the same behavior: clang unrolls your loop, gcc recomputes `dictPtr`, but works fine with `dictPtr2`. – chtz May 31 '18 at 12:19
  • It has to do with the cast to unsigned inside your loop. Replace `unsigned index` with `int32_t index` and you get the same assembly as the version without lambda. If you replace `data[iter]` with `data[(unsigned)iter]` in the raw-loop version you get the same output as in the lambda version. So this has not much to do with the lambda, but more with the int -> unsigned. – sbabbi May 31 '18 at 21:43
  • Also important to note that the double array lookup is indistinguishable from a Spectre attack, and its performance is going to get stomped on by the mitigations. – Ben Voigt Jun 02 '18 at 23:29
  • Disagree with the last edit by BeeOnRope; the correct line was marked originally. The leaq is offsetting `columnData` by the value of `data[iter]`; the next line is adding `dictOffset` to the result – M.M Jun 03 '18 at 00:37

3 Answers


You are using an unsigned 32-bit type for the array index (the `unsigned index` parameter of the inner lambda). This forces the compiler to consider, at each step through the loop, whether you might have overflowed its available range, in which case it would need to wrap back to the beginning of the array. The extra code you see is related to this check! There are at least three ways to avoid this over-cautious approach by the compiler:

  1. Use a 64-bit type for the lambda's index parameter. Now the compiler knows you will never wrap around the array, and it generates the same code as the version without the lambda.
  2. Use a signed 32-bit type for the lambda's index parameter (a minimal sketch of this fix follows the list). Now the compiler no longer cares about overflow: signed overflow is considered UB and is therefore ignored. I consider this to be a bug in the interpretation of the standard, but opinions differ on this.
  3. Make it clear to the compiler that an overflow can never occur, by adding a line `int32_t iter = 0;` at the beginning of the function and removing `iter` from the parameter list. Clearly this does not solve your problem, but it illustrates that it is the overflow analysis that causes the extra code to be generated.
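For illustration, here is fix 2 applied to the lambda version from the question (a minimal sketch; the name funnyEvalFixed is mine, and the unused dictPtr2 parameter is dropped):

#include <cstdint>

void funnyEvalFixed(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int32_t iter, int32_t limit, int32_t* writer) {
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    const uint16_t* data = reinterpret_cast<const uint16_t*>(columnData + dataOffset);
    auto loop = [=](auto body) mutable { for (; iter != limit; ++iter) { body(iter); } };
    loop([=](int32_t index) mutable { // was 'unsigned index'; signed overflow is UB, so no wrap-around check
        int32_t t = dictPtr[data[index]];
        *writer = t;
        writer++;
    });
}

As sbabbi's comment under the question notes, it is this int -> unsigned conversion, not the lambda itself, that triggers the extra code.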

You aren't complaining about the code before the loop starts, but here you have the same problem. Just make `iter` and `limit` `int64_t`, and you'll see it gets considerably shorter, as the compiler no longer considers the possibility of wrapping around the array.
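For example, only the two index parameters change (hypothetical signature):

void funnyEval(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int64_t iter, int64_t limit, int32_t* writer, const int32_t* dictPtr2);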

So to recap: it is not the calculation of X1 and X2 being moved into the loop that causes the size to balloon, but the use of an incorrectly-typed array index variable.

H. Guijt
  • The iterator type is `int32_t`, which is a signed type, so the compiler can assume it doesn't wrap (this is a benefit of the often-maligned "signed overflow is undefined" property of C++). – BeeOnRope Jun 01 '18 at 22:54
  • @BeeOnRope: So GCC has enough information to prove it doesn't need to worry about overflow, but in practice doesn't manage to take advantage. This still doesn't fully explain the sub-optimal code from the `if(noLambda)` branch which doesn't involve any `unsigned` vars, but interestingly both X1 and X2 stay out of the loop if you remove that `if` entirely and just use the `noLambda` loop. So there are obviously missed optimizations. This is a useful observation about how to help gcc figure it out and where the bugs might be, though. – Peter Cordes Jun 02 '18 at 00:21
  • Oh yeah, I was talking about the noLambda branch only; I didn't look at the other one. I removed the lambda case like you did, but observed that the `lea` stayed in (see the clang link in my last comment on Hadi's answer). – BeeOnRope Jun 02 '18 at 00:25
  • @BeeOnRope The type of the variable that is passed to the lambda (the `unsigned index` parameter) is unsigned, so at the point where it matters it is actually unsigned. – H. Guijt Jun 02 '18 at 07:31
  • @H.Guijt - you are right for the `noLambda == false` case; I was looking at the `noLambda == true` case. In the lambda case, using a signed `index` parameter helps generate a lot better code (but it still exhibits the `X1` problem the other case does). In that respect, this question has a bit of a split personality: the OP is asking about the code generation for two different pieces of code, and the solutions could be totally different. Your answer is great for the lambda case, and helps consolidate the two cases together since they both still exhibit the `X1` problem. – BeeOnRope Jun 02 '18 at 23:03
  • I don't think that this answer describes the core of the problem here. Yes, the code is better with an `int` index, but still, there is one extra instruction in it. In addition, I can get the same fast loop with a lambda and an `unsigned` index by using parameters instead of the computations, as shown here: https://godbolt.org/g/sZUjKm. It still uses `unsigned` but doesn't have any of the extra `lea` instructions. So it cannot be that the `lea` instructions are just because of the `unsigned` loop variable. If the `unsigned` was the real culprit, I shouldn't be able to get the good code. – gexicide Jun 03 '18 at 17:53
  • @gexicide: the code in that link is better, but it still pulls a zero-extension into the loop (`movl %ecx, %eax`). On an Intel CPU, it's 7 fused-domain uops when an optimal loop would be 4 (with no unrolling), and a loop you could expect gcc to come up with would be 5 (if it correctly avoids an indexed store when tuning for Sandybridge-family). Of course, unrolling would avoid the front-end bottleneck and let it run at 1 store per clock. – Peter Cordes Jun 05 '18 at 11:30
  • `-funroll-loops` is useful to see what gcc thinks is necessary per iteration, and what it just pulled into the loop. https://godbolt.org/g/CZRGyC. (But it makes dumb code, using `lea` to generate index+1, index+2, etc. and feed an indexed addressing mode, instead of using displacements in the addressing mode. I think gcc's addressing-mode cost model must be wrong, separately from the extra LEAs you describe in the question. It does the same thing with 64-bit `lea` with `unsigned long`, but that may still be to get wrap-around.) – Peter Cordes Jun 05 '18 at 11:32

Congratulations, you found a gcc bug. The main solution is to report it on GCC's bugzilla with the "missed-optimization" keyword. Your MCVEs are already great test cases for the bug, so it shouldn't take too long to write one up. Copy/paste the code and some description. A link to this Q&A and a link to the code on http://godbolt.org/ would be good, too.

Sometimes there are subtle microarchitectural reasons to use "extra" instructions, like xor-zeroing the destination of popcnt/lzcnt or bsf to avoid a false dependency on Intel CPUs, but that's not the case here. This is just bad: the `movl %ecx, %eax` inside the loop could be the result of using an unsigned type narrower than a pointer, but even that could be done more efficiently; it's a missed optimization, too.

I haven't looked at GCC's GIMPLE or RTL dumps (internal representations) to find out more details. The only use of the calculated values is inside the loop, so I can imagine that the compiler's internal representation of the program logic may have lost the difference between inside and outside the loop when transforming. Normally, things that don't need to be in the loop are hoisted or sunk out of the loop.
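(For anyone who wants to dig further: GCC can write out those intermediate representations with its -fdump options, e.g., with a hypothetical file name:)

g++ -O3 -std=c++14 -fdump-tree-all -fdump-rtl-all funnyEval.cpp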

But unfortunately it's not rare for gcc to leave an extra mov instruction inside a loop to set up for code outside the loop. Especially when it might take multiple instructions outside the loop to get the same effect. This is usually a bad tradeoff when optimizing for performance instead of code-size. I haven't looked at asm output from Profile-Guided Optimization as much as I'd like, to see code where gcc knows which loops are really hot and unrolls them. But most code gets built without PGO, unfortunately, so code-gen without -fprofile-use is still very important.


However, the core of this question is not how to get this particular example as fast as possible. Instead, I am rather astonished that the compiler can produce such deoptimizations in such a simple snippet of code. My main problem now is: I have somehow lost faith in my compiler, so I want to understand how this can happen so I can regain it.

Don't have faith in gcc! It's a very complex piece of machinery that often produces good results, but rarely produces optimal results.

This case is one of the most obvious and simple wrong choices I've seen its optimizer make (and is pretty disappointing), though. Usually the missed optimizations are somewhat more subtle (and hinge on microarchitectural details like addressing-mode choices and uops / execution ports), or at least not so obviously trivial-looking to avoid. (Here, the fix is trivial: hoist that one instruction without changing any register allocation for the whole loop.)

But many loops bottleneck on memory, not uop throughput. Modern CPUs are designed to chew through the wasted instructions that compilers, especially JIT compilers, generate. This is why missed optimizations like this usually don't have a big impact at the macro scale, and why cases where they do matter (e.g. video encoders or matrix multiplies) often still use blocks of hand-written asm.

Often it's possible to hand-hold gcc into making nice asm by implementing your source in a way that's structured like the asm you want (like in this case: What is the efficient way to count set bits at a position or lower?; and see Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? for a more general answer about helping the compiler vs. beating the compiler with hand-written asm).

But when your compiler is being braindead like this, there's nothing you can do. Well, except look for workarounds, or stuff like avoiding unsigned integers narrower than pointers, which other answers have pointed out as being important; one such restructuring of this loop is sketched below.
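A sketch of what that restructuring could look like for this loop (my own reconstruction, not tested against this particular gcc bug; it folds the offsets into the base pointers once, along the lines of M.M's `columnData += dictOffset` suggestion, and uses 64-bit signed bounds to sidestep the unsigned-wrap issue from the other answer):

#include <cstdint>

void funnyEvalWorkaround(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int64_t iter, int64_t limit, int32_t* writer) {
    // Compute both base pointers exactly once, before the loop.
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
    const uint16_t* p = reinterpret_cast<const uint16_t*>(columnData + dataOffset) + iter;
    const uint16_t* end = reinterpret_cast<const uint16_t*>(columnData + dataOffset) + limit;
    // Pointer-bump loop: nothing left for the compiler to re-derive from the function arguments.
    for (; p != end; ++p)
        *writer++ = dictPtr[*p];
}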


Interestingly, the worst case (2 extra LEA instructions in the loop, plus using extra loop counters) only happens with your if (noLambda).

If you make 2 separate versions of the function and remove the `if`, the nolambda version makes a nice clean loop (but misses auto-vectorization of the gather, which would be a win when compiled with -march=skylake).

I put your code on the Godbolt compiler explorer. (Also interesting, use -funroll-loops to see which parts are redone every unrolled loop iteration, and which are simply inside the loop once.)
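The split-out no-lambda version looks roughly like this (a reconstruction; the standalone function name is mine):

#include <cstdint>

void funnyEvalNoLambda(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int32_t iter, int32_t limit, int32_t* writer) {
    const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset); // X1
    const uint16_t* data = reinterpret_cast<const uint16_t*>(columnData + dataOffset);  // X2
    for (; iter != limit; ++iter)
        *writer++ = dictPtr[data[iter]];
}

It compiles to: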

# gcc7.2:  the nolambda side of the if, with no actual if()
.L3:
    movzwl  (%rsi,%rax,2), %ecx
    movl    (%rdx,%rcx,4), %ecx
    movl    %ecx, (%r9,%rax,4)   # indexed store: no port 7
    addq    $1, %rax             # gcc8 -O3 -march=skylake uses inc to save a code byte here.
    cmpq    %rax, %r8
    jne     .L3

On Intel Sandybridge-family, this decodes to 5 uops. (Macro-fusion of the cmp/jcc turns that pair into 1 uop. The other instructions are all single-uop; movzwl is a pure load and doesn't need an ALU port).

The store un-laminates on SnB/IvB (costing an extra uop for the 4-wide issue stage, one of the main front-end bottlenecks), but can stay fused on HSW/SKL. It can't use port 7, though (because it's indexed), which means HSW/SKL would be limited to 2 memory ops per clock, not 3.

Bottlenecks:

  • Front-end issue bandwidth of 4 fused-domain uops per clock. The loop is 5 uops, and can issue at nearly 1 iteration per 1.25 cycles. (Non-multiple-of-4 loops aren't perfect, but 5 uops is handled well, on Haswell/Skylake at least. Possibly not on Sandybridge.)

  • load / store execution ports: Haswell and later can run 2 loads + 1 store per clock, but only when the store avoids an indexed addressing mode so the store-address uop can run on port 7.


The lambda version gets a 2nd loop counter (pointer increment) and a silly `movl %ecx, %eax`, but the LEA instructions stay out of the loop.

But it's not the extra computation per se, it's the total uop throughput that's probably hurting your loop: if the dictionary mostly stays hot in cache, a Haswell or later CPU can sustain 2 loads + 1 store per clock, so every extra ALU uop in the loop body eats into the front-end issue bandwidth that keeps those memory ops fed.


I was going to write more, but I didn't finish. Posting now because the generic early/mid part is apparently what the question is really about. Don't blindly trust gcc.

And don't expect that it will make optimal code most of the time. You can often gain 10 or 20% just by tweaking the source (and sometimes much more). Sometimes gcc just doesn't seem to have a clue, like using extra LEAs for no apparent reason when unrolling, instead of using a displacement in an addressing mode. I think its addressing-mode cost model must not be accurate, at least for -march=haswell / -march=skylake.

Peter Cordes

I tried running your code and... surprise: the instructions executed when you are in the loop are not the ones you see in the compiler explorer link you posted. Check this out (I added a main function): https://godbolt.org/g/PPYtQa. The instructions executed while in the loop are 162-167, i.e.

.L15:
        movzwl  25(%rbx,%rdx), %ecx
        movl    5(%rbx,%rcx,4), %ecx
        movl    %ecx, 0(%rbp,%rdx,2)
        addq    $2, %rdx
        cmpq    $180, %rdx
        jne     .L15

You can double-check this by compiling on your machine

g++ test.cpp -std=c++1z -g -O3

and running with gdb

> gdb a.out
(gdb) break funnyEval
(gdb) layout split # shows source and assembly
(gdb) stepi # steps to the next instruction

The compiler generates a different, non-inlined version of funnyEval (which is the one you saw in the disassembled output), even though the one that is actually used is inlined. I have no idea (yet) why the two are different, but I guess that if you are hit by a performance penalty, you could fix it by making sure that funnyEval gets inlined: either by defining it in a header file or by compiling and linking with link-time optimization (-flto). I'll give it a try to see what happens when funnyEval is in another translation unit...
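For example (file names hypothetical; -flto has to be passed at both compile and link time):

g++ -O3 -flto -c funnyEval.cpp
g++ -O3 -flto -c main.cpp
g++ -O3 -flto funnyEval.o main.o -o a.out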

Paolo Crosetto
  • Your inlined copy of the loop has a known compile-time-constant trip count, and the compiler can more easily prove that unsigned wrap-around can't happen. Apparently [that's what was tripping gcc up](https://stackoverflow.com/questions/50570832/why-does-g-pull-computations-into-a-hot-loop/50650621#comment88276672_50570832), and @H. Guijt's answer says more about it. – Peter Cordes Jun 02 '18 at 00:21
  • Peter Cordes is right here: your solution only generates such good code due to the compile-time constants in the function into which it is inlined. In my program, the function is inlined into another one, but one without compile-time-constant trip counts. It therefore still yields the bad code. – gexicide Jun 03 '18 at 18:00