I have a very weird compiler behavior where G++ pulls computations into a hot loop, severly reducing the performance of the resulting code. What is going on here?
Consider this function:
#include <cstdint>
constexpr bool noLambda = true;
void funnyEval(const uint8_t* columnData, uint64_t dataOffset, uint64_t dictOffset, int32_t iter, int32_t limit, int32_t* writer,const int32_t* dictPtr2){
// Computation X1
const int32_t* dictPtr = reinterpret_cast<const int32_t*>(columnData + dictOffset);
// Computation X2
const uint16_t* data = (const uint16_t*)(columnData + dataOffset);
// 1. The less broken solution without lambda
if (noLambda) {
for (;iter != limit;++iter){
int32_t t=dictPtr[data[iter]];
*writer = t;
writer++;
}
}
// 2. The totally broken solution with lambda
else {
auto loop = [=](auto body) mutable { for (;iter != limit;++iter){ body(iter); } };
loop([=](unsigned index) mutable {
int32_t t=dictPtr[data[index]];
*writer = t;
writer++;
});
}
}
The problem here is that G++ somehow loves to pull computations X1
and X2
into the hot main loop, reducing the performance. Here are the details:
The function simply iterates over an array data
, looks up a value in a dictionary dictPtr
and writes it to a target memory location writer
.
data
and dictPtr
are computed at the beginning of the function. It has two flavors for doing so: one with a lambda, one without.
(Note that this function is just a minimal working example of much more complicated code. So please refrain from commenting that the lambda is unnecessary here. I am aware of this fact and in the original code it is necessary, unfortunately.)
The problem when compiling with the latest g++ (tried 8.1 and 7.2, same problem with older g++s as you can see in the godbolt links provided) with high optimization level (-O3 -std=c++14
) is the following:
Solution 2. (noLambda=false
) generates very bad code for the loop, even worse than the "naive" solution, because it assumes that it is a good idea to pull Computations X1 and X2, which are outside of the super hot main loop, into the super hot main loop, making it around 25% slower on my CPU.
.L3:
movl %ecx, %eax # unnecessary extra work
addl $1, %ecx
addq $4, %r9 # separate loop counter (pointer increment)
leaq (%rdi,%rax,2), %rax # array indexing with an LEA
movzwl (%rax,%rsi), %eax # rax+rsi is Computation X2, pulled into the loop!
leaq (%rdi,%rax,4), %rax # rax+rdx is Computation X1, pulled into the loop!
movl (%rax,%rdx), %eax
movl %eax, -4(%r9)
cmpl %ecx, %r8d
jne .L3
When using a usual for loop (noLambda=true
), then the code is better as X2 is no longer pulled into the loop, but X1 still is!:
.L3:
movzwl (%rsi,%rax,2), %ecx
leaq (%rdi,%rcx,4), %rcx
movl (%rcx,%rdx), %ecx # This is Computation X1, pulled into the loop!
movl %ecx, (%r9,%rax,4)
addq $1, %rax
cmpq %rax, %r8
jne .L3
You can try out that this is really X1 in the loop by replacing dictPtr
(the computation X1) in the loop with dictPtr2
(a parameter), the instruction will vanish:
.L3:
movzwl (%rdi,%rax,2), %ecx
movl (%r10,%rcx,4), %ecx
movl %ecx, (%r9,%rax,4)
addq $1, %rax
cmpq %rax, %rdx
jne .L3
This is finally the loop as I want to have it. A simple loop that loads the values and stores the result without pulling random computations into it.
So, what is going on here? It is seldom a good idea to pull a computation into a hot loop, but G++ seems to think so here. This is costing me real performance. The lambda exacerbates the whole situation; it leads G++ to pull even more computations into the loop.
What makes this problem so severe is that this is really trivial C++ code without fancy features. If I cannot rely on my compiler producing perfect code for such a trivial example, I will need to check the assembly of all hot loops in my code to make sure everything is as fast as it could be. This also means that there are probably a huge number of programs affected by this.