The first version below optimises by hand: it copies a member value from memory into a local variable before the loop and writes it back afterwards. The second version works on the member directly.
I was expecting the compiler to apply the same localValue optimisation to the second version anyway, rather than reading and writing the value in memory on every iteration of the loop. Why doesn't it?

    class Example
    {
    public:
        void processSamples(float * x, int num)
        {
            float localValue = v1;
            for (int i = 0; i < num; ++i)
            {
                x[i] = x[i] + localValue;
                localValue = 0.5 * x[i];
            }
            v1 = localValue;
        }

        void processSamples2(float * x, int num)
        {
            for (int i = 0; i < num; ++i)
            {
                x[i] = x[i] + v1;
                v1 = 0.5 * x[i];
            }
        }

        float v1;
    };

processSamples assembles to code like this:

    .L4:
        addss   xmm0, DWORD PTR [rax]
        movss   DWORD PTR [rax], xmm0
        mulss   xmm0, xmm1
        add     rax, 4
        cmp     rax, rcx
        jne     .L4

processSamples2 assembles to this:

    .L5:
        movss   xmm0, DWORD PTR [rax]
        addss   xmm0, DWORD PTR example[rip]
        movss   DWORD PTR [rax], xmm0
        mulss   xmm0, xmm1
        movss   DWORD PTR example[rip], xmm0
        add     rax, 4
        cmp     rax, rdx
        jne     .L5

The compiler doesn't have to worry about other threads here, since v1 isn't atomic. So can't it just assume nothing else will be looking at this value and keep it in a register while the loop is spinning?
See https://godbolt.org/g/RiF3B4 for the full assembly and a selection of compilers to choose from!
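
For anyone who wants to experiment locally, here is a minimal sketch that runs both versions on the same input and checks that they produce the same buffer and the same final v1. The helper versionsAgree is hypothetical (it is not part of the real code), v1 is given a default initialiser only for the sketch, and the helper assumes the sample buffer is separate storage from the Example object.

```cpp
#include <cassert>  // for the assert in the usage line below
#include <vector>

// Same class as in the question, repeated so the sketch compiles on its own.
class Example
{
public:
    void processSamples(float * x, int num)
    {
        float localValue = v1;
        for (int i = 0; i < num; ++i)
        {
            x[i] = x[i] + localValue;
            localValue = 0.5 * x[i];
        }
        v1 = localValue;
    }

    void processSamples2(float * x, int num)
    {
        for (int i = 0; i < num; ++i)
        {
            x[i] = x[i] + v1;
            v1 = 0.5 * x[i];
        }
    }

    float v1 = 0.0f;  // default initialiser added for the sketch only
};

// Hypothetical helper: runs each version on its own copy of the input and
// reports whether the processed buffers and the final v1 values match.
// Assumption: the buffers live in separate storage from the Example objects.
bool versionsAgree(std::vector<float> input, float initialV1)
{
    std::vector<float> a = input;
    std::vector<float> b = input;

    Example ea;
    ea.v1 = initialV1;
    Example eb;
    eb.v1 = initialV1;

    ea.processSamples(a.data(), static_cast<int>(a.size()));
    eb.processSamples2(b.data(), static_cast<int>(b.size()));

    return a == b && ea.v1 == eb.v1;
}
```

A call such as `assert(versionsAgree({1.0f, 2.0f, 3.0f, 4.0f}, 0.0f));` from main is enough to exercise both functions on identical data.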