I have the following piece of code:
std::vector<double> a(1000000), b(1000000);
// fill a with random doubles
for (int i = 0; i < b.size(); ++i) {
    b[i] = a[i]*a[i] + a[i];
}
std::cout << b[0] << std::endl; // make sure the compiler doesn't optimize the loop away
The generated assembly for the for loop (g++ -O3):
movupd xmm1, XMMWORD PTR[rbp + 0 + rax]
movapd xmm0, xmm1
mulpd xmm0, xmm1
addpd xmm0, xmm1
movups XMMWORD PTR[r12 + rax], xmm0
add rax, 16
cmp rax, 8000000
jne .L17
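To check that I am reading the assembly correctly, here is a rough scalar model of it, one statement per instruction. This is only a sketch of my understanding, not what the compiler actually emits: the function name square_plus_self_model is made up for illustration, and it assumes an 8-byte double and an even number of elements (which holds for 1000000), since each iteration handles one 16-byte chunk.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original code): a scalar model of the
// vectorized loop. Assumes sizeof(double) == 8 and an even element count.
void square_plus_self_model(const std::vector<double>& a, std::vector<double>& b) {
    const char* src = reinterpret_cast<const char*>(a.data());  // plays the role of rbp
    char* dst = reinterpret_cast<char*>(b.data());               // plays the role of r12
    const std::size_t total_bytes = a.size() * sizeof(double);   // 8000000 in the question

    // off plays the role of rax: a byte offset advanced by 16 per iteration.
    for (std::size_t off = 0; off != total_bytes; off += 16) {
        double x[2];
        std::memcpy(x, src + off, 16);  // movupd xmm1, [rbp + rax]: load two doubles
        double r[2] = {x[0], x[1]};     // movapd xmm0, xmm1: copy the register
        r[0] *= x[0]; r[1] *= x[1];     // mulpd  xmm0, xmm1: a[i] * a[i]
        r[0] += x[0]; r[1] += x[1];     // addpd  xmm0, xmm1: ... + a[i]
        std::memcpy(dst + off, r, 16);  // movups [r12 + rax], xmm0: store 16 bytes
    }
}
```

In this model the final store is just a 16-byte copy, which is the step my question is about.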
I can comprehend the assembly up to the following line:
movups XMMWORD PTR[r12 + rax], xmm0
Here the result of a[i] * a[i] + a[i] is written back to memory (in a vectorized fashion). But why is the movups (single precision) instruction used rather than movupd (double precision)?