You forgot to enable optimization, so bad code is to be expected. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
gcc -O3 -m32 does use mov-immediate for this. (Use volatile so the array doesn't optimize away, of course, or pass a pointer to it to a non-inline function.)
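For reference, here's a minimal source sketch that produces stores like the ones below. The original source isn't shown here, so this is an assumption: the asm writes nine dwords of 97 ('a') plus a trailing 0, so a 10-element int array fits; the names f, g, use and arr are made up.

/* Hypothetical reconstruction: nine 4-byte 97s and a trailing zero.
 * volatile keeps gcc from deleting the otherwise-dead array. */
void f(void) {
    volatile int arr[10] = {97, 97, 97, 97, 97, 97, 97, 97, 97};  /* arr[9] is zero-initialized */
}

/* Alternative to volatile: let the array escape to a function the
 * compiler can't see into (use() is a hypothetical extern function). */
void use(int *p);
void g(void) {
    int arr[10] = {97, 97, 97, 97, 97, 97, 97, 97, 97};
    use(arr);   /* forces the initialization stores to actually happen */
}

Compiled, that gives: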
# gcc9.3 -m32 -O3
f():
sub esp, 48
mov DWORD PTR [esp+8], 97     # nine dword stores of immediate 97 ('a')
mov DWORD PTR [esp+12], 97
mov DWORD PTR [esp+16], 97
mov DWORD PTR [esp+20], 97
mov DWORD PTR [esp+24], 97
mov DWORD PTR [esp+28], 97
mov DWORD PTR [esp+32], 97
mov DWORD PTR [esp+36], 97
mov DWORD PTR [esp+40], 97
mov DWORD PTR [esp+44], 0     # the trailing zero element
add esp, 48
ret
64-bit code copies in 16-byte chunks. (Unfortunately not using a broadcast load even if SSE3 or AVX is available.) https://godbolt.org/z/AsdbWU

Those 16-byte SIMD stores are pretty clearly worth it, although movabs with a 64-bit immediate and four qword stores + 1 dword store would not have been terrible either.
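As a sketch of that movabs-style alternative in C terms (the helper below is hypothetical, and a compiler is free to turn it back into SIMD stores; it's only here to show the store pattern and the packed constant, which is exactly the value gcc puts in .LC0 below):

#include <stdint.h>
#include <string.h>

/* Hypothetical illustration of the "movabs + qword stores" idea:
 * pack two 97s into one 64-bit value and store it four times,
 * then handle the last two dwords (97 and the trailing 0). */
static void fill_with_qword_stores(int32_t arr[10]) {
    uint64_t two_97s = 97u | (97ull << 32);   /* 0x0000006100000061 == 416611827809 */
    memcpy(&arr[0], &two_97s, 8);             /* one 8-byte store covers two elements */
    memcpy(&arr[2], &two_97s, 8);
    memcpy(&arr[4], &two_97s, 8);
    memcpy(&arr[6], &two_97s, 8);
    arr[8] = 97;
    arr[9] = 0;
}

For comparison, gcc's actual output with -march=skylake: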
# gcc9.3 -O3 -march=skylake
# with the default tuning / arch options, same code but without "v"
f():
vmovdqa xmm0, XMMWORD PTR .LC0[rip] # should have used vpbroadcastd
vmovaps XMMWORD PTR [rsp-56], xmm0 # it chooses two 16-byte stores
vmovaps XMMWORD PTR [rsp-40], xmm0 # maybe to avoid a vzeroupper or alignment isn't known
mov QWORD PTR [rsp-24], 97 # scalar qword mov-immediate covers the last two dwords (97, then the trailing 0)
ret
.LC0:
.quad 416611827809         # 0x0000006100000061: two dwords of 97 ('a')
.quad 416611827809