For example,

void f()
{
   wchar_t s[]=L"aaaaaaaaa";
}

is compiled into something like

.section .rdata
LC0:.ascii "a\0a\0a\0a\0a\0a\0\0"
.section .text
movl LC0,%eax
movl %eax,0x888(%esp)
...

Is it possible to avoid the dependence on another section, for example by storing immediates directly, such as:

movl $0x00610061,0x888(%esp);movl $0x00610061,0x88c(%esp);...

elflyao

1 Answer

You forgot to enable optimization, so bad code is to be expected. See "Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?"

gcc -O3 -m32 does use mov-immediate for this.

(Use volatile so the array isn't optimized away, of course, or pass a pointer to it to a non-inline function.)
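A minimal sketch of those two ways to keep the array alive at -O3. The names `use_array`, `f_volatile`, and `f_escape` are hypothetical, not from the question, and `__attribute__((noinline))` is a GCC/Clang extension. In a real test, defining the callee in a separate translation unit is the more robust way to stop the compiler from seeing through the call.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not from the original post). noinline is a GCC/Clang
   attribute; it keeps the call from being inlined into the caller. */
__attribute__((noinline))
size_t use_array(const wchar_t *p)
{
    assert(p != NULL);
    return wcslen(p);
}

/* Option 1: volatile means the stores that initialize the array
   cannot be dropped, even though nothing reads it afterwards. */
void f_volatile(void)
{
    volatile wchar_t s[] = L"aaaaaaaaa";
    (void)s;
}

/* Option 2: the array's address escapes to a non-inline function,
   so the array has to exist in memory before the call. */
size_t f_escape(void)
{
    wchar_t s[] = L"aaaaaaaaa";
    return use_array(s);
}
```

Compiling either function with `gcc -O3 -m32 -S` should let you inspect how the initializer is materialized.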

# gcc9.3 -m32 -O3
f():
        sub     esp, 48
        mov     DWORD PTR [esp+8], 97
        mov     DWORD PTR [esp+12], 97
        mov     DWORD PTR [esp+16], 97
        mov     DWORD PTR [esp+20], 97
        mov     DWORD PTR [esp+24], 97
        mov     DWORD PTR [esp+28], 97
        mov     DWORD PTR [esp+32], 97
        mov     DWORD PTR [esp+36], 97
        mov     DWORD PTR [esp+40], 97
        mov     DWORD PTR [esp+44], 0
        add     esp, 48
        ret

64-bit code copies in 16-byte chunks (unfortunately without using a broadcast load, even if SSE3 or AVX is available): https://godbolt.org/z/AsdbWU

That's pretty clearly worth it, although movabs with a 64-bit immediate and four qword stores + 1 dword would not have been terrible.

# gcc9.3 -O3 -march=skylake
# with the default tuning / arch options, same code but without "v"
f():
        vmovdqa xmm0, XMMWORD PTR .LC0[rip]      # should have used vpbroadcastd
        vmovaps XMMWORD PTR [rsp-56], xmm0         # it chooses two 16-byte stores
        vmovaps XMMWORD PTR [rsp-40], xmm0         # maybe to avoid a vzeroupper or alignment isn't known
        mov     QWORD PTR [rsp-24], 97           # scalar mov-immediate for the last one
        ret
.LC0:
        .quad   416611827809
        .quad   416611827809
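The decimal constant in .LC0 is easier to recognize in hex: it is two 4-byte `L'a'` code units (97 = 0x61) packed into one little-endian qword. A small sketch of that arithmetic follows, along with the 2-byte `wchar_t` case from the question. The helper names `pack2x32`, `pack2x16`, and `check_listing_constants` are hypothetical and only illustrate the packing.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 32-bit code units into one qword; on a little-endian target
   the first array element lands in the low half. */
uint64_t pack2x32(uint32_t first, uint32_t second)
{
    return (uint64_t)first | ((uint64_t)second << 32);
}

/* Same idea for two 16-bit code units in one dword (2-byte wchar_t). */
uint32_t pack2x16(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}

/* Checks the values that appear in the listings. */
void check_listing_constants(void)
{
    assert(pack2x32(97, 97) == 416611827809ULL);  /* each .quad in .LC0 */
    assert(pack2x32(97, 0) == 97);                /* last 'a' plus terminator */
    assert(pack2x16(0x61, 0x61) == 0x00610061u);  /* immediate in the question */
}
```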
Peter Cordes
  • Seems like OP's `wchar_t` is 2 bytes, which would make the code more similar to [this](https://godbolt.org/z/igvGjh) (but yeah, basically same story). – Marco Bonelli Apr 24 '20 at 04:48
  • @MarcoBonelli: oh right, on Windows `wchar_t` is 2 bytes. But yeah, same story, fortunately the answer doesn't hinge on that fact. – Peter Cordes Apr 24 '20 at 04:52