(An update to the question changed the C and asm, removing the movq %rax, %rdx that the question still asks about and invalidating this first part of the answer. See the edit history or follow the Godbolt links in this answer to see what this section is referring to.)
movq %rax, %rdx is making a copy of the sign-extended x (32-bit int to 64-bit long), for use in the loop in the result * x expression, which implicitly does (long)x. Notice that it avoids redoing that sign-extension every time through the loop the way the C abstract machine does. (Unlike GCC5 and earlier, which compile more or less as written, with only normal transformations like the do{}while loop structure.)
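To make the hoisting concrete, here is a minimal sketch in C. It is not the code from the question; the function names are hypothetical, and it assumes a target where long is wider than int (such as x86-64 Linux). The first function widens x on every iteration, as the abstract machine describes; the second widens once and keeps the 64-bit copy, which is roughly what the movq copy enables.

```c
/* Illustrative only: hypothetical functions, not the question's code. */

/* As the C abstract machine evaluates it: x is converted to long
   on every iteration before the multiply. */
long product_widen_each_iter(int x)
{
    long result = 1;
    while (x > 1) {
        result = result * (long)x;  /* the implicit conversion, written out */
        x--;
    }
    return result;
}

/* Hoisted form: sign-extend once, then count down in the wide type. */
long product_widen_once(int x)
{
    long result = 1;
    long wide_x = x;                /* single int -> long conversion */
    while (wide_x > 1) {
        result = result * wide_x;
        wide_x--;
    }
    return result;
}
```

Both functions return the same value for every input; the only difference is how many times the int-to-long conversion appears inside the loop.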
It starts out with 2 copies of the sign-extended x because your C starts with result=x. That's a bug in your factorial implementation since you don't do x--, but the compiler is just implementing what you wrote. Actually using x-- makes other weird code (https://godbolt.org/z/345K6hbas) like leal -3(%rdi), %edi / addq $1, %rdi, which only differs from lea -2(%rdi), %edi in case the LEA produces 0xFFFFFFFF (-1) and the qword +1 carries into the high 32 bits. But that can't happen because an earlier cmp/jcc returns early for x-1 <= 1, so that rdi-3+1 is another missed optimization.
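The claim about the two instruction sequences can be checked with a small C model. This is only a sketch of the register arithmetic, with hypothetical function names: a 32-bit LEA result is zero-extended into the 64-bit register, and the following add is 64 bits wide.

```c
#include <stdint.h>

/* Models `leal -3(%rdi), %edi` followed by `addq $1, %rdi`:
   a 32-bit subtract (zero-extended to 64 bits), then a 64-bit add. */
uint64_t lea3_then_addq(uint32_t x)
{
    uint64_t rdi = (uint32_t)(x - 3u);  /* writing EDI zero-extends into RDI */
    return rdi + 1;                     /* qword add can carry into bit 32 */
}

/* Models the single instruction `leal -2(%rdi), %edi`. */
uint64_t lea2(uint32_t x)
{
    return (uint32_t)(x - 2u);          /* result wraps at 32 bits */
}
```

The two agree for every 32-bit input except x == 2, where x-3 wraps to 0xFFFFFFFF and the 64-bit add produces 0x100000000 instead of 0. That is exactly the input the earlier cmp/jcc already excludes.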
The other 3 instructions (lea/lea/sub) are GCC being silly: I think they compute a constant 1 in a complicated way as a loop termination condition in RCX, to compare against RDX. This is a missed-optimization bug that you can report on GCC's bugzilla, since it still happens with current trunk nightly builds at -O2 (https://godbolt.org/z/achGeePYb).
I'm guessing that hoisting the sign-extension created this logic too late for optimization passes to sort it back out into something sensible, or in a way that they can't / don't.
And BTW, this looks like GCC7 since that matches your asm (https://godbolt.org/z/jMhjsvfdM). Later GCC versions omit the rep prefix (but otherwise make the same mess); earlier GCC versions either make slightly different asm, or (GCC5 and earlier) fall right into the loop without doing so much first. But they do redo sign-extension of x every loop iteration (from 32-bit int to 64-bit long).
This happens even at -O2, so it's not a result of enabling only partial optimization (-O1). GCC8 and earlier auto-vectorize at -O3, but that's probably not profitable, which is hopefully why GCC9 and later stop doing it. (x86 doesn't have SIMD qword multiply until AVX-512, -march=skylake-avx512, and synthesizing it out of multiple pmuludq operations is slow.)
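To show why the synthesized version is expensive, here is a scalar C sketch of the usual decomposition. The function names are hypothetical, and this models one element rather than showing actual compiler output. pmuludq provides a 32x32 => 64-bit unsigned multiply, so the low 64 bits of a 64x64 product need three of those plus shifts and adds.

```c
#include <stdint.h>

/* Stand-in for one lane of pmuludq: 32x32 -> 64-bit unsigned multiply. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Low 64 bits of a * b, built only from 32x32 -> 64 multiplies.
   The a_hi * b_hi term would land entirely above bit 63, so it is dropped. */
uint64_t mul64_from_32bit_pieces(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t cross = mul32x32(a_lo, b_hi) + mul32x32(a_hi, b_lo);
    return mul32x32(a_lo, b_lo) + (cross << 32);  /* wraps mod 2^64 */
}
```

In the SIMD version each of those steps becomes a vector instruction (three multiplies, plus shuffles or shifts and adds), which is a lot of work per pair of qword elements compared with one scalar imul each.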