Doubts about assembly code translated by gcc for implementation of N factorial by C

Question

I am confued about the disassembly of factorial function.

C code

long factorial(int x)
{
    long result = 1;

    while (x > 1)
    {
        result = result * x;
        x= x - 1;
    }

    return result;
}

I use gcc command to disassembly factorial function gcc -S -O1 test.c

factorial:
.LFB0:
        cmpl    $1, %edi
        jle     .L2
        movslq  %edi, %rdx
        leaq    -1(%rdx), %rcx
        leal    -2(%rdi), %eax
        subq    %rax, %rcx
        movl    $1, %eax
.L3:
        imulq   %rdx, %rax
        subq    $1, %rdx
        cmpq    %rcx, %rdx
        jne     .L3
.L2:
        movl    $1, %eax
        rep ret

I don't understand what the below code for, anyone could help me?

movq    %rax, %rdx
leaq    -1(%rax), %rcx
leal    -2(%rdi), %esi
subq    %rsi, %rcx

FYI, the function is incorrect. It's multiplying by `x` an additional time. — Barmar, Jul 20 '21 at 05:59
Those are setting up and accessing the parameter, `x`, and the local variable, `result`. — Dave M., Jul 20 '21 at 06:00
Weird, looks like GCC7 (which matches your asm https://godbolt.org/z/jMhjsvfdM, later GCC omit the rep prefix) is calculating a loop end condition in RCX in a very convoluted way. Even at `-O2`, so it's not a result of enabling only partial optimization (-O1), although earlier GCC didn't do that. e.g. GCC 4.9 -O1 just compiles as written. https://godbolt.org/z/1PPndxo4v. You could report that as a missed-optimization bug on GCC's bugzilla — Peter Cordes, Jul 20 '21 at 06:26
`movq %rax, %rdx` is preparing a sign-extended copy of `x` as an input for the `result * x` expression which implicitly does `(long)x`. The other 3 instructions are GCC being silly and I think computing a constant `1` — Peter Cordes, Jul 20 '21 at 06:28
Thanks for you reminding, I've updated the comments. @4386427 — Terry Ho, Jul 20 '21 at 08:31

Peter Cordes · Accepted Answer · 2021-07-20T09:17:30.527

(An update to the question changed the C and asm, removing the movq %rax, %rdx that the question still asks about, but otherwise invalidating this first part of the answer. See the edit history or follow the Godbolt links in this answer to see what this section is referring to.)

movq %rax, %rdx is making a copy of the sign-extended x (32-bit int to 64-bit long), for use in the loop in the expression result * x expression which implicitly does (long)x. Notice that it avoids redoing that sign-extension every time through the loop the way the C abstract machine does. (Unlike GCC5 and earlier which compile more or less as written, with only normal transformations like do{}while loop structure.)

The fact that it starts out with 2 copies of the sign-extended x is because your C starts with result=x. That's a bug in your factorial implementation since you don't do x--, but the compiler is just implementing what you wrote. Actually using x-- makes other weird code (https://godbolt.org/z/345K6hbas) like leal -3(%rdi), %edi / addq $1, %rdi which only differs from lea -2(%rdi), %edi in case the LEA produces 0xFFFFFFFF (-1) and qword +1 carries into the high 32 bits. But that can't happen because an earlier cmp/jcc returns early for x-1 <= 1, so that rdi-3+1 is another missed optimization.

The other 3 instructions (lea/lea/sub) are GCC being silly and I think computing a constant 1 in a complicated way as a loop termination condition in RCX, to compare against RDX. This is a missed optimization bug that you can report on GCC's bugzilla since it still happens with current trunk nightly builds at -O2 (https://godbolt.org/z/achGeePYb).

I'm guessing that hoisting the sign-extension resulted created this logic too late for optimization passes to sort it back out into something sensible, or in a way that they can't / don't.

And BTW, this looks like GCC7 since that matches your asm https://godbolt.org/z/jMhjsvfdM. Later GCC omit the rep prefix (but otherwise makes the same mess), earlier GCC either make slightly different asm, or (gcc5 and earlier) fall right into the loop without doing so much first. But they do redo sign-extension of x every loop iteration (from 32-bit int to 64-bit long).

This happens even at -O2, so it's not a result of enabling only partial optimization (-O1). GCC8 and earlier auto-vectorize at -O3, but that's probably not profitable which is hopefully why GCC9 and later stop doing it. (x86 doesn't have SIMD qword multiply until AVX-512, -march=skylake-avx512, and synthesizing it out of multiple pmuludq operations is slow).

Thanks for your great explaination， I already understand what the leaq,leal and subq instructions do after carefully reading, but it's wired that compute a constant value 1 by this way. — Terry Ho, Jul 20 '21 at 08:58

Doubts about assembly code translated by gcc for implementation of N factorial by C

1 Answers1