How do I translate an optimized x86-64 asm loop back to a C for loop?

Question

I have the following:

foo:
   movl $0, %eax                      //result = 0
   cmpq %rsi, %rdi                    // rdi = x, rsi = y?
   jle .L2

.L3:
   addq %rdi, %rax                    //result = result + i?
   subq $1, %rdi                      //decrement?
   cmp %rdi, rsi
   jl .L3

.L2
   rep
   ret

And I'm trying to translate it to:

long foo(long x, long y)
{
    long i, result = 0;
    for (i=     ;               ;         ){

      //??

   }

 return result;
}

I don't know what cmpq %rsi, %rdi mean. Why isn't there another &eax for long i?

I would love some help in figuring this out. I don't know what I'm missing - I been going through my notes, textbook, and rest of the internet and I am stuck. It's a review question, and I've been at it for hours.

Your first task is figuring out the calling convention - what is `x`, what is `y`. — Antti Haapala -- Слава Україні, Apr 30 '19 at 19:10
Answer: `for (i = 0xDEADBEEF; x > y; x --) { result += x; }` — Antti Haapala -- Слава Україні, Apr 30 '19 at 19:18
to show that the question is stupid. There is colon missing from after the label and basically `rsi` would be a variable name... — Antti Haapala -- Слава Україні, Apr 30 '19 at 19:55

Brian Walker · Answer 1 · 2019-04-30T19:16:11.103

3

Assuming this is a function taking 2 parameters. Assuming this is using the gcc amd64 calling convention, it will pass the two parameters in rdi and rsi. In your C function you call these x and y.

long foo(long x /*rdi*/, long y /*rsi*/)
{
    //movl $0, %eax
    long result = 0;  /* rax */

    //cmpq %rsi, %rdi
    //jle .L2
    if (x > y) {
        do {
            //addq %rdi, %rax
            result += x;

            //subq $1, %rdi
            --x;

            //cmp %rdi, rsi
            //jl .L3            
        } while (x > y);
    }

    return result;
}

edited Apr 30 '19 at 19:16

answered Apr 30 '19 at 19:11

Brian Walker

8,658
2
33
35

Would it be possible to incorporate `long i` and keep everything in the 'for' loop? – Soph Kwon Apr 30 '19 at 19:55
@Ian: The point is that your assembly code doesn't translate to a for loop using `long i`. It translates to a do-while loop. – Mark Benningfield Apr 30 '19 at 20:02
@MarkBenningfield: But that *is* how a C `for()` loop compiles to asm, as an idiomatic loop with a conditional branch at the *bottom*. [Why are loops always compiled into "do...while" style (tail jump)?](//stackoverflow.com/q/47783926). If the compiler can't prove that the loop condition is true the first time, you get a cmp/jcc to skip over the loop entirely if it needs to run zero times. This could well be `gcc -O1` output (except for the missing `%` on `rsi`, and the missing `:` on `.L2`) Gcc doesn't look for the xor-zeroing peephole optimization at `-O1`, but will do normal loop stuff. – Peter Cordes Apr 30 '19 at 20:49

score 2 · Answer 2 · answered Apr 30 '19 at 22:47

I don't know what cmpq %rsi, %rdi mean

That's AT&T syntax for cmp rdi, rsi. https://www.felixcloutier.com/x86/CMP.html

You can look up the details of what a single instruction does in an ISA manual.

More importantly, cmp/jcc like cmp %rsi,%rdi/jl is like jump if rdi<rsi. Assembly - JG/JNLE/JL/JNGE after CMP. If you go through all the details of how cmp sets flags, and which flags each jcc condition checks, you can verify that it's correct, but it's much easier to just use the semantic meaning of JL = Jump on Less-than (assuming flags were set by a cmp) to remember what they do.

(It's reversed because of AT&T syntax; jcc predicates have the right semantic meaning for Intel syntax. This is one of the major reasons I usually prefer Intel syntax, but you can get used to AT&T syntax.)

From the use of rdi and rsi as inputs (reading them without / before writing them), they're the arg-passing registers. So this is the x86-64 System V calling convention, where integer args are passed in RDI, RSI, RDX, RCX, R8, R9, then on the stack. (What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 covers function calls as well as system calls). The other major x86-64 calling convention is Windows x64, which passes the first 2 args in RCX and RDX (if they're both integer types).

So yes, x=RDI and y=RSI. And yes, result=RAX. (writing to EAX zero-extends into RAX).

From the code structure (not storing/reloading every C variable to memory between statements), it's compiled with some level of optimization enabled, so the for() loop turned into a normal asm loop with the conditional branch at the bottom. Why are loops always compiled into "do...while" style (tail jump)? (@BrianWalker's answer shows the asm loop transliterated back to C, with no attempt to form it back into an idiomatic for loop.)

From the cmp/jcc ahead of the loop, we can tell that the compiler can't prove the loop runs a non-zero number of iterations. So whatever the for() loop condition is, it might be false the first time. (That's unsurprising given signed integers.)

Since we don't see a separate register being used for i, we can conclude that optimization reused another var's register for i. Like probably for(i=x;, and then with the original value of x being unused for the rest of the function, it's "dead" and the compiler can just use RDI as i, destroying the original value of x.

I guessed i=x instead of y because RDI is the arg register that's modified inside the loop. We expect that the C source modifies i and result inside the loop, and presumably doesn't modify it's input variables x and y. It would make no sense to do i=y and then do stuff like x--, although that would be another valid way of decompiling.

cmp %rdi, %rsi / jl .L3 means the loop condition to (re)enter the loop is rsi-rdi < 0 (signed), or i<y.

The cmp/jcc before the loop is checking the opposite condition; notice that the operands are reversed and it's checking jle, i.e. jng. So that makes sense, it really is same loop condition peeled out of the loop and implemented differently. Thus it's compatible with the C source being a plain for() loop with one condition.

sub $1, %rdi is obviously i-- or --i. We can do that inside the for(), or at the bottom of the loop body. The simplest and most idiomatic place to put it is in the 3rd section of the for(;;) statement.

addq %rdi, %rax is obviously adding i to result. We already know what RDI and RAX are in this function.

Putting the pieces together, we arrive at:

long foo(long x, long y)
{
    long i, result = 0;
    for (i= x    ;    i>y    ;    i-- ){
        result += i;
    }

    return result;
}

Which compiler made this code?

From the .L3: label names, this looks like output from gcc. (Which somehow got corrupted, removing the : from .L2, and more importantly removing the % from %rsi in one cmp. Make sure you copy/paste code into SO questions to avoid this.)

So it may be possible with the right gcc version/options to get exactly this asm back out for some C input. It's probably gcc -O1, because movl $0, %eax rules out -O2 and higher (where GCC would look for the xor %eax,%eax peephole optimization for zeroing a register efficiently). But it's not -O0 because that would be storing/reloading the loop counter to memory. And -Og (optimize a bit, for debugging) likes to use a jmp to the loop condition instead of a separate cmp/jcc to skip the loop. This level of detail is basically irrelevant for simply decompiling to C that does the same thing.

The rep ret is another sign of gcc; gcc7 and earlier used this in their default tune=generic output for ret that's reached as a branch target or a fall-through from a jcc, because of AMD K8/K10 branch prediction. What does `rep ret` mean?

gcc8 and later will still use it with -mtune=k8 or -mtune=barcelona. But we can rule that out because that tuning option would use dec %rdi instead of subq $1, %rdi. (Only a few modern CPUs have any problems with inc/dec leaving CF unmodified, for register operands. INC instruction vs ADD 1: Does it matter?)

gcc4.8 and later put rep ret on the same line. gcc4.7 and earlier print it as you've shown, with the rep prefix on the line before.

gcc4.7 and later like to put the initial branch before the mov $0, %eax, which looks like a missed optimization. It means they need a separate return 0 path out of the function, which contains another mov $0, %eax.

gcc4.6.4 -O1 reproduces your output exactly, for the source shown above, on the Godbolt compiler explorer

# compiled with gcc4.6.4 -O1 -fverbose-asm
foo:
        movl    $0, %eax        #, result
        cmpq    %rsi, %rdi      # y, x
        jle     .L2       #,
.L3:
        addq    %rdi, %rax      # i, result
        subq    $1, %rdi        #, i
        cmpq    %rdi, %rsi      # i, y
        jl      .L3 #,
.L2:
        rep
        ret

So does this other version which uses i=y. Of course there are many things we could add that would optimize away, like maybe i=y+1 and then having a loop condition like x>--i. (Signed overflow is undefined behaviour in C, so the compiler can assume it doesn't happen.)

// also the same asm output, using i=y but modifying x in the loop.
long foo2(long x, long y) {
  long i, result = 0;
  for (i= y    ;    x>i    ;    x-- ){
      result += x;
   }
   return result;
}

In practice the way I actually reversed this:

I copy/pasted the C template into Godbolt (https://godbolt.org/). I could see right away (from the mov $0 instead of xor-zero, and from the label names) that it looked like gcc -O1 output, so I put in that command line option and picked an old-ish version of gcc like gcc6. (Turns out this asm was actually from a much older gcc).
I tried an initial guess like x<y based on the cmp/jcc, and i++ (before I'd actually read the rest of the asm carefully at all), because for loops often use i++. The trivial-looking infinite-loop asm output showed me that was obviously wrong :P
I guessed that i=x, but after taking a wrong turn with a version that did result += x but i--, I realized that i was a distraction and at first simplified by not using i at all. I just used x-- while first reversing it because obviously RDI=x. (I know the x86-64 System V calling convention well enough to see that instantly.)
After looking at the loop body, the result += x and x-- were totally obvious from the add and sub instructions.
cmp/jl was obviously a something < something loop condition involving the 2 input vars.
I wasn't sure I if it was x<y or y<x, and newer gcc versions were using jne as the loop condition. I think at that point I cheated and looked at Brian's answer to check it really was x > y, instead of taking a minute to work through the actual logic. But once I had figured out it was x--, only x>y made sense. The other one would be true until wraparound if it entered the loop at all, but signed overflow is undefined behaviour in C.
Then I looked at some older gcc versions to see if any made asm more like in the question.
Then I went back and replaced x with i inside the loop.

If this seems kind of haphazard and slapdash, that's because this loop is so tiny that I didn't expect to have any trouble figuring it out, and I was more interested in finding source + gcc version that exactly reproduced it, rather than the original problem of just reversing it at all.

(I'm not saying beginners should find it that easy, I'm just documenting my thought process in case anyone's curious.)

How do I translate an optimized x86-64 asm loop back to a C for loop?

2 Answers2

Which compiler made this code?

In practice the way I actually reversed this: