Trying to translate a C function to x86_64 AT&T assembly

Question

I've been trying to translate this function to assembly:

void foo (int a[], int n) {
  int i;
  int s = 0;
  for (i=0; i<n; i++) {
    s += a[i];
    if (a[i] == 0) {
      a[i] = s;
      s = 0;
    }
  }
}

But something is going wrong.

That's what I've done so far:

.section .text
.globl foo
foo:
.L1:
    pushq %rbp 
    movq %rsp, %rbp 
    subq $16, %rsp 

    movl $0, -16(%rbp) /*s*/ 
    movl $0, -8(%rbp) /*i*/

    jmp .L2

.L2:
    cmpl -8(%rbp), %esi 
    jle .L4 

    leave
    ret

.L3:
    addl $1, -8(%rbp) 
    jmp .L2

.L4:
    
    movl -8(%rbp), %eax 
    imull $4, %eax 
    movslq %eax, %rax 
    
    addq %rdi, %rax 
    
    movl (%rax), %eax 
    addl %eax, -16(%rbp) 

    cmpl $0, %eax
    jne .L3 

    /*      if      */
    leaq (%rax), %rdx 
    
    movl -16(%rbp), %eax 
    movl %eax, (%rdx) 
    movl $0, -16(%rbp) 
    jmp .L3

I am compiling the .s module with a .c module, for example, with an int nums [5] = {65, 23, 11, 0, 34} and I'm getting back the same array instead of {65, 23, 11 , 99, 34}.

Could someone help me?

At the `if` you already destroyed `rax` by loading the `a[i]` into `eax` which is even known to be `0` if you get to that point. I am surprised you even get anything back instead of a crash. EDIT: you don't even get that far because you got the condition at `.L2` reversed. PS: learn to use a debugger. — Jester, Oct 08 '20 at 00:50
@Lucy: Are you aware that `leaq (%rax), %rdx` is just an inefficient way of doing `mov %rax, %rdx`? I have a feeling you are expecting it to do something else, but I'm not sure what that is. — Nate Eldredge, Oct 08 '20 at 01:01
Moving that `lea` up to before the `movl (%rax), %eax` and reversing the `jle` to `jg` fixes the code. — Jester, Oct 08 '20 at 01:10

Brett Hale · Answer 1 · 2020-10-08T09:03:18.927

Presumably you have a compiler that can generate AT&T syntax. It might be more instructive to look at what assembly output the compiler generates. Here's my re-formulation of your demo:

#include <stdio.h>

void foo (int a[], int n)
{
    for (int s = 0, i = 0; i < n; i++)
    {
        if (a[i] != 0)
            s += a[i];
        else
            a[i] = s, s = 0;
    }
}

int main (void)
{
    int nums[] = {65, 23, 11, 0, 34};
    int size = sizeof(nums) / sizeof(int);

    foo(nums, size);
    for (int i = 0; i < size; i++)
        fprintf(stdout, i < (size - 1) ? "%d, " : "%d\n", nums[i]);

    return (0);
}

Compiling without optimizations enabled is typically harder to work through than optimized code, since it loads from and spills results to memory. You won't learn much from it if you're investing time in learning how to write efficient assembly.

Compiling with the Godbolt compiler explorer with -O2 optimizations yields much more efficient code; it's also useful for cutting out unnecessary directives, labels, etc., that would be visual noise in this case.

In my experience, using -O2 optimizations are clever enough to make you rethink your use of registers, refactoring, etc. -O3 can sometimes optimize too agressively - unrolling loops, vectorizing, etc., to easily follow.

Finally, for the case you have presented, there's a perfect compromise: -Os, which enables many of the optimizations of -O2, but not at the expense of increased code size. I'll paste the assembly here just for comparative purposes:

foo:
        xorl    %eax, %eax
        xorl    %ecx, %ecx
.L2:
        cmpl    %eax, %esi
        jle     .L7
        movl    (%rdi,%rax,4), %edx
        testl   %edx, %edx
        je      .L3
        addl    %ecx, %edx
        jmp     .L4
.L3:
        movl    %ecx, (%rdi,%rax,4)
.L4:
        incq    %rax
        movl    %edx, %ecx
        jmp     .L2
.L7:
        ret

Remember that the calling convention passes the pointer to (a) in %rdi, and the 'count' (n) in %rsi. These are the calling conventions being used. Notice that your code does not 'dereference' or 'index' any elements through %rdi. It's definitely worth going stepping through the code - even with pen and paper if it helps - to understand the branch conditions and how reading and writing is performed on element a[i].

Curiously, using the inner loop of your code:

s += a[i];
if (a[i] == 0)
    a[i] = s, s = 0;

Appears to generate more efficient code with -Os than the inner loop I used:

foo:
        xorl    %eax, %eax
        xorl    %edx, %edx
.L2:
        cmpl    %eax, %esi
        jle     .L6
        movl    (%rdi,%rax,4), %ecx
        addl    %ecx, %edx
        testl   %ecx, %ecx
        jne     .L3
        movl    %edx, (%rdi,%rax,4)
        xorl    %edx, %edx
.L3:
        incq    %rax
        jmp     .L2
.L6:
        ret

A reminder for me to keep things simple!

One downside to `-Os` is the unfortunate loop structure, with cmp/jcc at the top and a jmp at the bottom. It unfortunately doesn't just the jmp *outside* the loop (jumping to a cmp/jcc at the bottom), to run once instead of every iteration. [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926). That would be neutral for code size; seems a silly choice. — Peter Cordes, Oct 08 '20 at 09:06
It can be interesting to compare `-O3 -fno-tree-vectorize -fno-unroll-loops` against `-O2`; GCC is only aggressive about if-conversion to cmov at `-O3`. [gcc optimization flag -O3 makes code slower than -O2](https://stackoverflow.com/q/28875325) and also [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) GCC doesn't unroll by default at -O3 (except as part of auto-vectorizing), but clang does; fortunately the same options work well for both. (Older clang needs `-fno-vectorize`, not `...-tree-vectorize`; newer clang has GCC compat option support) — Peter Cordes, Oct 08 '20 at 09:24
@PeterCordes - yes, I might have over-sold the `-Os` option here. The `-O2` code logic might be a little easier to follow. The `-O3` assembly appears identical. As for clang, I hate its excessive default unrolling and vectorization. It's heuristics seem a bit 'off' in many cases. — Brett Hale, Oct 08 '20 at 09:30
Often `-O1` or even `-Og` is good for a slightly more faithful to the source version, but also disables some peephole optimizations. `-Os` is worth considering for beginners, although with its current choices for loop structure I'd have a hard time recommending `gcc -Os`. `clang -Os` loops sane here. https://godbolt.org/z/eWMq5s. Turns out gcc, clang, and ICC can't find a way to auto-vectorize this; not surprising really for a prefix sum (serial dependency between very element) *and* has conditional stores, so would need AVX512 masked stores to not suck, along with shuffling. — Peter Cordes, Oct 08 '20 at 09:34

Trying to translate a C function to x86_64 AT&T assembly

1 Answers1