1

I am currently learning how to convert C code into assembly x86 64 bit code. The code I have is:

long add_arrays(long *arr1, long *arr2, unsigned long num, long *result)
{
    unsigned long sum = 0, i = 0; 
    while(i < num) {
        result[i] = arr1[i] + arr2[i]; 
        sum = sum + result[i]; 
        i++;
    }
    return sum; 
}

When I convert it into assembly, I get an output of either 0 or an address number. My code is:

        .global add_arrays_asm

add_arrays_asm:
        xor %r8, %r8

while_start:
        cmp %r8, %rdx
        jge while_break

        movq (%rdi, %r8, 8), %r9 
        addq (%rsi, %r8, 8), %r9 
        movq (%rcx, %r8, 8), %r10 
        movq (%r9, %r8, 8), %rcx 
        addq (%r10, %r8, 8), %rax 
        addq (%rcx,%r8, 8),  %rax
        inc %r8
        jmp while_start

while_break:
        ret

And the code I'm using to test it is:

printf("Testing add_arrays_asm\n");
    long l3[] = {3, 23, 32, 121, 0, 43, 95, 4};
    long l4[] = {-823,12,-1222,84834, -328, 0, 9, -1387};
    long res1[] = {0, 0, 0, 0, 0, 0, 0, 0}; 
    long sum1 = add_arrays_asm(l3, l4, 8, res1); 
    int j = 0; 
    for(j = 0; j < 8; j++) {
        printf("%8ld + %8ld = %8ld\n", l3[j], l4[j], res1[j]); 
    }
    printf("        Sum = %ld\n\n", sum1); 

I am new to coding in assembly so I cannot find where the code is going wrong. Any help would be appreciated. Thank you.

  • 1
    `movq (%rcx, %r8, 8), %r10`: What's the point of loading the value of `result[i]` into a register since it's just garbage that you intend to overwrite (but you never actually do it, another bug)? `movq (%r9, %r8, 8), %rcx`: The comment on this is inaccurate: it is actually something like `rcx = r9[i]`. Since `r9` contains the result of adding `arr1[i]+arr2[i]` and this is not a pointer, this line makes no sense and is likely to crash. – Nate Eldredge Apr 18 '21 at 17:21

1 Answers1

4

Your conditional jump isn't correct.

    cmp %r8, %rdx
    jge while_break

This means

if (rdx >= r8)
    break;

So you directly break the loop without calculating anything. Also, you should use jae instead of jge for unsigned comparison, as you declared it as unsigned long num. Though it may not be a problem if the value isn't overflowing max signed integer such that it be negative.

Another thing your load and store don't seem to be correct. Beside that, you better keep memory access minimal, no need to reload from memory if you don't really need it.

Also, you forgot to zero your %rax, you should have sum = 0 at the beginning of your calculation.

Fix for this

I reworked your Assembly code.

func.S


.section .text
.global add_arrays_asm

add_arrays_asm:
    #rdi -> arr1
    #rsi -> arr2
    #rdx -> num
    #rcx -> result
    #r8 -> i

    xorl    %r8d, %r8d          # i = 0
    xorl    %eax, %eax          # sum = 0
while_start:
    cmpq    %rdx, %r8           # if (i >= num)
    jae     while_break         #   goto while_break

    movq    (%rdi, %r8, 8), %r9 # r9  = arr1[i]
    addq    (%rsi, %r8, 8), %r9 # r9 += arr2[i]
    movq    %r9, (%rcx, %r8, 8) # result[i] = r9
    addq    %r9, %rax           # sum += r9
    incq    %r8                 # i++
    jmp     while_start         # goto while_start

while_break:
    ret

main.c


#include <stdio.h>

long add_arrays_asm(long *arr1, long *arr2, unsigned long num, long *result);

int main()
{
    printf("Testing add_arrays_asm\n");
    long l3[] = {3, 23, 32, 121, 0, 43, 95, 4};
    long l4[] = {-823,12,-1222,84834, -328, 0, 9, -1387};
    long res1[] = {0, 0, 0, 0, 0, 0, 0, 0}; 
    long sum1 = add_arrays_asm(l3, l4, 8, res1); 
    int j = 0; 
    for(j = 0; j < 8; j++) {
        printf("%8ld + %8ld = %8ld\n", l3[j], l4[j], res1[j]); 
    }
    printf("        Sum = %ld\n\n", sum1); 
}

Execution

ammarfaizi2@integral:/tmp/testz$ gcc -Wall -Wextra func.S main.c -o main
ammarfaizi2@integral:/tmp/testz$ valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ./main
==748530== Memcheck, a memory error detector
==748530== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==748530== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==748530== Command: ./main
==748530== 
Testing add_arrays_asm
       3 +     -823 =     -820
      23 +       12 =       35
      32 +    -1222 =    -1190
     121 +    84834 =    84955
       0 +     -328 =     -328
      43 +        0 =       43
      95 +        9 =      104
       4 +    -1387 =    -1383
        Sum = 81416

==748530== 
==748530== HEAP SUMMARY:
==748530==     in use at exit: 0 bytes in 0 blocks
==748530==   total heap usage: 1 allocs, 1 frees, 1,024 bytes allocated
==748530== 
==748530== All heap blocks were freed -- no leaks are possible
==748530== 
==748530== For lists of detected and suppressed errors, rerun with: -s
==748530== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
ammarfaizi2@integral:/tmp/testz$ 
Ammar Faizi
  • 1,393
  • 2
  • 11
  • 26
  • 1
    @Bob: See also [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926) - compilers will slightly transform your loop so it can have a conditional branch at the *bottom*, and no `jmp`. https://godbolt.org/z/T4jdMEzaM shows GCC and clang -O3 (with auto-vectorization and loop-unrolling disabled) compiling this way - clang's is clearly better than GCC's (not wasting instructions copying pointer args), although I'd have xor-zeroed EAX ahead of the n==0 test to avoid a separate loop-exit path. – Peter Cordes Apr 18 '21 at 23:41
  • @PeterCordes wew, that's surprising. I don't see any good reason for GCC to copy those registers. At a glance, I think it tries to minimize the usage of `r8` and `r9`. – Ammar Faizi Apr 19 '21 at 00:53
  • @AmmarFaizi: Yeah, bizarre to me, too. Seems to be a GCC10 regression vs. GCC9. Other than the wasted copy, it already needs a REX prefix (for 64-bit operand-size) on the instructions in the loop that use those regs, which is why it normally prefers "legacy" registers, but it still makes no sense. (Same r8, r9 copy if you let it auto-vectorize, then `movdqu` only needs a REX for the regs in the addressing mode). – Peter Cordes Apr 19 '21 at 01:15
  • @PeterCordes oh right, GCC 9.3 looks pretty much identical with Clang 12.0.0 for this case. Hopefully the change between GCC 9.3 and 10.x is intentionally by design and must have a rationale (we hope). – Ammar Faizi Apr 19 '21 at 02:08
  • 2
    Talking about xor-zeroed EAX, using `if` statement and `goto` before the `while` loop helps GCC to generate single `xor %eax, %eax` and only single `ret` (that's good). But not for Clang. https://godbolt.org/z/x3To3jbE5 – Ammar Faizi Apr 19 '21 at 02:09
  • 1
    @AmmarFaizi: Very unlikely to be intentional, much more likely a regression introduced by some other optimization pass that helps something else, but nobody noticed that it hurt this case. Do you want to report it on https://gcc.gnu.org/bugzilla (with keyword "missed-optimization"), or should I? Oh nevermind, it's already fixed on (trunk). https://godbolt.org/z/7vxMPxfcq. So GCC11, or maybe 10.4, should have the fix. Interesting that you were able to get GCC to avoid unnecessary tail-duplication by making the source more like the asm you wanted, instead of letting an optimization pass do it – Peter Cordes Apr 19 '21 at 02:18