
I want to learn programming in assembly to write fast and efficient code. However, I've stumbled over a problem I can't solve.

I want to loop over an array of double words and add up its elements, like below:

%include "asm_io.inc"  
%macro prologue 0
    push    rbp
    mov     rbp,rsp
    push    rbx
    push    r12
    push    r13
    push    r14
    push    r15
%endmacro
%macro epilogue 0
    pop     r15
    pop     r14
    pop     r13
    pop     r12
    pop     rbx
    leave
    ret
%endmacro

segment .data
string1 db  "result: ",0
array   dd  1, 2, 3, 4, 5

segment .bss


segment .text
global  sum

sum:
    prologue

    mov  rdi, string1
    call print_string

    mov  rbx, array
    mov  rdx, 0
    mov  ecx, 5

lp:
    mov  rax, [rbx]
    add  rdx, rax
    add  rbx, 4
    loop lp

    mov  rdi, rdx
    call print_int
    call print_nl

    epilogue

sum is called by a simple C driver. The functions print_string, print_int and print_nl look like this:

section .rodata
int_format  db  "%i",0
string_format db "%s",0

section .text
global  print_string, print_nl, print_int, read_int
extern printf, scanf, putchar

print_string:
    prologue
    ; string address has to be passed in rdi
    mov     rsi,rdi
    mov     rdi,dword string_format
    xor     rax,rax
    call    printf
    epilogue

print_nl:
    prologue
    mov     rdi,0xA
    xor     rax,rax
    call    putchar
    epilogue

print_int:
    prologue
    ;integer arg is in rdi
    mov     rsi, rdi
    mov     rdi, dword int_format
    xor     rax,rax
    call    printf
    epilogue

When printing the result after summing all the array elements, it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array. Can somebody tell me why the loop skips the first element?

Edit

I forgot to mention that I'm using an x86_64 Linux system.


1 Answer

I'm not sure why your code is printing the wrong number; probably an off-by-one somewhere that you should track down with a debugger. gdb with layout asm and layout reg should help. Actually, I think you're reading one past the end of the array: there's probably a -1 in the bytes after it, and you're adding that to your accumulator.

If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info. Esp. Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. e.g. leave is shorter, but takes 3 uops, compared to mov rsp, rbp / pop rbp taking 2. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, esp. in functions that are worth writing in ASM (i.e. usually everything lives in registers, and you don't call other functions).
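
To make that concrete, here's the same epilogue both ways (identical semantics; the uop counts are the point):

; with leave: smallest encoding, but 3 uops
leave              ; = mov rsp, rbp  +  pop rbp
ret

; without leave: slightly larger, but only 2 uops
mov  rsp, rbp
pop  rbp
ret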


The "normal" way to do this would be write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like

#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint32_t */
#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);

Then even if you compile for Windows, you don't have to change your ASM to look for its args in different registers. (The SysV calling convention is nicer than the Win64 one: more args in registers, and all the vector registers can be used without saving them.) Make sure you have a new enough gcc that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275, though.
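
A minimal driver along those lines might look like this (sum_array and its signature are made up for illustration; they're not from your code):

#include <stdio.h>
#include <stddef.h>

#define SYSV_ABI __attribute__((sysv_abi))

/* asm routine: returns the sum of n dword elements (hypothetical) */
extern int SYSV_ABI sum_array(const int *arr, size_t n);

int main(void) {
    int arr[] = {1, 2, 3, 4, 5};
    printf("result: %d\n", sum_array(arr, 5));  /* printing stays in C */
    return 0;
}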

An alternative is to use some assembler macros to %define some register names so you can assemble the same source for Windows or SysV ABIs. Or have a Windows entry-point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that obviously is less efficient.
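
A minimal sketch of the %define approach (the ARG0/ARG1 names are my own, not a standard):

%ifdef WIN64
    %define ARG0  rcx   ; Windows x64: first two integer args in rcx, rdx
    %define ARG1  rdx
%else
    %define ARG0  rdi   ; SysV: first two integer args in rdi, rsi
    %define ARG1  rsi
%endif

; the same source now assembles for either ABI
add_first_elements:
    mov  eax, [ARG0]
    add  eax, [ARG1]
    ret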


It's useful to know what function calls look like in asm, but writing them yourself is usually a waste of time. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient: a push/pop of every callee-saved register even though you use none of them, and multiple calls instead of a single printf using a format string ending with \n (see the sketch below). I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P
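
For example, one format string with the newline baked in replaces the print_int + print_nl pair, and a tail-call even avoids any stack-alignment fixup (a sketch, not drop-in code):

extern printf

section .rodata
int_nl_format db "%i", 10, 0   ; "%i\n" as a single format string

section .text
print_int_nl:                  ; integer arg in rdi, as before
    mov   rsi, rdi
    mov   rdi, int_nl_format
    xor   eax, eax             ; no vector args for variadic printf
    jmp   printf               ; tail-call: printf returns directly to our caller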

My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.


So, on to your loop:

lp:
    mov  rax, [rbx]
    add  rdx, rax
    add  rbx, 4
    loop lp

Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.
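
Applied to your loop, that looks like this (note the dword load to match the 4-byte dd elements; a qword mov rax, [rbx] reads two of them at once):

    mov  rbx, array
    xor  edx, edx     ; accumulator = 0
    mov  ecx, 5
lp:
    mov  eax, [rbx]   ; dword load
    add  rdx, rax     ; writing eax zero-extended into rax, so this is safe
    add  rbx, 4
    dec  ecx
    jnz  lp           ; dec/jnz macro-fuse into a single uop on Intel SnB-family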

Since you're using a single-register addressing mode, using add rdx, [rbx] would also be more efficient than a separate mov-load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop-cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later).

When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The string instructions (movs, stos) implicitly use them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though; if you want more readability, use C with a good compiler.

;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
mov rsi, array
lea rdx, [rsi + array_length*4]  ; one-past-the-end.  If len is really a compile-time constant, get your assembler to generate it for you.
mov eax, [rsi]   ; load first element into the 1st accumulator
mov ecx, [rsi+4] ; load 2nd element into the 2nd accumulator (ecx is call-clobbered, unlike ebx)
add rsi, 8       ; eliminate this insn by loading array+8 in the first place earlier
; TODO: handle length < 4

ALIGN 16
.loop:
    add eax, [    rsi]
    add ecx, [4 + rsi]
    add rsi, 8
    cmp rsi, rdx
    jb .loop         ;  loop while rsi is Below one-past-the-end
;  TODO: handle odd-length
add eax, ecx
ret

Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.
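
A typical session looks something like this (assuming the binary is named sum):

$ gdb ./sum
(gdb) break sum
(gdb) run
(gdb) layout asm
(gdb) layout regs
(gdb) stepi
(gdb) info registers rdx

stepi executes one instruction at a time, and the two layouts give you a live disassembly and register view while you do it.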

If your arrays are always going to be very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Earlier Intel CPUs can also issue 4 uops per cycle, but have only 3 scalar-integer execution ports, so they fall behind; unrolling to minimize loop overhead would help there.)

All these techniques (esp. multiple accumulators) are equally applicable to vector instructions.

segment .rodata         ; read-only data
ALIGN 16
array:  times 64    dd  1, 2, 3, 4, 5
array_bytes equ $-array
string1 db  "result: ",0

segment .text
; TODO: scalar loop until rsi is aligned
; TODO: handle length < 64 bytes
lea rsi, [array + 32]
lea rdx, [rsi - 32 + array_bytes]  ;  array_length could be a register (or 4*a register, if it's a count).
; lea rdx, [array + array_bytes] ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register.  We don't need rdx until later.
movdqu xmm0, [rsi - 32]   ; load first vector (4 dwords)
movdqu xmm1, [rsi - 16]   ; load 2nd vector
; note the more-efficient loop setup that doesn't need an add rsi, 32.

ALIGN 16
.loop:
    paddd  xmm0, [     rsi]   ; add packed dwords
    paddd  xmm1, [16 + rsi]
    add rsi, 32
    cmp rsi, rdx
    jb .loop         ;  loop: 4 fused-domain uops
paddd   xmm0, xmm1
phaddd  xmm0, xmm0     ; horizontal add: SSSE3 phaddd is simple but not optimal.  Better to pshufd/paddd
phaddd  xmm0, xmm0
movd    eax, xmm0
;  TODO: scalar cleanup loop
ret

Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration does two vectors * four packed ints = 32 bytes of input data.

It should run at one iteration per cycle on Haswell, otherwise 1 iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.

With AVX2 vpaddd ymm1, ymm1, [32+rsi], you get double the throughput (if the data is hot in cache; otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with a vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0, xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.
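
A sketch of that 256b horizontal-sum tail, assuming the combined vector sum is already in ymm0:

vextracti128 xmm1, ymm0, 1   ; high 128b half
vpaddd  xmm0, xmm0, xmm1     ; now it's a 128b problem
vpshufd xmm1, xmm0, 0x4E     ; swap the two 64b halves
vpaddd  xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 0xB1     ; swap adjacent dwords
vpaddd  xmm0, xmm0, xmm1
vmovd   eax, xmm0            ; every lane now holds the full sum

This is the pshufd/paddd pattern mentioned above, which is also a better choice than phaddd in the SSE version.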

  • First of all: thank you :) I know that the print functions are not really efficient; I just followed a tutorial in which they were used. Later this function should process a much bigger array passed from the C driver. I also considered loop unrolling to make the loop faster, but the whole vectorization thing still seems pretty complicated to me. – muXXmit2X Jul 15 '15 at 06:01
  • Yeah, if you're just learning asm, take some baby steps and get your scalar version working. But think about how you could use an instruction that does four 32-bit adds in parallel. If you're going to write in asm, using vector instructions is usually the main reason these days. Otherwise you may get better results just using a compiler that can auto-vectorize it for you (and look at the asm output if you're curious). Write your source code to use a couple of accumulators, to give the compiler a hint about having multiple dep chains running; it's key for short loops on today's wide CPUs. – Peter Cordes Jul 15 '15 at 07:06
  • Definitely use a debugger to step through your code. That should help a lot. – Peter Cordes Jul 15 '15 at 07:09
  • I still have a question about the conditional jump. So you're comparing the addresses stored in `rdx` and `rsi`, and if rdx is smaller than the address in rsi, the CF flag should be set and the `jb` should jump back to `.loop`, right? Shouldn't it then be `cmp rsi, rdx`, so that while rsi is smaller than rdx it jumps back to .loop? – muXXmit2X Jul 15 '15 at 07:49
  • In Intel syntax, `cmp rdx, rsi` sets flags based on `rdx - rsi` (but without updating rdx like `sub` would). I am more used to AT&T syntax where operands go in the opposite order, but I think I have this right. Did you try with a debugger? Let me know if I actually got it wrong, since that's possible. :P – Peter Cordes Jul 15 '15 at 08:00
  • I just tried it, and with `cmp rdx, rsi` it just went through the loop once. After changing it to `cmp rsi, rdx` it worked perfectly ;) – muXXmit2X Jul 15 '15 at 08:12