Nice work on debugging your problem yourself. Since I already started to look at the code, I'll give you some efficiency / style critique as added comments:
%macro prologue 0
push rbp
mov rbp,rsp ; you can drop this and the LEAVE.
; Stack frames were useful before debuggers could keep track of things without them, and as a convenience:
; local variables stay at the same offset from the base pointer even while you're pushing/popping stuff on the stack.
; With the SysV ABI, you can use the red zone for locals without even
; fiddling with RSP at all, if you don't push/pop or call anything.
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss ; These should really be locals on the stack (or in regs!), not globals
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
; usually a comment with a C function prototype and description is a good idea for functions
global avgArray
avgArray:
prologue
mov [a1], rdi ; what is this silliness? you have 16 registers for a reason.
mov [a2], rsi ; shuffling the values you want into the regs you want them in
mov [avg], rdx ; is best done with reg-reg moves.
mov [avgL], rcx ; I like to just put a comment at the top of a block of code
; to document what goes in what reg.
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; This could be lea rcx, [rsi+rcx]
; (since avgL is in rcx anyway as a function arg).
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop: ; you can use a local label here, starting with a .
; You don't need a diff name for each loop: local labels are scoped to the preceding non-local label, so every function can reuse the same .loop
mov al, [rsi] ; there's a data dependency on the old value of ax
mov dl, [r9] ; since the CPU doesn't "know" that shr ax, 1 will always leave ah zeroed in this algorithm
add ax, dx ; Avoid ALU ops on 16bit regs whenever possible, to avoid decode stalls on Intel.
; (8bit is fine: 8bit ops have different opcodes instead of an operand-size prefix.)
shr ax, 1 ; Better to use 32bit regs (movsx/movzx)
mov [rdi], al
add rsi, [offset] ; These are 64bit adds, so you're reading 7 bytes after the 1 you set with db.
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
You have tons of registers free, why are you keeping the loop increment in memory? I hope it just ended up that way while debugging / trying stuff.
Also, 1-reg addressing modes are only more efficient when used as mem operands for ALU ops. Just increment a single counter and use base+offset*scale addressing when you have a lot of pointers (unless you're unrolling the loop), esp. if you load them with mov.
Here's how I'd do it (with perf analysis for Intel SnB and later):
scalar
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; if you can choose your prototype, do it so args go where you want them anyway.
; prologue
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; mov [rsp-8], rcx ; if I wanted to spill len to memory
add rdi, rcx
add rsi, rcx
add rdx, rcx
neg rcx ; now [rdi+rcx] is the start of dest, and we can count rcx upwards towards zero.
; We could also have just counted down towards zero
; but HW memory prefetchers have more stream slots for forward patterns than reverse.
ALIGN 16
.loop:
; use movsx for signed char
movzx eax, byte [rsi+rcx] ; dependency-breaker (and NASM needs the byte size specifier here)
movzx r8d, byte [rdx+rcx] ; Using r8d to save push/pop of rbx
; on pre-Nehalem where insn decode can be a bottleneck even in tight loops
; using ebx or ebp would save a REX prefix (1 insn byte).
add eax, r8d
shr eax, 1
mov [rdi+rcx], al
inc rcx ; No cmp needed: this is the point of counting up towards zero
jl .loop ; inc/jl can Macro-fuse into one uop
; nothing to pop, we only used caller-saved regs.
ret
On Intel, the loop is 7 uops (the store is 2 uops: store-address and store-data, and can't micro-fuse), so a CPU that can issue 4 uops per cycle will do it at 2 cycles per byte. movzx
(to a 32 or 64bit reg) is 1 uop regardless, because there isn't a port 0/1/5 uop for it to micro-fuse with or not. (It's a read, not read-modify).
7 uops takes 2 chunks of up-to-4 uops, so the loop can issue in 2 cycles. There are no other bottlenecks that should prevent the execution units from keeping up with that, so it should run one per 2 cycles.
vector
There's a vector instruction to do almost exactly this operation: PAVGB
is packed avg of unsigned bytes (with a 9-bit temporary to avoid overflow, like your add/shr). Note one difference: PAVGB rounds, computing (a+b+1)>>1, while your add/shr truncates.
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; same setup
; TODO: scalar loop here until [rdx+rcx] is aligned.
ALIGN 16
.loop:
movdqu xmm0, [rsi+rcx] ; 1 uop
pavgb xmm0, [rdx+rcx] ; 2 uops (no micro-fusion)
movdqu [rdi+rcx], xmm0 ; 2 uops: no micro-fusion
add rcx, 16
jl .loop ; 1 macro-fused uop add/branch
; TODO: scalar cleanup.
ret
Getting the loop-exit condition right is tricky, since you need to end the vector loop if the next 16B goes off the end of the array. Prob. best to handle that by decrementing rcx by 15 or something before adding it to the pointers.
So again, 6 uops / 2 cycles per iteration, but each iteration will do 16 bytes. It's ideal to unroll so your loop is a multiple of 4 uops, so you're not losing out on issue rate with a cycle of less-than-4 uops at the end of a loop. 2 loads / 1 store per cycle is our bottleneck here, since PAVGB
has a throughput of 2 per cycle.
16B / cycle shouldn't be difficult on Haswell and later. With AVX2 using ymm registers, you'd get 32B / cycle. (SnB/IvB can only do two memory ops per cycle, at most one of them a store, unless you use 256b loads/stores). Anyway, at this point you've gained a massive 16x speedup from vectorizing, and usually that's good enough. I just enjoy tuning things for theoretical-max throughput by counting uops and unrolling. :)
If you were going to unroll the loop at all, then it would be worth incrementing pointers instead of just an index. (So there would be two uses of [rdx] and one add, vs. two uses of [rdx+rcx]).
Either way, cleaning up the loop setup and keeping everything in registers saves a decent amount of instruction bytes, and overhead for short arrays.