Nice work on debugging your problem yourself. Since I already started to look at the code, I'll give you some efficiency / style critique as added comments:
%macro prologue 0
push rbp
mov rbp,rsp ; you can drop this and the LEAVE.
; Stack frames were useful before debuggers could keep track of things without them, and as a convenience:
; local variables stay at the same offset from the base pointer even while you're pushing/popping stuff on the stack.
; With the SysV ABI, you can use the red zone for locals without even
; fiddling with RSP at all, if you don't push/pop or call anything.
push rbx
push r12
push r13
push r14
push r15
%endmacro
%macro epilogue 0
pop r15
pop r14
pop r13
pop r12
pop rbx
leave
ret
%endmacro
segment .data
offset db 1
segment .bss ; These should really be locals on the stack (or in regs!), not globals
a1 resq 1
a2 resq 1
avg resq 1
avgL resd 1
segment .text
; usually a comment with a C function prototype and description is a good idea for functions
global avgArray
avgArray:
prologue
mov [a1], rdi ; what is this silliness? you have 16 registers for a reason.
mov [a2], rsi ; shuffling the values you want into the regs you want them in
mov [avg], rdx ; is best done with reg-reg moves.
mov [avgL], rcx ; I like to just put a comment at the top of a block of code
; to document what goes in what reg.
mov rsi, [a1]
mov r9, [a2]
mov rdi, [avg]
mov rcx, rsi
add rcx, [avgL] ; This could be lea rcx, [rsi+rcx]
; (since avgL is in rcx anyway as a function arg).
xor rdx, rdx
xor rax, rax
xor rbx, rbx
avgArray_loop: ; you can use a local label here, starting with a .
; You don't need a diff name for each loop: local labels are scoped to the preceding non-local label, so every function can reuse the same .loop
mov al, [rsi] ; there's a data dependency on the old value of ax
mov dl, [r9] ; since the CPU doesn't "know" that shr ax, 1 will always leave ah zeroed in this algorithm
add ax, dx ; Avoid ALU ops on 16bit regs whenever possible, to avoid decode stalls on Intel.
; (8bit is fine: 8bit ops have different opcodes instead of an operand-size prefix.)
shr ax, 1 ; Better to use 32bit regs (movsx/movzx)
mov [rdi], al
add rsi, [offset] ; These are 64bit adds, so you're reading 7 bytes after the 1 you set with db.
add r9, [offset]
add rdi, [offset]
cmp rsi, rcx
jb avgArray_loop
epilogue
You have tons of registers free, why are you keeping the loop increment in memory? I hope it just ended up that way while debugging / trying stuff.
Also, 1-reg addressing modes are only more efficient when used as mem operands for ALU ops. Just increment a single counter and use base+offset*scale addressing when you have a lot of pointers (unless you're unrolling the loop), esp. if you load them with mov.
Here's how I'd do it (with perf analysis for Intel SnB and later):
scalar
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; if you can choose your prototype, do it so args go where you want them anyway.
; prologue
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; mov [rsp-8], rcx ; if I wanted to spill len to memory
add rdi, rcx
add rsi, rcx
add rdx, rcx
neg rcx ; now [rdi+rcx] is the start of dest, and we can count rcx upwards towards zero.
; We could also have just counted down towards zero
; but HW memory prefetchers have more stream slots for forward patterns than reverse.
ALIGN 16
.loop:
; use movsx for signed char
movzx eax, byte [rsi+rcx] ; dependency-breaker (and NASM needs the byte size specifier here)
movzx r8d, byte [rdx+rcx] ; Using r8d to save push/pop of rbx
; on pre-Nehalem where insn decode can be a bottleneck even in tight loops
; using ebx or ebp would save a REX prefix (1 insn byte).
add eax, r8d
shr eax, 1
mov [rdi+rcx], al
inc rcx ; No cmp needed: this is the point of counting up towards zero
jl .loop ; inc/jl can Macro-fuse into one uop
; nothing to pop, we only used caller-saved regs.
ret
On Intel, the loop is 7 uops (the store is 2 uops: store-address and store-data, and can't micro-fuse), so a CPU that can issue 4 uops per cycle will do it at 2 cycles per byte. movzx
(to a 32 or 64bit reg) is 1 uop regardless, because there isn't a port 0/1/5 uop for it to micro-fuse with or not. (It's a read, not read-modify).
7 uops takes 2 chunks of up-to-4 uops, so the loop can issue in 2 cycles. There are no other bottlenecks that should prevent the execution units from keeping up with that, so it should run one per 2 cycles.
vector
There's a vector instruction to do almost exactly this operation: PAVGB
is packed avg of unsigned bytes (with a 9-bit temporary to avoid overflow, like your add/shr). Note one difference: PAVGB rounds, computing (a+b+1)>>1, while your add/shr truncates.
; no storage needed
segment .text
GLOBAL avgArray
avgArray:
; void avgArray (uint8_t *avg, const uint8_t *a1, const uint8_t *a2, size_t len)
; rdi = avg
; rsi = a1
; rdx = a2
; rcx = len
; same setup
; TODO: scalar loop here until [rdx+rcx] is aligned.
ALIGN 16
.loop:
movdqu xmm0, [rsi+rcx] ; 1 uop
pavgb xmm0, [rdx+rcx] ; 2 uops (no micro-fusion)
movdqu [rdi+rcx], xmm0 ; 2 uops: no micro-fusion
add rcx, 16
jl .loop ; 1 macro-fused uop add/branch
; TODO: scalar cleanup.
ret
Getting the loop-exit condition right is tricky, since you need to end the vector loop if the next 16B goes off the end of the array. Prob. best to handle that by decrementing rcx by 15 or something before adding it to the pointers.
So again, 6 uops / 2 cycles per iteration, but each iteration will do 16 bytes. It's ideal to unroll so your loop is a multiple of 4 uops, so you're not losing out on issue rate with a cycle of less-than-4 uops at the end of a loop. 2 loads / 1 store per cycle is our bottleneck here, since PAVGB
has a throughput of 2 per cycle.
16B / cycle shouldn't be difficult on Haswell and later. With AVX2 using ymm registers, you'd get 32B / cycle. (SnB/IvB can only do two memory ops per cycle, at most one of them a store, unless you use 256b loads/stores). Anyway, at this point you've gained a massive 16x speedup from vectorizing, and usually that's good enough. I just enjoy tuning things for theoretical-max throughput by counting uops and unrolling. :)
If you were going to unroll the loop at all, then it would be worth incrementing pointers instead of just an index. (So there would be two uses of [rdx] and one add, vs. two uses of [rdx+rcx]).
Either way, cleaning up the loop setup and keeping everything in registers saves a decent amount of instruction bytes, and overhead for short arrays.