What is another way of solving it?
There are always lots of ways to do anything. Some will be more efficient than others, and there are different measurements of efficiency. Different efficiency measurements include code-size (in instruction bytes), or performance for small arrays or large arrays. For a real 8086, code size is usually the determining factor for performance, but for modern x86 CPUs, that's definitely not true. (See the x86 tag wiki for links to docs).
There's no need to store 10 in memory; it should be an equ constant. IDK if you're supposed to pretend you're writing a function that doesn't take advantage of everything being assemble-time constants. If so, then just be careful how you use the constant. (Like don't write mov di, n + OFFSET a to have the end-pointer calculated at assemble time.)
You could avoid the slow loop instruction without increasing the number of instructions in the loop by counting an index down from the end of the array, and using an indexed addressing mode.
Also, since your arrays are adjacent, you could just use one loop that starts at the start of a and ends at the end of b:
mov bx, OFFSET a ; no point in using LEA for this
mov si, length_ab - 1 ; index of the last element
xor ax,ax
sum_loop: ; do {
add al, [bx+si]
dec si
jge sum_loop ; } while(si >= 0), so element 0 is included
jmp print_num ; tailcall optimization: print_num will return directly to our caller
;call print_num
;ret
section .rodata
a: db 1,3,5,7,9,11,13,15,17,18
b: db 0,2,4,6,8,10,12,14,16,19
end_b: ; put a label after the end of b
length_ab equ $ - a ; this is NASM syntax, IDK if emu8086 accepts it
n equ 10
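As a sanity check, here's a small Python model of that combined loop (an illustration only, not emu8086 output): it counts the index down from the last element of the 20 adjacent bytes and accumulates into an 8-bit AL that wraps mod 256.

```python
# Model of the combined a+b loop: index counts down from the last
# element to 0, accumulating into an 8-bit AL (wraps mod 256).
a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
data = a + b                      # the arrays are adjacent in memory

al = 0
si = len(data) - 1                # index of the last element
while True:                       # do {
    al = (al + data[si]) & 0xFF   #   add al, [bx+si]
    si -= 1                       #   dec si
    if si < 0:                    # } while (si >= 0)
        break

print(al)                         # low byte of the total -> 190
```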
Or take advantage of a being static: add AL, [a + SI]. That might be slower on a real 8086, since it puts an extra 2 bytes of code inside the loop that the 8086 would have to re-fetch every time. On modern CPUs, saving the mov bx, OFFSET a instruction is worth it for total code-size. (If you were using the same pointer many times in a loop, keeping it in a register can make sense, though.)
If you know your sum won't overflow a byte, you could do 2 bytes in parallel with add ax, [si], and at the end add al, ah. But that's definitely a special case, and handling the general case (avoiding carry into the next byte) with SWAR techniques isn't going to work well with only 2-byte words. In 16-bit code for 386 or newer, you could use 32-bit registers and mask off the odd and even bytes separately.
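A quick Python model (hypothetical, for illustration) shows why the no-overflow condition matters: add ax, [si] treats each pair as one 16-bit value, so any carry out of the low byte would leak into the high byte. The trick is only exact while the running low-byte sum stays under 256, which happens to hold for this data.

```python
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18,
        0, 2, 4, 6, 8, 10, 12, 14, 16, 19]

# Word-at-a-time accumulation: add ax, [si].  A carry out of the
# low byte would corrupt the high byte; here it never happens.
ax = 0
for i in range(0, len(data), 2):
    word = data[i] | (data[i + 1] << 8)   # little-endian word load
    ax = (ax + word) & 0xFFFF

al, ah = ax & 0xFF, ax >> 8
result = (al + ah) & 0xFF                 # add al, ah

print(result)                             # -> 190, matches sum(data) % 256
```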
On some superscalar CPUs (like Intel pre-Sandybridge, which can only execute one load per clock cycle), this would be faster, allowing you to add nearly 2 bytes per clock:
mov si, OFFSET a ; start pointer
xor ax,ax
xor dx,dx
sum_loop: ; do{
mov cx, [si]
add al, cl
add dl, ch
add si, 2
cmp si, OFFSET end_b ; one loop over both adjacent arrays
jb sum_loop ; } while (si < end_pointer)
add al, dl
;; mov ah,0 ; if necessary
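The dual-accumulator idea in Python (a sketch, assuming the end pointer covers both adjacent arrays): even-offset bytes go into one 8-bit accumulator and odd-offset bytes into another, then the two are combined at the end. Because each accumulator wraps mod 256 independently, the final result is still the sum mod 256.

```python
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18,
        0, 2, 4, 6, 8, 10, 12, 14, 16, 19]

al = dl = 0
si = 0
while si < len(data):                  # } while (si < end_pointer)
    cl, ch = data[si], data[si + 1]    # mov cx, [si]  (little-endian)
    al = (al + cl) & 0xFF              # add al, cl
    dl = (dl + ch) & 0xFF              # add dl, ch
    si += 2                            # add si, 2

al = (al + dl) & 0xFF                  # add al, dl
print(al)                              # -> 190
```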
But on other CPUs, you might be better off just unrolling and using add al, [si] / add dl, [si+1] instead of using a separate load instruction.
On CPUs other than Intel P6 and Sandybridge-family, al and ah aren't renamed separately, so add ah, ch would have a false dependency on the full ax register. That's why I used dl instead of ah.
Note that xor ax,ax is not dependency-breaking, at least on modern Intel CPUs (Haswell/Skylake). It does zero AX, but it doesn't remove the out-of-order execution data dependency on the old value of EAX. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. It might be dep-breaking on Sandybridge and earlier, but definitely prefer xor eax,eax for zeroing registers.
If you didn't need your code to be compatible with the obsolete 8086, you could use SSE2 psadbw to do the whole thing in only a couple of steps.
See Sum reduction of unsigned bytes without overflow, using SSE2 on Intel for an explanation.
Your two arrays are 20 bytes total, so we can hard-code that and handle it as 16 + 4.
pxor xmm0, xmm0 ; xmm0 = 0
movd xmm1, [a+16] ; load last 4 bytes
psadbw xmm1, xmm0 ; sum2 = xmm1 = |b[6]-0| + |b[7]-0| + ...
psadbw xmm0, [a] ; horizontal sum 16 bytes into 2 partial sums in the two 64-bit halves (sum0 and sum1)
; then combine those three 16-bit sums into a single sum.
paddw xmm1, xmm0 ; sum2 += sum0
punpckhqdq xmm0, xmm0 ; get the high half of xmm0
paddw xmm1, xmm0 ; sum2 += sum1
movd eax, xmm1
movzx eax, al ; truncate the sum to 8-bit
jmp print_num
section .rodata
ALIGN 16 ; aligning a lets us use [a] as a memory operand, or movdqa
a: ...
b: ...
And yes, this will assemble in 16-bit mode (with NASM for example).
Truncating later instead of after every step is fine for addition, because wrap around or carry-out from the low byte is the same thing.
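To see how the psadbw reduction works, here's a small Python model of the instruction (per 8-byte lane, the sum of absolute byte differences) applied to the 16 + 4 split; the helper and variable names are just for illustration.

```python
def psadbw(x, y):
    """Model of SSE2 psadbw: two 64-bit lanes, each holding
    sum(|x[i] - y[i]|) over its 8 bytes."""
    return [sum(abs(x[i] - y[i]) for i in range(lo, lo + 8)) for lo in (0, 8)]

a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
mem = a + b                       # 20 adjacent bytes

zeros = [0] * 16
tail = mem[16:] + [0] * 12        # movd: last 4 bytes, rest of register zeroed
sum2 = psadbw(tail, zeros)        # |b[6]-0| + ... lands in the low lane
sum01 = psadbw(mem[:16], zeros)   # two partial sums of the first 16 bytes

total = sum2[0] + sum01[0] + sum01[1]   # paddw + punpckhqdq + paddw
al = total & 0xFF                 # movzx eax, al
print(al)                         # -> 190
```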
If you couldn't take advantage of a and b being adjacent, you could:
movdqu xmm0, [a]
movdqu xmm1, [b]
paddb xmm0, xmm1 ; add packed bytes (no carry across byte boundaries)
pslldq xmm0, 6 ; shift out the high 6 bytes from past the end of a and b
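A Python model of that non-adjacent case (the garbage values are stand-ins for whatever happens to follow each 10-byte array in memory): paddb adds per byte with no carry across byte boundaries, and a left byte-shift discards the 6 high bytes that came from past the end, so the garbage never reaches the final sum.

```python
def paddb(x, y):
    """Model of SSE2 paddb: per-byte add, wrapping, no cross-byte carry."""
    return [(xi + yi) & 0xFF for xi, yi in zip(x, y)]

a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
garbage = [0xAB] * 6             # whatever follows each array in memory

v = paddb(a + garbage, b + garbage)  # movdqu loads read 6 bytes past the end
v = [0] * 6 + v[:10]             # left byte-shift by 6: high garbage discarded

print(sum(v) % 256)              # -> 190, the garbage bytes are gone
```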
Or even avoid reading past the end of the array:
movq xmm0, [a]
pinsrw xmm0, [a+8], 4
I just realized that since you apparently do want the sum to wrap to 8 bits, you can use paddb to make this more efficient. For large arrays, you can accumulate with paddb and do one psadbw at the end.
movd xmm1, [a+16] ; load last 4 bytes, zeroing the rest of the register
paddb xmm1, [a]
pxor xmm0, xmm0 ; xmm0 = 0
psadbw xmm1, xmm0 ; horizontal sum one vector of byte-sums
movhlps xmm0, xmm1 ; extract high half into a different register
paddw xmm0, xmm1
movd eax, xmm0 ; the combined sum ended up in xmm0
movzx eax, al ; truncate the sum to 8-bit
jmp print_num
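Here's that final version modeled in Python (helper names are illustrative): one byte-wise paddb folds the 4-byte tail into the first 16 bytes, then a single psadbw does the horizontal reduction, and truncating to 8 bits at the end gives the wrapped sum.

```python
def paddb(x, y):
    """Model of SSE2 paddb: per-byte add, wrapping, no cross-byte carry."""
    return [(xi + yi) & 0xFF for xi, yi in zip(x, y)]

def psadbw(x, y):
    """Model of SSE2 psadbw: one 64-bit sum of |x[i]-y[i]| per 8-byte lane."""
    return [sum(abs(x[i] - y[i]) for i in range(lo, lo + 8)) for lo in (0, 8)]

a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
mem = a + b

xmm1 = mem[16:] + [0] * 12       # movd: last 4 bytes, upper bytes zeroed
xmm1 = paddb(xmm1, mem[:16])     # paddb: byte sums wrap, carries never cross bytes
sums = psadbw(xmm1, [0] * 16)    # one horizontal sum of the vector of byte-sums
al = (sums[0] + sums[1]) & 0xFF  # movhlps + paddw, then truncate to 8 bits
print(al)                        # -> 190
```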