What is another way of solving it?
There are always lots of ways to do anything. Some will be more efficient than others, and there are different measurements of efficiency. Different efficiency measurements include code-size (in instruction bytes), or performance for small arrays or large arrays. For a real 8086, code size is usually the determining factor for performance, but for modern x86 CPUs, that's definitely not true. (See the x86 tag wiki for links to docs).
There's no need to store 10 in memory; it should be an equ constant. IDK if you're supposed to pretend you're writing a function that doesn't take advantage of everything being assemble-time constants. If so, then just be careful how you use the constant. (Like don't write mov di, n + OFFSET a to have the end-pointer calculated at assemble time.)
You could avoid the slow loop instruction without increasing the number of instructions in the loop by counting an index down from the end of the array, and using an indexed addressing mode.
Also, since your arrays are adjacent, you could just use one loop that starts at the start of a and ends at the end of b:
mov bx, OFFSET a ; no point in using LEA for this
mov si, length_ab - 1 ; index of the last element
xor ax,ax
sum_loop: ; do {
add al, [bx+si]
dec si
jge sum_loop ; } while(si >= 0), so element 0 is included
jmp print_num ; tailcall optimization: print_num will return directly to our caller
;call print_num
;ret
section .rodata
a: db 1,3,5,7,9,11,13,15,17,18
b: db 0,2,4,6,8,10,12,14,16,19
end_b: ; put a label after the end of b
length_ab equ $ - a ; this is NASM syntax, IDK if emu8086 accepts it
n equ 10
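As a sanity check, here's a small Python model of that combined loop (an illustration only, not emu8086 output): it counts the index down from the last element of the 20 adjacent bytes and accumulates into an 8-bit AL that wraps mod 256.

```python
# Model of the combined a+b loop: index counts down from the last
# element to 0, accumulating into an 8-bit AL (wraps mod 256).
a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
data = a + b                      # the arrays are adjacent in memory

al = 0
si = len(data) - 1                # index of the last element
while True:                       # do {
    al = (al + data[si]) & 0xFF   #   add al, [bx+si]
    si -= 1                       #   dec si
    if si < 0:                    # } while (si >= 0)
        break

print(al)                         # low byte of the total -> 190
```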
Or take advantage of a being static: add AL, [a + SI]. That might be slower on a real 8086, since it puts an extra 2 bytes of code inside the loop that the 8086 would have to re-fetch every time. On modern CPUs, saving the mov bx, OFFSET a instruction is worth it for total code-size. (If you were using the same pointer many times in a loop, keeping it in a register can make sense, though.)
If you know your sum won't overflow a byte, you could do 2 bytes in parallel with add ax, [si], and at the end add al, ah. But that's definitely a special case, and handling the general case (avoiding carry into the next byte) with SWAR techniques isn't going to work well with only 2-byte words. In 16-bit code for 386 or newer, you could use 32-bit registers and mask off the odd and even bytes separately.
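A quick Python model (hypothetical, for illustration) shows why the no-overflow condition matters: add ax, [si] treats each pair as one 16-bit value, so any carry out of the low byte would leak into the high byte. The trick is only exact while the running low-byte sum stays under 256, which happens to hold for this data.

```python
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18,
        0, 2, 4, 6, 8, 10, 12, 14, 16, 19]

# Word-at-a-time accumulation: add ax, [si].  A carry out of the
# low byte would corrupt the high byte; here it never happens.
ax = 0
for i in range(0, len(data), 2):
    word = data[i] | (data[i + 1] << 8)   # little-endian word load
    ax = (ax + word) & 0xFFFF

al, ah = ax & 0xFF, ax >> 8
result = (al + ah) & 0xFF                 # add al, ah

print(result)                             # -> 190, matches sum(data) % 256
```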
On some superscalar CPUs (like Intel pre-Sandybridge, which can only execute one load per clock cycle), this would be faster, allowing you to add nearly 2 bytes per clock:
mov si, OFFSET a ; start pointer
xor ax,ax
xor dx,dx
sum_loop: ; do{
mov cx, [si]
add al, cl
add dl, ch
add si, 2
cmp si, OFFSET end_b ; one loop over both adjacent arrays
jb sum_loop ; } while (si < end_pointer)
add al, dl
;; mov ah,0 ; if necessary
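The dual-accumulator idea in Python (a sketch, assuming the end pointer covers both adjacent arrays): even-offset bytes go into one 8-bit accumulator and odd-offset bytes into another, then the two are combined at the end. Because each accumulator wraps mod 256 independently, the final result is still the sum mod 256.

```python
data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18,
        0, 2, 4, 6, 8, 10, 12, 14, 16, 19]

al = dl = 0
si = 0
while si < len(data):                  # } while (si < end_pointer)
    cl, ch = data[si], data[si + 1]    # mov cx, [si]  (little-endian)
    al = (al + cl) & 0xFF              # add al, cl
    dl = (dl + ch) & 0xFF              # add dl, ch
    si += 2                            # add si, 2

al = (al + dl) & 0xFF                  # add al, dl
print(al)                              # -> 190
```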
But on other CPUs, you might be better off just unrolling and using add al, [si] / add dl, [si+1] instead of using a separate load instruction.
On CPUs other than Intel P6 and Sandybridge-family, al and ah aren't renamed separately, so add ah, ch would have a false dependency on the full ax register. That's why I used dl instead of ah.
Note that xor ax,ax is not dependency-breaking, at least on modern Intel CPUs (Haswell/Skylake). It does zero AX, but it doesn't remove the out-of-order execution data dependency on the old value of EAX. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. It might be dep-breaking on Sandybridge and earlier, but definitely prefer xor eax,eax for zeroing registers.
If you didn't need your code to be compatible with the obsolete 8086, you could use SSE2 psadbw to do the whole thing in only a couple of steps.
See Sum reduction of unsigned bytes without overflow, using SSE2 on Intel for an explanation.
Your two arrays are 20 bytes total, so we can hard-code that and handle it as 16 + 4.
pxor xmm0, xmm0 ; xmm0 = 0
movd xmm1, [a+16] ; load last 4 bytes
psadbw xmm1, xmm0 ; sum2 = xmm1 = |b[6]-0| + |b[7]-0| + ...
psadbw xmm0, [a] ; horizontal sum 16 bytes into 2 partial sums in the two 64-bit halves (sum0 and sum1)
; then combine those three 16-bit sums into a single sum.
paddw xmm1, xmm0 ; sum2 += sum0
punpckhqdq xmm0, xmm0 ; get the high half of xmm0
paddw xmm1, xmm0 ; sum2 += sum1
movd eax, xmm1
movzx eax, al ; truncate the sum to 8-bit
jmp print_num
section .rodata
ALIGN 16 ; aligning a lets us use [a] as a memory operand, or movdqa
a: ...
b: ...
And yes, this will assemble in 16-bit mode (with NASM for example).
Truncating later instead of after every step is fine for addition, because wrap around or carry-out from the low byte is the same thing.
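To see how the psadbw reduction works, here's a small Python model of the instruction (per 8-byte lane, the sum of absolute byte differences) applied to the 16 + 4 split; the helper and variable names are just for illustration.

```python
def psadbw(x, y):
    """Model of SSE2 psadbw: two 64-bit lanes, each holding
    sum(|x[i] - y[i]|) over its 8 bytes."""
    return [sum(abs(x[i] - y[i]) for i in range(lo, lo + 8)) for lo in (0, 8)]

a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
mem = a + b                       # 20 adjacent bytes

zeros = [0] * 16
tail = mem[16:] + [0] * 12        # movd: last 4 bytes, rest of register zeroed
sum2 = psadbw(tail, zeros)        # |b[6]-0| + ... lands in the low lane
sum01 = psadbw(mem[:16], zeros)   # two partial sums of the first 16 bytes

total = sum2[0] + sum01[0] + sum01[1]   # paddw + punpckhqdq + paddw
al = total & 0xFF                 # movzx eax, al
print(al)                         # -> 190
```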
If you couldn't take advantage of a and b being adjacent, you could:
movdqu xmm0, [a]
movdqu xmm1, [b]
paddb xmm0, xmm1 ; add packed bytes (no carry across byte boundaries)
pslldq xmm0, 6 ; shift out the high 6 bytes from past the end of a and b
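A Python model of that non-adjacent case (the garbage values are stand-ins for whatever happens to follow each 10-byte array in memory): paddb adds per byte with no carry across byte boundaries, and a left byte-shift discards the 6 high bytes that came from past the end, so the garbage never reaches the final sum.

```python
def paddb(x, y):
    """Model of SSE2 paddb: per-byte add, wrapping, no cross-byte carry."""
    return [(xi + yi) & 0xFF for xi, yi in zip(x, y)]

a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
garbage = [0xAB] * 6             # whatever follows each array in memory

v = paddb(a + garbage, b + garbage)  # movdqu loads read 6 bytes past the end
v = [0] * 6 + v[:10]             # left byte-shift by 6: high garbage discarded

print(sum(v) % 256)              # -> 190, the garbage bytes are gone
```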
Or even avoid reading past the end of the array:
movq xmm0, [a]
pinsrw xmm0, [a+8], 4
I just realized that since you apparently do want the sum to wrap to 8 bits, you can use paddb to make this more efficient. For large arrays, you can accumulate with paddb and do one psadbw at the end.
movd xmm1, [a+16] ; load last 4 bytes, zeroing the rest of the register
paddb xmm1, [a]
pxor xmm0, xmm0 ; xmm0 = 0
psadbw xmm1, xmm0 ; horizontal sum one vector of byte-sums
movhlps xmm0, xmm1 ; extract high half into a different register
paddw xmm0, xmm1
movd eax, xmm0 ; the combined sum ended up in xmm0
movzx eax, al ; truncate the sum to 8-bit
jmp print_num
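Here's that final version modeled in Python (helper names are illustrative): one byte-wise paddb folds the 4-byte tail into the first 16 bytes, then a single psadbw does the horizontal reduction, and truncating to 8 bits at the end gives the wrapped sum.

```python
def paddb(x, y):
    """Model of SSE2 paddb: per-byte add, wrapping, no cross-byte carry."""
    return [(xi + yi) & 0xFF for xi, yi in zip(x, y)]

def psadbw(x, y):
    """Model of SSE2 psadbw: one 64-bit sum of |x[i]-y[i]| per 8-byte lane."""
    return [sum(abs(x[i] - y[i]) for i in range(lo, lo + 8)) for lo in (0, 8)]

a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 18]
b = [0, 2, 4, 6, 8, 10, 12, 14, 16, 19]
mem = a + b

xmm1 = mem[16:] + [0] * 12       # movd: last 4 bytes, upper bytes zeroed
xmm1 = paddb(xmm1, mem[:16])     # paddb: byte sums wrap, carries never cross bytes
sums = psadbw(xmm1, [0] * 16)    # one horizontal sum of the vector of byte-sums
al = (sums[0] + sums[1]) & 0xFF  # movhlps + paddw, then truncate to 8 bits
print(al)                        # -> 190
```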