1

So this should be a very easy question to answer.

Suppose I'm given two very large variables a and b, and I'd like to subtract b from a. Let's say that each of these variables are 30 words long. I couldn't just use the sub instructor could I? I was taught that it would default to sub.w and would only subtract the first word of b from the first word of a.

So how would I go about doing this?

Sam Taylor
  • 21
  • 2

1 Answers1

2

To perform a multiword subtraction you start at the low-order word and subtract corresponding words from both variables. Use the sbb instruction to handle the borrow, and use sub only on the 1st subtraction.

mov  dx, [ebx]     ;First word of b
sub  [eax], dx     ;Subtract from 1st word of a
mov  dx, [ebx+2]   ;Second word of b
sbb  [eax+2], dx   ;Subtract from 2nd word of a
mov  dx, [ebx+4]   ;Third word of b
sbb  [eax+4], dx   ;Subtract from 3rd word of a
...
mov  dx, [ebx+58]  ;Thirtieth word of b
sbb  [eax+58], dx  ;Subtract from 30th word of a

A more practical solution uses a loop:

mov  ecx, 30
xor  esi, esi      ;This clears the CF, needed for the very first SBB
Again:
mov  dx, [ebx+esi]
sbb  [eax+esi], dx
lea  esi, [esi+2]
loop Again         ; loop without clobbering CF.

There are better ways to write fast adc / sbb loops, but the optimal choice varies between microarchitectures. One simple way to reduce overhead from the slow loop instruction is to unroll a bit.

mov  ecx, 15
xor  esi, esi      ;This clears the CF, needed for the very first SBB
Again:
mov  dx, [ebx+esi]
sbb  [eax+esi], dx
mov  dx, [ebx+esi+2]
sbb  [eax+esi+2], dx
lea  esi, [esi+4]
loop Again

A next step in optimizing this task is stop using the 16-bit register DX, but rather use the larger EDX register. This would halve the number of instructions in the totally unrolled version or more than halve the number of iterations in the looped version. We can do this because "variables that are 30 words long" can be thought of as being "variables that are 15 doublewords long".

This is the totally unrolled version:

mov  edx, [ebx]    ;First dword of b
sub  [eax], edx    ;Subtract from 1st dword of a
mov  edx, [ebx+4]  ;Second dword of b
sbb  [eax+4], edx  ;Subtract from 2nd dword of a
mov  edx, [ebx+8]  ;Third dword of b
sbb  [eax+8], edx  ;Subtract from 3rd dword of a
...
mov  edx, [ebx+56] ;Fifteenth dword of b
sbb  [eax+56], edx ;Subtract from 15th dword of a

and the partially unrolled looped version:

mov  ecx, 5
clc                ;This clears the CF, needed for the very first SBB
Again:
mov  edx, [ebx]    ; <1>
sbb  [eax], edx
mov  edx, [ebx+4]  ; <2>
sbb  [eax+4], edx
mov  edx, [ebx+8]  ; <3>
sbb  [eax+8], edx
lea  ebx, [ebx+12]
lea  eax, [eax+12]
loop Again

Obviously on x86-64 similarly using RDX would improve this code even further. Note though that 30 words correspond to 7 qwords and 1 dword.

Community
  • 1
  • 1
Sep Roland
  • 33,889
  • 7
  • 43
  • 76
  • add and cmp clobber flags. There's no "ideal" way to do multi-precision loops, because `dec / jnz` causes a partial-flags stall, and `loop` is slow. Unrolling a bit and using something clunky helps. – Peter Cordes May 02 '16 at 04:48
  • @PeterCordes Thanks for spotting that `add` and `cmp` clobbered the flags. I edited the answer. – Sep Roland May 02 '16 at 13:29
  • Your edit didn't highlight the fact that looping without clobbering CF is non-trivial. I made an edit, shamelessly linking to my own answers. Feel free to find other links that explain things more concisely than I do. I tend to link my own answers because I remember they exist and can find them easily. :P – Peter Cordes May 02 '16 at 14:38
  • @PeterCordes Would it be a good idea to use 3 different registers EDX, ESI, and EDI in my final code snippet? I marked the lines with <1>, <2>, and <3>. So `mov edx, [ebx]` `mov esi, [ebx+4]` `mov edi, [ebx+8]`. – Sep Roland May 03 '16 at 13:33
  • 1
    Reusing the same register repeatedly is fine; all out-of-order x86 CPUs do register renaming. `mov edx, [ebx+4]` breaks the dependency on the previous value of `edx`, because it's write-only. (You've probably read that `xor edx,edx` is also dependency-breaking. That's a special-case for xor, but the usual case for `mov`.) [Intel CPUs only have one known case of a write-only operand having an unexpected dependency on its input: `popcnt` (and maybe `lzcnt`/`tzcnt`)](http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance). – Peter Cordes May 03 '16 at 14:43