To perform a multiword subtraction you start at the low-order word and subtract corresponding words from both variables. Use the sbb
instruction to handle the borrow, and use sub
only on the 1st subtraction.
mov dx, [ebx] ;First word of b
sub [eax], dx ;Subtract from 1st word of a
mov dx, [ebx+2] ;Second word of b
sbb [eax+2], dx ;Subtract from 2nd word of a
mov dx, [ebx+4] ;Third word of b
sbb [eax+4], dx ;Subtract from 3rd word of a
...
mov dx, [ebx+58] ;Thirtieth word of b
sbb [eax+58], dx ;Subtract from 30th word of a
A more practical solution uses a loop:
mov ecx, 30
xor esi, esi ;This clears the CF, needed for the very first SBB
Again:
mov dx, [ebx+esi]
sbb [eax+esi], dx
lea esi, [esi+2]
loop Again ; loop without clobbering CF.
There are better ways to write fast adc
/ sbb
loops, but the optimal choice varies between microarchitectures. One simple way to reduce overhead from the slow loop
instruction is to unroll a bit.
mov ecx, 15
xor esi, esi ;This clears the CF, needed for the very first SBB
Again:
mov dx, [ebx+esi]
sbb [eax+esi], dx
mov dx, [ebx+esi+2]
sbb [eax+esi+2], dx
lea esi, [esi+4]
loop Again
A next step in optimizing this task is stop using the 16-bit register DX, but rather use the larger EDX register. This would halve the number of instructions in the totally unrolled version or more than halve the number of iterations in the looped version. We can do this because "variables that are 30 words long" can be thought of as being "variables that are 15 doublewords long".
This is the totally unrolled version:
mov edx, [ebx] ;First dword of b
sub [eax], edx ;Subtract from 1st dword of a
mov edx, [ebx+4] ;Second dword of b
sbb [eax+4], edx ;Subtract from 2nd dword of a
mov edx, [ebx+8] ;Third dword of b
sbb [eax+8], edx ;Subtract from 3rd dword of a
...
mov edx, [ebx+56] ;Fifteenth dword of b
sbb [eax+56], edx ;Subtract from 15th dword of a
and the partially unrolled looped version:
mov ecx, 5
clc ;This clears the CF, needed for the very first SBB
Again:
mov edx, [ebx] ; <1>
sbb [eax], edx
mov edx, [ebx+4] ; <2>
sbb [eax+4], edx
mov edx, [ebx+8] ; <3>
sbb [eax+8], edx
lea ebx, [ebx+12]
lea eax, [eax+12]
loop Again
Obviously on x86-64 similarly using RDX would improve this code even further. Note though that 30 words correspond to 7 qwords and 1 dword.