10!
is 10*9*8*7*6*5*4*3*2*1
.
For speed you can minimize the size of intermediate values by doing them in "big * little" pairs, like (1*10) * (2*9) * (3*8) * (4*7) * (5*6)
.
You can describe these pairs as the expression temp = x * (11 - x)
(with values of x
from 1 to 5). If you know the result for x
(e.g. temp = 10
if x == 1
) then the result of the next pair can be derived from the result of the previous pair. E.g.:
temp1 = x * (11 - x)
temp2 = (x+1) * (11 - (x+1))
= (x+1) * (11 - x - 1)
= x * (11 - x - 1) + (11 - x - 1)
= x * (11 - x) - x + (11 - x - 1)
= temp1 - x + (11 - x - 1)
= temp1 - x - x + 11 -1
= temp1 - (x << 1) + 10
In other words; if you know that the result of "pair 1" (or 1*10
) will be 10, then you can calculate the result of the next pair (or 2*9
) by doing (10) - (1<<1) + 10 = 18;
, then calculate the result of the next pair by doing (18) - (2<<1) + 10 = 24
, then...
Note that there is no multiplication involved for calculating the pairs this way. After calculating the intermediate results from pairs (with add/sub alone), you end up with "10! = 10*18*24*28*30
".
For more fun; you can prove that all of these intermediate results will always be even. That means you can cheat and say "10! = 10*18*24*28*30 = 5*9*12*14*15 << (1+1+1+1+1) = 5*9*12*14*15 << 5
"; which is important when you're looking at doing more expensive "too big" multiplications later.
Because these intermediate values fit in 8 bits, we can do "pairs of pairs" just with two 16-bit multiplications. "10! = (5*9) * (12*14) * 15 << 5 = 45 * 168 * 15 << 5
".
By multiplying the smallest intermediate values the result will still fit in 16 bits. "10! = 45 * 168 * 15 << 5 = (45*15) * 168 << 5 = 675 * 168 << 5
".
Now you only need to do one multiplication where the result is 32 bits but both operands are 16 bit. Real mode handles this perfectly fine with a single mul
instruction so that is no problem; it's just that the result will be stored in dx:ax
. Specifically, for 675 * 168 the result in dx:ax
will be 113400 (or 0x1BAF8, or 0x0001 in dx
and 0xBAF8 in ax
).
This gives us "10! = 113400 << 5
". For this it'll be a little awkward on an old "16-bit only" CPU (where you can't just use 32-bit instructions in real mode, including 32-bit mul
). The best way for 8086 is probably using another register for a "right shift then or
" to get the bits from ax
into dx
; like:
mov cl,5
mov bx,ax
shl dx,cl
shl ax,cl
mov cl,16-5
shr bx,cl
or dx,bx
For 80186 and later you can do it slightly better using "shift by immediate" to avoid using cl
, like shl dx,5
, but that isn't supported for 8086.
The result will be 3628800, which is the right answer.
Note that this same approach can be tweaked to handle other factorials - for odd factorials you can ignore the *1
that doesn't actually do anything to ensure the intermediate results from pairs are still even (e.g. "11! = 1 * (2*11) * (3*10) * (4*9) * (5*8) * (6*7) = (2*11) * (3*10) * (4*9) * (5*8) * (6*7)
"). For all factorials (n!
); the intermediate calculation of pairs can be described as temp2 = temp1 - (x << 1) + (n & (~1))
(I think - didn't actually check); and the final shift will always be n/2
rounded down (e.g. for 11!
the shift count will be "(int)11/2 = (int)5.5 = 5
).
With 16-bit multiplication alone it should be good up until 15!. For 16! and larger, the same techniques work to minimize the number of 32-bit multiplications needed.