Multiply two 16 bit numbers and store 32 bit answer in dx:ax without mul instruction in assembly 8086

Question

I am trying to multiply two 16 bits numbers using the shift and add methods in assembly language and store the hi part in the dx register and the low part in the ax register. The multiplicand and multiplier are passed on the stack For some of my tests, I can get the right answer, but for some, the part that holds the higher part,dx, is wrong. For example if I do 0002 times 0001 I get in the my answer, dx = 0002 ax = 0002, when the answer should be dx = 0000 ax = 0002.

Here is my code. I can't seem to know where my code is going wrong. I even did this example by hand and don't see how the dx = 0002 part gets there.

;---------------------------------------
; Multiply data
;---------------------------------------

h         dw        0                   ; this holds the high order bits

mltplier  dw        0                   ; this holds the mulitplier

     .code
;---------------------------------------
; Multiply code
;---------------------------------------
_multiply:                             ;
     push      bp                  ; save bp
     mov       bp,sp               ; anchor bp into the stack
     mov       bx,[bp+4]           ; load multiplicand from the stack
     mov       cx,[bp+6]           ; load multiplier   from the stack
     mov       [mltplier],cx       ;
     mov       cx,0Fh              ; make counter of 16
     mov       ax,0                ;
     mov       dx,0                ;

;  calculate multiplicand * multiplier
;  return result in dx:ax
_loop:
     shr       [mltplier],1        ; shift right by 1
     jnc       shift               ; if the number shifted out was not a 1           
                                   ;then we don't need to add anything
     clc                           ;clear carry flag
     add       ax,bx               ; add bx to ax, the low bits
     add       dx,[h]              ; add var to dx, the high bits
shift:                                 ;
     shl       [h],1               ; shift the high order bits left
     shl       bx,1                ; shift the low order bits left
     adc       [h],0               ; add to the high bits
     clc                           ;clear carry flag
    loop       _loop               ; loop the process
     pop       bp                  ; restore bp
     ret                           ; return with result in dx:ax
                                   ;
     end                           ; end source code
;---------------------------------------

Do you have an implementation running on a computer? Or only "on paper"? If you have a running implementation, using a debugger to step through your program will help tremendously. — WhiteViking, Oct 24 '15 at 16:29
even with 16 I would get dx = 0004. And yes I did run this code on a computer and is how I found out I got different answers for my test cases — mukhtar23, Oct 24 '15 at 16:44
Doen't every MCU book contain examples for this sort of thing? — Weather Vane, Oct 24 '15 at 16:44
You are shifting the multiplier right and testing its l.s. bit. You need to shift it left and test its m.s. bit, because you are shifting the product left. You also need to shift the product *before* you do the addition. — Weather Vane, Oct 24 '15 at 16:47
Also why are you not using the registers si and di? And for the shift left you can use the shld instruction. — Vlad Krasnov, Oct 24 '15 at 17:25
@VladKrasnov: `shld` was introduced with 386, and I think the OP is targeting 8086. I noticed `si/di` going unused, too, though. See my answer — Peter Cordes, Oct 24 '15 at 17:37

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

WeatherVane's comment might have the solution to the wrong answers.

Some notes on efficiency:

zero a register by XORing with itself. It takes fewer instruction bytes than mov r, 0, and is better in every way. (Prefer XOR over sub same,same or other options, because more CPUs recognize xor same,same as independent of the old value.)
You don't need clc after a jnc. That clc is only reachable when carry is already cleared. The clc before the loop instruction is also useless, because you run other instructions that set or clear CF before the next adc.
Keeping variables in memory is slow. Instead of shr [mltplier],1, keep mltplier in si or di. (push/pop to save/restore a register once for the whole function call is worth it if you can use a register in a loop instead of a memory location). Similarly, keep [h] in a register, too.

If you need to spill to memory, in general prefer the stack, not global variables, so your function is re-entrant and thread-safe. Esp. for mltplier, you could have just used the value the caller put on the stack, instead of copying it.
loop is slow on modern x86 CPUs, compared to dec cx / jne. e.g. about 7 times the loop overhead on Haswell. You could save a register and speed up your loop by looping on multiplier != 0, instead of always doing 16 trips through the loop. Then you could keep mult in cx and loop with test cx, cx / jne.
With h in a reg (di for example):

 shl       [h],1               ; shift the high order bits left
 shl       bx,1                ; shift the low order bits left
 adc       [h],0               ; add to the high bits

could be:

 shl       bx, 1               ; shift the low order bits left
 adc       di, di              ; shift the high order bits left and add the carry

If you're targeting a 386 CPU, shld two-register shift would also work, with the advantage that both instructions could run in parallel, instead of one depending on the other:

 shld      di, bx, 1
 shl       bx, 1

shld r,r,i is cheaper than adc on Intel Sandybridge-family CPUs. (1 uop vs. 2)

See Agner Fog's instruction tables and guides, and other links in the x86 tag wiki.

score 2 · Answer 2 · answered Oct 24 '15 at 19:02

This shows how to multiply two 16-bit values to get a 32-bit value (in two 16-bit registers).

#include <stdio.h>

unsigned multiply16x16(unsigned short m, unsigned short n) {
    __asm {
        xor     ax,ax       ; clear the product
        xor     dx,dx
        mov     cx,16       ; set up loop counter
    nextbit:
        shl     ax,1        ; shift 32-bit product left
        adc     dx,dx
        shl     [m],1       ; get m.s. bit of multiplier
        jnc     noadd       ; ignore if not set
        add     ax,[n]      ; add multiplicand to product
        adc     dx,0        ; with carry
    noadd:
        loop    nextbit     ; loop counter stops when cx  0
        mov     [m],ax      ; store in 16-bit operands
        mov     [n],dx
    }
    return (n << 16) + m;   // align and return as 32-bit unsigned
}

int main(void){
    unsigned short m, n;

    m=3; n=5;
    printf("%u\n", multiply16x16 (m,n));

    m=65535; n=2;
    printf("%u\n", multiply16x16 (m,n));

    m=987; n=654;
    printf("%u\n", multiply16x16 (m,n));

    m=123; n=456;
    printf("%u\n", multiply16x16 (m,n));

    m=65535; n=65535;
    printf("%u\n", multiply16x16 (m,n));

    return 0;
}

Program output:

score 1 · Answer 3 · answered Oct 24 '15 at 19:54

A more efficient and entertaining approach is to subvert the exercise by synthesizing the proscribed unsigned multiplication (mul) out of a signed multiplication (imul).

Flipping the MSB of an unsigned integer, equivalent to subtracting 8000h modulo 2^16, maps the value into the range of signed integers without underflow. Thus allowing (a-8000h)*(b-8000h) to be computed, and adding back a*8000h + b*8000h - 4000000h yields a*b

multiply:
    push bp
    mov bp,sp
    mov ax,[bp+4]
    xor ax,8000h
    mov dx,[bp+6]
    xor dx,8000h
    imul dx
    sub dx,4000h
    mov cx,[bp+4]
    add cx,[bp+6]
    rcr cx,1    ;Recovery lost carry while dividing
    jnc @f      ;by two and adding back the a+b term
    add ax,8000h
@@: adc dx,cx
    pop bp
    ret

(For the record this is more of a comment posted in the form of an answer due to its length.)

Multiply two 16 bit numbers and store 32 bit answer in dx:ax without mul instruction in assembly 8086

3 Answers3