0

Write an assembly language procedure to multiply two integers by doubling and halving. Here is pseudocode to multiply A and B using this approach.

Multiply A and B and store result in C:
C = 0 ; Initialize to 0
while B is not equal to 0:
if B is odd then C = C+A ; Add copy of A (even number left)
A = 2*A ; Can be done quickly by shifting left
B = B/2 ; Can be done quickly by shifting right

I have a lot done already, how would I use the shl to test for an odd integer?

David Wohlferd
  • 7,110
  • 2
  • 29
  • 56
jeff
  • 9
  • 1

2 Answers2

1

it can use the 'shr' instruction;

mov al, 0x10010010b
shr al, 1      ;CF = 0   even

mov al, 0x10010011b
shr al, 1      ;CF = 1   odd

jnb __lable    ;jump if CF = 0
or
jb  __lable    ;jump if CF = 1
wang.c
  • 11
  • 1
  • 1
    Normally you'd use the `jc` and `jnc` mnemonics (aliases for jb / jnb) when CF doesn't come from a compare result. It's easier to remember. – Peter Cordes Oct 22 '18 at 07:26
0

for x86_64 you can use the shr for both the oddness and zero test...

Like for example this:

To build and test in debugger I used commands (on 64b linux):

nasm -f elf64 m.asm -l m.lst -w+all; ld -b elf64-x86-64 -o m m.o
edb --run ./m

source:

    segment .text

mulByBits:
; input: 2x uint32_t in edi, esi (System V AMD64 ABI calling convention)
; output: uint64_t in rax
; modifies also: rcx
    xor     eax, eax            ; rax = 0 "C"
    mov     edi, edi            ; clear upper 32b of input "A" (extend to 64b)
.mulLoop:
    lea     rcx, [rax + rdi]    ; rcx = C + A (new C in case B is odd)
    add     rdi, rdi            ; A *= 2 (for next loop)
    shr     esi, 1              ; B >>= 1 (sets ZF and CF)
    cmovc   rax, rcx            ; if B was odd, update the sum to new C
    jnz     .mulLoop            ; repeat until B is zero
    ret

global _start
_start:     ; run some hardcoded simple tests, verify in debugger
    mov     edi, 143254         ; "normal" values test
    mov     esi, 43526
    call    mulByBits
    mov     rbx, 6235273604     ; expected result, compare with rax

    mov     edi, 0
    mov     esi, 0
    call    mulByBits
    mov     rbx, 0

    mov     edi, 43257432
    mov     esi, 0
    call    mulByBits
    mov     rbx, 0

    mov     edi, 0
    mov     esi, 432543
    call    mulByBits
    mov     rbx, 0

    mov     edi, 3276547234
    mov     esi, 1
    call    mulByBits
    mov     rbx, 3276547234

    mov     edi, 1
    mov     esi, 3276547234
    call    mulByBits
    mov     rbx, 3276547234

    mov     edi, ~0             ; UINT_MAX * UINT_MAX
    mov     esi, ~0
    call    mulByBits
    mov     rbx, 0xFFFFFFFE00000001

    mov     rdi, 0xE00000004    ; garbage in upper 32 bits of inputs
    mov     rsi, 0xE00000004    ; should be multiplied as 4*4
    call    mulByBits
    mov     rbx, 0x10

    ; exit back to linux
    mov     eax, 60
    xor     edi, edi
    syscall

Adding is the most efficient way to left shift by 1 on most CPUs; add same,same can run on more execution ports than shl rdi,1 (https://agner.org/optimize), allowing more instruction-level parallelism for potentially better throughput of the loop.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Ped7g
  • 16,236
  • 3
  • 26
  • 63
  • `add rdi,rdi` can run on more ports than `shl rdi,1`, and gives the same integer result and most of the same flag results. (Similarly for `adc same,same` vs. `rcl`-by-1). Unfortunately modern Intel CPUs don't decode/run the left-shift-by-1 short-form opcode to an add-like uop that runs on any ALU port. – Peter Cordes Oct 22 '18 at 07:32
  • @PeterCordes indeed, thank you, modified the source, as this was supposed to be rather show-off of the "correct" x86_64, than following the assignment literally. (although the `mul` itself does exist and is faster than this bit banging, so it's still just exercise for the sake of exercise) – Ped7g Oct 22 '18 at 07:38
  • add same,same *is* a left shift, and `shl` by 1 isn't exactly slow; but most CPUs have at most 2 integer shift ports. Anyway, looks good, I'll upvote tomorrow. Already used up my votes for today. – Peter Cordes Oct 22 '18 at 07:38
  • BTW, `mov edx, edi` has 1 cycle lower latency than `mov edi,edi` on Intel CPUs with mov-elimination. ([Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342)). But I didn't think that was worth rewriting the register allocation for your function. – Peter Cordes Oct 22 '18 at 07:47
  • @PeterCordes the `nop`s make it visually more clear in the debugger (as I don't use source view, just disassembly), but no problem with removing them. The -1 cycle .. nah, not worth further editing, your comment is enough, thank you for edit. – Ped7g Oct 22 '18 at 07:49
  • @PeterCordes one more idea was nagging me since start... would it be worth to sort `B < A` at the beginning? That would probably boost cases like `1 * 4276543` a bit, but not sure if the penalty for compare with optional swap (or probably rather two variants of code?) will not hurt most of the use cases, if one would expect the values to be more often close to each other in terms of log2 ... – Ped7g Oct 22 '18 at 08:06
  • 1
    Yeah, the worth depends on the data distribution. A branch-miss before any work gets done means you're back at the start of the dep chains, while the mispredict for leaving the loop after ~log2(n) iterations isn't a big deal if later instructions mostly depend on the multiply result anyway. (That single-cycle `shr` dep chain can run ahead of the rest of the loop.) But if there's often a big different in highest set bit, you could save a lot of iterations. Looping over only the set bits using variable-count shifts and `tzcnt` is also possible, and maybe a win with few set bits. – Peter Cordes Oct 22 '18 at 08:40