Signed multiplication followed by signed division in YASM (x86_64 architecture)

Question

I am using yasm assembler for x86_64 processor architecture. Suppose I already have three numbers defined in the .data section:

section .data
;CONSTANTS:
SYSTEM_EXIT    equ 60
SUCCESS_EXIT   equ 0

;VARIABLES:
dVar1    dd  40400
wVar2    dw -234
bVar3    db -23
dRes     dd  0    ;quotient
dRem     dd  0    ;reminder

And what I want to do is to multiply signed double-word dVar1 by signed word dVar2 followed by division by signed byte bVar3.

Below I present my "solution" with the referencing this book for why do I do each step. The questions are at the end of text.

dVar1 * wVar2 (signed)

I do not see any explicit rule that multiplication is only applied to numbers of the same size. But see some implicit ones. That's why I use conversion for wVar2:

movsx    eax, word [wVar2]    ;[wVar2] now in eax

Now "they" are of the same size so I just multiply them:

imul    dword [dVar1]    ;edx:eax = eax * [dVar1]

...For example, the result of multiplying ax (16-bits) times a word operand (also 16-bits) provides a double-word (32-bit) result. However, the result is not placed in eax (which might be easier), it is placed in two registers, dx for the upper order result (16-bits) and ax for the lower order result (16-bits), typical written as dx:ax (by convention).

As I understand correctly, result now is in edx:eax.

edx:eax / bVar3 (signed)

...the dividend requires both the D register (for the upper order portion) and A (for the lower order portion)... If a previous multiplication was performed, the D and A registers may already be set correctly (which is my case [OP's note]).

and

...Further, the A, and possibly the D register, must used in combination for the dividend.

Byte Divide: ax for 16-bits

Word Divide: dx:ax for 32-bits

Double-word divide: edx:eax for 64-bits (which is my case [OP's note])

Quadword Divide: rdx:rax for 128-bits

So initially I convert bVar3 to double-word and then just divide it:

movsx    ebx, byte [bVar3]    ;ebx = [bVar3]
idiv     ebx,    ;eax = edx:eax / [bVar3]

The whole code then

section .data
;CONSTANTS:
SYSTEM_EXIT    equ 60
SUCCESS_EXIT   equ 0

;VARIABLES:
dVar1    dd  40400
wVar2    dw -234
bVar3    db -23
dRes     dd  0    ;quotient
dRem     dd  0    ;reminder

section .text

global _start
_start:
    movsx   ebx, byte [bVar3]    ;conversion to double-word
    movsx   eax, word [wVar2]    ;conversion to double-word
    imul    dword [dVar1]        ;edx:eax = eax * [dVar1]
    idiv    ebx                  ;eax = edx:eax / [bVar3], edx = edx:eax % [bVar3]
    mov     dword [dRes], eax
    mov     dword [dRem], edx
last:
    mov     rax, SYSTEM_EXIT
    mov     rdi, SUCCESS_EXIT
    syscall

I use debugger and see the correct answer:

(gdb) x/dw &dRes
0x600159:   411026
(gdb) x/dw &dRem
0x60015d:   -2

But I am not sure about the following things.

Is it really necessary to do those steps I have done? Is it "the least possible number of lines" solution?
Is it correct solution at all? I mean I could have made a mistake or missed something important here.

P.S. Maybe this question is more likely CodeReview SE question. Let me know if you think so too.

Your code faults (with `#DE`, leading to SIGFPE on Linux) if the quotient of `(dVar1*wVar2) / bVar3` doesn't fit in a 32-bit register. You could avoid that by using 64-bit multiply and divide, if that's a concern. — Peter Cordes, Mar 12 '18 at 23:57
And BTW, optimizing asm for the lowest number of instructions (source lines) is generally not useful/important. Either go for code-size (fewest machine-code bytes) or for performance (fewest uops, lowest latency, etc. see http://agner.org/optimize). For example, [`idiv` is microcoded on Intel CPUs](https://stackoverflow.com/questions/26907523/branch-alignment-for-loops-involving-micro-coded-instructions-on-intel-snb-famil), and much slower than any of the other instructions you used on Intel and AMD CPUs. — Peter Cordes, Mar 12 '18 at 23:59

score 1 · Answer 1 · answered Mar 13 '18 at 02:39

Is it "the least possible number of lines" solution?

Your code looks good, and doesn't have any wasted instructions or obvious efficiency (except in your system-call, where mov to 64-bit registers is a waste of code-size).

But do the 2nd movsx after both other loads. Out-of-order execution doesn't analyze dependency chains and do loads on the critical path first. The 2nd movsx load isn't needed until the imul result is ready, so put it after the imul so the first 2 loads (movsx and the imul memory operand) can execute as early as possible and let the imul start.

Optimizing asm for the lowest number of instructions (source lines) is generally not useful/important. Either go for code-size (fewest machine-code bytes) or for performance (fewest uops, lowest latency, etc. see Agner Fog's optimization guide, and other links in the x86 tag wiki). For example, idiv is microcoded on Intel CPUs, and on all CPUs is much slower than any of the other instructions you used.

On architectures with fixed-width instructions, number of instructions is a proxy for code-size, but that's the case on x86 with variable-length instructions.

Anyway, there's no good way to avoid idiv, though, unless the divisor is a compile-time constant: Why does GCC use multiplication by a strange number in implementing integer division?, and 32-bit operand size (with a 64-bit dividend) is the smallest / fastest version you can use. (Unlike most instructions, div is faster with narrower operands).

For code-size, you might want to use one RIP-relative lea rdi, [rel dVar1], and then access your other variables like [rdi + 4], which takes 2 bytes (modr/m + disp8) instead of 5 bytes (modr/m + rel32). i.e. 1 extra byte per memory operand (compared to a register source).

It would make sense to allocate your dword result locations before the word and byte locations, so all the dwords are naturally aligned and you don't have to worry about performance penalties from them being split across a cache line. (Or use align 4 after the db, before the label and dd).

The one danger here is that the 64/32 => 32bit division can overflow and fault (with #DE, leading to SIGFPE on Linux) if the quotient of (dVar1*wVar2) / bVar3 doesn't fit in a 32-bit register. You could avoid that by using 64-bit multiply and divide, the way a compiler would, if that's a concern. But note that 64-bit idiv is about 3x slower than 32-bit idiv on Haswell/Skylake. (http://agner.org/optimize/)

; fully safe version for full range of all inputs (other than divide by 0)
movsx   rcx, byte [bVar3]
movsxd  rax, dword [dVar1]   ; new mnemonic for x86-64 dword -> qword sign extension
imul    rax, rcx             ; rax *= rcx; rdx untouched.

cqo                          ; sign extend rax into rdx:rax
movsx   rcx, word [wVar2]
idiv    rcx

mov     qword [qRes], rax    ; quotient could be up to 32+16 bits
mov     dword [dRem], edx    ; we know the remainder is small, because the divisor was a sign-extended byte

This is obviously larger code-size (more instructions, and more of them have REX prefixes to use 64-bit operand-size), but less obviously it's much slower because 64-bit idiv is slow, as I said earlier.

Using movsxd before a 64-bit imul with 2 explicit operands is better on most CPUs, but on a few CPUs where 64-bit imul is slow (AMD Bulldozer-family, or Intel Atom), you could use

movsx   eax, byte [bVar3]
imul    dword [dVar1]       ; result in edx:eax

shl     rdx, 32
or      rax, rdx            ; result in rax

On modern mainstream CPUs, though, 2-operand imul is faster because it only has to write one register.

Other than instruction-choice:

You put your code in the .data section! Use section .text before _start:, or put your data at the end. (Unlike C, you can reference symbol earlier in the source than they're declared, both labels and equ constants. Only assembler macros (%define foo bar) are applied in order).

Also, your source data could go in section .rodata, and your outputs could go in the BSS. (Or leave them in registers unless your assignment requires memory; nothing is using them.)

Use RIP-relative addressing instead of 32-bit absolute: The default rel directive isn't the default, but RIP-relative is 1 byte shorter to encode than [abs dVar1]. (32-bit absolute works in 64-bit executables, but not in 64-bit position-independent executables).

If it's ok for div to fault if the quotient doesn't fit in 32 bits (like your existing code), this version has all the fixes I suggested:

default rel      ; RIP-relative addressing is more efficient, but not the default

;; section .text  ; already the default section
global _start
_start:
    movsx   eax, word [wVar2]
    imul    dword [dVar1]        ;edx:eax = eax * [dVar1]

    movsx   ecx, byte [bVar3]
    idiv    ecx                  ;eax = edx:eax / [bVar3], edx = edx:eax % [bVar3]

 ; leaving the result in registers is as good as memory, IMO
 ; but presumably your assignment said to store to memory.
    mov     dword [dRes], eax
    mov     dword [dRem], edx

.last:                           ; local label, or don't use a label at all
    mov     eax, SYS_exit
    xor     edi, edi             ; rdi = SUCCESS_EXIT.  don't use mov reg,0
    syscall                      ; sys_exit(0), 64-bit ABI.


section .bss
dRes:     resd 1
dRem:     resd 1

section .rodata
dVar1:    dd  40400
wVar2:    dw -234
bVar3:    db -23

; doesn't matter what part of the file you put these in.
; use the same names as asm/unistd.h, SYS_xxx
SYS_exit       equ 60
SUCCESS_EXIT   equ 0

Style: Use : after labels, even when it's optional. dVar1: dd 40400 instead of dVar1 dd 40400. This is a good habit in case you accidentally use a label name that matches an instruction mnemonic. Like enter dd 40400 might give a confusing error message, but enter: dd 40400 would just work.

Don't use mov to set a register to zero, use xor same,same because it's smaller and faster. And don't mov to a 64-bit register when you know your constant is small, let implicit zero-extension do its job. (YASM doesn't optimize mov rax,60 to mov eax,60 for you, although NASM does). Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?.

I do not see any explicit rule that multiplication is only applied to numbers of the same size. But see some implicit ones.

(Almost) all x86 instructions require the same size for all their operands. The exceptions include movzx / movsx, shr reg, cl (and other variable count shifts/rotates), and stuff like movd xmm0, eax that copy data between different sets of registers. Also imm8 control operands like pshufd xmm0, xmm1, 0xFF.

Instructions that normally work with inputs/outputs of the same size don't have versions that widen one of the inputs on the fly.

You can see that imul does explicitly document what sizes it works with in Intel's instruction-set reference manual entry for it. The only form that does 32x32 => 64 bit result is IMUL r/m32, i.e. the explicit operand has to be a 32-bit register or memory, to go with the implicit eax source operand.

So yes, movsx loads from memory are by far the best way to implement this. (But not the only; cbw / cwde also work to sign extend within EAX, and there's always shl eax, 16 / sar eax,16 to sign extend from 16 to 32. These are much worse than a movsx-load.)

Signed multiplication followed by signed division in YASM (x86_64 architecture)

dVar1 * wVar2 (signed)

edx:eax / bVar3 (signed)

1 Answers1

Other than instruction-choice: