Is it "the least possible number of lines" solution?
Your code looks good, and doesn't have any wasted instructions or obvious inefficiencies (except in your system call, where mov to 64-bit registers is a waste of code-size).
But do the 2nd movsx after both other loads. Out-of-order execution doesn't analyze dependency chains and do loads on the critical path first. The 2nd movsx load isn't needed until the imul result is ready, so put it after the imul so the first 2 loads (movsx and the imul memory operand) can execute as early as possible and let the imul start.
Optimizing asm for the lowest number of instructions (source lines) is generally not useful/important. Either go for code-size (fewest machine-code bytes) or for performance (fewest uops, lowest latency, etc.; see Agner Fog's optimization guide, and other links in the x86 tag wiki). For example, idiv is microcoded on Intel CPUs, and on all CPUs is much slower than any of the other instructions you used.
On architectures with fixed-width instructions, number of instructions is a proxy for code-size, but that's not the case on x86 with variable-length instructions.
There's no good way to avoid idiv, though, unless the divisor is a compile-time constant (see: Why does GCC use multiplication by a strange number in implementing integer division?), and 32-bit operand size (with a 64-bit dividend) is the smallest / fastest version you can use. (Unlike most instructions, div is faster with narrower operands.)
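As a rough illustration of what that linked question is about, here is a minimal C sketch of dividing by a compile-time constant without any div instruction. The divisor 10, the function name div10 and the unsigned 32-bit type are assumptions picked for illustration, not something from your code; signed division needs extra fix-up steps that this sketch leaves out.

```c
#include <stdint.h>

/* Hypothetical example: unsigned x / 10 without a div instruction.
 * 0xCCCCCCCD is ceil(2^35 / 10), so the top bits of the 64-bit product
 * hold the exact quotient for every 32-bit x. */
uint32_t div10(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;  /* 32x32 => 64-bit multiply */
    return (uint32_t)(product >> 35);              /* keep only the quotient bits */
}
```

This only works because the divisor is known ahead of time; with a runtime divisor like bVar3 there is no constant to precompute, which is why idiv stays.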
For code-size, you might want to use one RIP-relative lea rdi, [rel dVar1], and then access your other variables like [rdi + 4], which takes 2 bytes (modr/m + disp8) instead of 5 bytes (modr/m + rel32), i.e. 1 extra byte per memory operand (compared to a register source).
It would make sense to allocate your dword result locations before the word and byte locations, so all the dwords are naturally aligned and you don't have to worry about performance penalties from them being split across a cache line. (Or use align 4 after the db, before the label and dd.)
The one danger here is that the 64/32 => 32-bit division can overflow and fault (with #DE, leading to SIGFPE on Linux) if the quotient of (dVar1*wVar2) / bVar3 doesn't fit in a 32-bit register. You could avoid that by using 64-bit multiply and divide, the way a compiler would, if that's a concern. But note that 64-bit idiv is about 3x slower than 32-bit idiv on Haswell/Skylake. (http://agner.org/optimize/)
; fully safe version for full range of all inputs (other than divide by 0)
movsx rcx, word [wVar2] ; multiplier: word -> qword sign extension
movsxd rax, dword [dVar1] ; new mnemonic for x86-64 dword -> qword sign extension
imul rax, rcx ; rax *= rcx; rdx untouched.
cqo ; sign extend rax into rdx:rax
movsx rcx, byte [bVar3] ; divisor: byte -> qword sign extension
idiv rcx
mov qword [qRes], rax ; quotient could be up to 32+16 bits
mov dword [dRem], edx ; we know the remainder is small, because the divisor was a sign-extended byte
This is obviously larger code-size (more instructions, and more of them have REX prefixes to use 64-bit operand-size), but less obviously it's much slower because 64-bit idiv is slow, as I said earlier.
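To make the overflow condition concrete, here is a minimal C sketch that models (dVar1*wVar2) / bVar3 in 64-bit arithmetic and reports whether the 32-bit idiv would have faulted. The names DivResult, muldiv_full_range and idiv32_would_fault are hypothetical helpers for illustration; only the operand widths come from your code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical result type: same widths as qRes / dRem in the asm above. */
typedef struct {
    int64_t quotient;
    int32_t remainder;
} DivResult;

/* (d * w) / b in 64-bit arithmetic. The caller must ensure b != 0. */
DivResult muldiv_full_range(int32_t d, int16_t w, int8_t b)
{
    int64_t product = (int64_t)d * w;      /* magnitude at most 2^46: can't overflow */
    DivResult r;
    r.quotient  = product / b;             /* C division truncates toward zero, like idiv */
    r.remainder = (int32_t)(product % b);  /* |remainder| < |b| <= 128 */
    return r;
}

/* True if a 64/32 => 32-bit idiv would raise #DE for these inputs:
 * either a zero divisor or a quotient outside the int32_t range. */
bool idiv32_would_fault(int32_t d, int16_t w, int8_t b)
{
    if (b == 0)
        return true;
    int64_t q = ((int64_t)d * w) / b;
    return q < INT32_MIN || q > INT32_MAX;
}
```

With your values (40400, -234, -23) the quotient is 411026 and the remainder is -2, so the 32-bit idiv is safe there. Something like INT32_MAX * 32767 / 1 is the kind of input that would fault.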
Using movsxd before a 64-bit imul with 2 explicit operands is better on most CPUs, but on a few CPUs where 64-bit imul is slow (AMD Bulldozer-family, or Intel Atom), you could use
movsx eax, word [wVar2] ; the multiplier is wVar2; bVar3 is the divisor
imul dword [dVar1] ; result in edx:eax
shl rdx, 32
or rax, rdx ; result in rax
On modern mainstream CPUs, though, 2-operand imul is faster because it only has to write one register.
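If the shl / or step looks odd, this minimal C sketch models it: hi plays the role of EDX and lo the role of EAX after the one-operand imul. The function name combine_halves is hypothetical, and the final conversion to a signed type assumes a two's complement target (true for x86-64 compilers such as GCC and clang).

```c
#include <stdint.h>

/* Model of shl rdx,32 / or rax,rdx. Writing eax zero-extends into rax,
 * so the low half contributes no stray upper bits to the OR. */
int64_t combine_halves(uint32_t hi, uint32_t lo)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;  /* hi:lo as one 64-bit value */
    return (int64_t)wide;                       /* assumes two's complement conversion */
}
```

For the product 40400 * -234 = -9453600, EDX is all-ones (the sign extension) and EAX holds the low 32 bits, so the combined value is -9453600 again.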
Other than instruction-choice:
You put your code in the .data section! Use section .text before _start:, or put your data at the end. (Unlike C, you can reference symbols earlier in the source than they're declared, both labels and equ constants. Only assembler macros (%define foo bar) are applied in order.)
Also, your source data could go in section .rodata, and your outputs could go in the BSS. (Or leave them in registers unless your assignment requires memory; nothing is using them.)
Use RIP-relative addressing instead of 32-bit absolute: NASM doesn't do this unless you ask for it with the default rel directive, but RIP-relative is 1 byte shorter to encode than [abs dVar1]. (32-bit absolute works in 64-bit executables, but not in 64-bit position-independent executables.)
If it's ok for idiv to fault if the quotient doesn't fit in 32 bits (like your existing code), this version has all the fixes I suggested:
default rel ; RIP-relative addressing is more efficient, but not the default
;; section .text ; already the default section
global _start
_start:
movsx eax, word [wVar2]
imul dword [dVar1] ;edx:eax = eax * [dVar1]
movsx ecx, byte [bVar3]
idiv ecx ;eax = edx:eax / [bVar3], edx = edx:eax % [bVar3]
; leaving the result in registers is as good as memory, IMO
; but presumably your assignment said to store to memory.
mov dword [dRes], eax
mov dword [dRem], edx
.last: ; local label, or don't use a label at all
mov eax, SYS_exit
xor edi, edi ; rdi = SUCCESS_EXIT. don't use mov reg,0
syscall ; sys_exit(0), 64-bit ABI.
section .bss
dRes: resd 1
dRem: resd 1
section .rodata
dVar1: dd 40400
wVar2: dw -234
bVar3: db -23
; doesn't matter what part of the file you put these in.
; use the same names as asm/unistd.h, SYS_xxx
SYS_exit equ 60
SUCCESS_EXIT equ 0
Style: Use : after labels, even when it's optional: dVar1: dd 40400 instead of dVar1 dd 40400. This is a good habit in case you accidentally use a label name that matches an instruction mnemonic. For example, enter dd 40400 might give a confusing error message, but enter: dd 40400 would just work.
Don't use mov to set a register to zero; use xor same,same because it's smaller and faster. And don't mov to a 64-bit register when you know your constant is small; let implicit zero-extension do its job. (YASM doesn't optimize mov rax,60 to mov eax,60 for you, although NASM does.) See: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?
I do not see any explicit rule that multiplication is only applied to numbers of the same size. But see some implicit ones.
(Almost) all x86 instructions require the same size for all their operands. The exceptions include movzx / movsx, shr reg, cl (and other variable-count shifts/rotates), and stuff like movd xmm0, eax that copy data between different sets of registers. Also imm8 control operands like pshufd xmm0, xmm1, 0xFF.
Instructions that normally work with inputs/outputs of the same size don't have versions that widen one of the inputs on the fly.
You can see that imul does explicitly document what sizes it works with in Intel's instruction-set reference manual entry for it. The only form that does 32x32 => 64 bit result is IMUL r/m32, i.e. the explicit operand has to be a 32-bit register or memory, to go with the implicit eax source operand.
So yes, movsx loads from memory are by far the best way to implement this. (But not the only way; cbw / cwde also work to sign extend within EAX, and there's always shl eax, 16 / sar eax, 16 to sign extend from 16 to 32. These are much worse than a movsx-load.)
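For reference, here is a minimal C sketch of what those two approaches compute. The function names sext16_cast and sext16_shift are hypothetical, and the code assumes a two's complement target where right-shifting a negative int32_t is an arithmetic shift (the behavior of GCC and clang on x86-64; the C standard leaves it implementation-defined).

```c
#include <stdint.h>

/* What a movsx load does: widen a 16-bit value, copying its sign bit. */
int32_t sext16_cast(uint16_t bits)
{
    return (int16_t)bits;   /* reinterpret as signed, then widen to 32 bits */
}

/* Model of shl eax,16 / sar eax,16. */
int32_t sext16_shift(uint16_t bits)
{
    uint32_t shifted = (uint32_t)bits << 16;  /* shl: move bit 15 up to bit 31 */
    return (int32_t)shifted >> 16;            /* sar: shift back, copying the sign bit down */
}
```

Both return -234 for the bit pattern 0xFF16, which is how wVar2 is stored; the movsx load just gets there in one instruction instead of a load plus two shifts.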