There are two questions in one here. First, there's the question of double-width input or output, and you're ignoring the one-operand MUL/IMUL forms that do full widening multiplication, including the high half of the result: N * N => 2N bits, doing EDX:EAX = EAX * src. See other answers for why this is useful.
BMI2 even introduced a more flexible full-multiply instruction, MULX, which has three explicit operands (two outputs and one input) and only one implicit operand (second source = EDX).
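To make the widening vs. non-widening distinction concrete, here is a minimal C sketch of what the one-operand MUL form computes for 32-bit operands. The struct and function names are made up for illustration; a compiler will normally pick the instruction itself when it sees the 64-bit cast.

```c
#include <stdint.h>

/* Hypothetical holder for a full 32x32 => 64-bit product, split the way
   one-operand MUL leaves it: high half in EDX, low half in EAX. */
typedef struct {
    uint32_t hi;  /* what MUL would leave in EDX */
    uint32_t lo;  /* what MUL would leave in EAX */
} wide_product;

/* Widening multiply: N * N => 2N bits. */
wide_product widening_mul_u32(uint32_t a, uint32_t b)
{
    uint64_t full = (uint64_t)a * b;  /* widen first so no bits are lost */
    wide_product r = { (uint32_t)(full >> 32), (uint32_t)full };
    return r;
}

/* Non-widening multiply: keeps only the low 32 bits of the product,
   like the two- and three-operand IMUL forms. */
uint32_t truncating_mul_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}
```

MULX computes the same full product; the difference is only in which registers hold the inputs and outputs, and that it leaves FLAGS untouched.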
Second, you give an example of using imul with an immediate operand, another thing that's unavailable for DIV/IDIV.
There is one obscure instruction which is actually an immediate-div, doing 8-bit / imm8 => 8-bit quotient and remainder, rather than 16 / 8 => 8. It's called AAM, and it isn't available in 64-bit mode. Assemblers default to dividing by 10 (for the intended use-case of BCD), but it's the same opcode with any imm8. Here's how to use DIV or AAM to turn a 0-99 integer into two ASCII digits, also pointing out many of the subtle differences between AAM and DIV r/m8.
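As a rough model of what AAM does, the C sketch below mimics its register effects: AH gets AL / imm8 and AL gets AL % imm8. Note this is the opposite placement from DIV r/m8, which puts the quotient in AL and the remainder in AH. The function names are hypothetical, and the model ignores flags.

```c
#include <stdint.h>

/* Models AAM imm8: AH = AL / imm8, AL = AL % imm8. Returns the new AX.
   imm8 == 0 raises #DE on real hardware; this sketch requires imm8 != 0. */
uint16_t aam_model(uint8_t al, uint8_t imm8)
{
    uint8_t quotient  = (uint8_t)(al / imm8);
    uint8_t remainder = (uint8_t)(al % imm8);
    return (uint16_t)((quotient << 8) | remainder);
}

/* Turns a value in 0..99 into two ASCII digits, tens digit first. */
void two_digits(uint8_t value, char out[2])
{
    uint16_t ax = aam_model(value, 10);
    out[0] = (char)('0' + (ax >> 8));    /* AH: tens */
    out[1] = (char)('0' + (ax & 0xFF));  /* AL: ones */
}
```

For example, `two_digits(42, buf)` leaves `'4'` and `'2'` in `buf`.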
Intel could have added immediate versions of DIV/IDIV at any time, but never did. My guess is that DIV / IDIV are slow enough (and rare enough) that the extra overhead of mov reg, imm32 is negligible, and that spending opcode space (and decoder transistors) on such an instruction was never deemed worth it.
More importantly, actual hardware division by a compile-time constant is usually only useful for code-size, not performance. Multiplicative inverses (fixed-point reciprocals) have been well known to compiler writers since the 90s. With compilers not even using division instructions for constant divisors, Intel was extremely unlikely to add an instruction for it in CPUs designed after this technique became known. e.g. clang compiles unsigned int div10(unsigned int a) { return a/10; } to
    mov   ecx, edi          # just to zero-extend to 64-bit
    mov   eax, 3435973837   # a sign-extended imm32 can't represent this constant, I guess. clang uses imul r,r,imm for other cases.
    imul  rax, rcx          # 64-bit multiply instead of 32x32 => 64 in two separate regs
    shr   rax, 35           # extract part of the high-half result.
    ret
It takes a few more instructions for signed division, and sometimes some add/subtract fiddling with the results for less-simple divisors. See some examples on Godbolt. Even so, this is faster than hardware divide instructions, which are very slow, like 22-29 cycles latency for DIV r32 on Haswell, with bad throughput.
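The same multiply-and-shift can be written directly in C, which makes it easy to check against a real division. This is only a sketch of the unsigned divide-by-10 case from the clang output above; the function names are made up, and the sweep is a strided spot check rather than an exhaustive proof.

```c
#include <stdint.h>

/* a / 10 via multiply by a fixed-point reciprocal.
   3435973837 == 0xCCCCCCCD == ceil(2^35 / 10), the constant clang used. */
uint32_t div10_mul(uint32_t a)
{
    uint64_t product = (uint64_t)a * UINT64_C(3435973837);
    return (uint32_t)(product >> 35);  /* keep part of the high half */
}

/* Returns 1 if div10_mul agrees with a / 10 on edge cases and on a
   strided sweep of the whole 32-bit range, else 0. */
int div10_mul_matches(void)
{
    static const uint32_t edge[] = { 0u, 9u, 10u, 19u, 20u, 4294967295u };
    for (unsigned i = 0; i < sizeof edge / sizeof edge[0]; i++) {
        if (div10_mul(edge[i]) != edge[i] / 10u)
            return 0;
    }
    for (uint64_t a = 0; a <= UINT32_MAX; a += 65521u) {  /* prime stride */
        if (div10_mul((uint32_t)a) != (uint32_t)a / 10u)
            return 0;
    }
    return 1;
}
```

In practice you would just write `a / 10` and let the compiler do this; the point is only to show that the constant and shift count really do reproduce the quotient.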
If they were going to spend opcodes (and decoder transistors / power) on more instructions, a two-register form of IDIV with a single-width dividend might be useful for compilers.
I don't know much about how hardware dividers are implemented internally, so IDK if there are savings to be had from only doing N / N => N bit division instead of the usual 2N / N => N. In compiler output, almost all divisions are done after a CDQ or xor edx,edx. Division is variable-latency on many x86 microarchitectures, so if there's any speedup to be had when the dividend is really only N bits, presumably the hardware already looks for that. However, Skylake DIV/IDIV r32 have constant 26c latency (but a 64-bit divisor is much slower and still has very variable latency).
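For reference, here is a small C model of the existing 2N / N => N behaviour of unsigned DIV r/m32, and of the single-width case compilers actually use (high half zeroed, as with xor edx,edx). The names are hypothetical, and the -1 return stands in for the #DE exception that real hardware raises.

```c
#include <stdint.h>

/* Models unsigned DIV r/m32: the dividend is EDX:EAX (64 bits), the
   divisor is 32 bits. On success writes quotient and remainder and
   returns 0. Returns -1 where hardware would raise #DE: a zero divisor,
   or a quotient that doesn't fit in 32 bits. */
int div_u32_model(uint32_t edx, uint32_t eax, uint32_t divisor,
                  uint32_t *quotient, uint32_t *remainder)
{
    if (divisor == 0)
        return -1;
    uint64_t dividend = ((uint64_t)edx << 32) | eax;
    uint64_t q = dividend / divisor;
    if (q > UINT32_MAX)
        return -1;  /* quotient overflow */
    *quotient  = (uint32_t)q;
    *remainder = (uint32_t)(dividend % divisor);
    return 0;
}

/* The usual compiler pattern for unsigned a / b: zero EDX first, so the
   dividend is really only 32 bits and the quotient can never overflow. */
int div_u32_single_width(uint32_t a, uint32_t b,
                         uint32_t *quotient, uint32_t *remainder)
{
    return div_u32_model(0, a, b, quotient, remainder);
}
```

With EDX zeroed, the only remaining fault case is a zero divisor, which is part of why a single-width form looks attractive on paper.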
Presumably a DIV r32, r32 instruction would still produce 2 outputs (quotient and remainder), I guess in the two input registers? So you'd often need extra MOV instructions to save your inputs. Or maybe it would take an immediate to select quotient or remainder to go into one destination, or use two separate opcodes for quotient / remainder?
At this point, they could add a VEX-coded version that works a bit like MULX, with three explicit operands. However, the intended use-case for MULX is allowing extended-precision multiplication to interleave with extended-precision add-with-carry, so a DIVX r64(quotient), r64(remainder), r/m64(divisor) (with implicit dividend in RDX?) would be significantly different (less useful for extended precision). They'd probably still make the implicit dividend be RDX:RAX. Or else maybe they wouldn't even call it DIVX, since that's already a trademark for a video codec / company.