Why is the dividend 64 bits in x86 assembly?

Question

Why does the idiv x86 assembly instruction divide EDX:EAX (64 bits) by a given register whereas other mathematical operations, including multiplication, simply operate on single input and output registers?

Multiplication:

mov eax, 3
imul eax, 5

Division:

mov edx, 0
mov eax, 15
mov ebx, 5
idiv ebx

I am aware that EDX is used to store the remainder, but why is there no separate instruction for this behaviour? It just seems inconsistent to me.

score 9 · Accepted Answer · answered Sep 25 '12 at 15:52

9

The instruction set provides the instructions that are necessary to implement arbitrary-width integer arithmetic efficiently. For addition and subtraction, all you need to know for this beyond a fixed width result is whether the operation resulted in a carry (for addition) or a borrow (for subtraction). This is why there is a carry flag. For multiplication, you need to be able to multiply two words and get a double word result. This is why imul produces its result in edx:eax. For division, you need to be able to divide a double-width number and get the quotient and the remainder.

To understand why you need these specific operations, see Knuth's The Art of Computer Programming, Volume 2, which goes into detail on the algorithms for implementing arbitrary-width arithmetic.

As for why there aren't more different forms of multiplication and division instructions in the x86 instruction set, multiplication and division that isn't by a power of two is much more rare than other instructions, so Intel probably didn't want to use up opcodes that could be used for instructions that will be used more frequently. Most multiplications and divisions in general purpose programs are by powers of two; for these you can use bitshifts or the lea instruction instead.

answered Sep 25 '12 at 15:52

Dirk Holsopple

8,731
1
24
37

You have a good point there with the arbitrary-width integer efficiency, this principle is used in half-adders in electronics. However, there's also an imul instruction which takes two registers and multiplies them, whereas this option doesn't exist for idiv. I think Harold explains this well, especially since this imul instruction has a 2-byte opcode which means it's relatively new (I think). – Overv Sep 25 '12 at 16:35
In fact divide a 2n-bit by an n-bit number does not necessarily result in an n-bit number. For example `0x2710 (10000) / 0x02 = 0x1388 (5000)` which will overflow if the result stored as an 8-bit number like the divisor. The reason Intel chose the result of 2n-bit by n-bit division to be n bits maybe just that's simpler – phuclv Dec 22 '13 at 15:49
@Overv: That version of the `IMUL` instruction was added in (I think) the 80186 to assist in address computation. I'm not sure why it was called `IMUL` rather than `MUL`, given that the operand and result size are the same, making it implicitly neither signed nor unsigned. – supercat Feb 09 '14 at 00:50
@supercat: I think they picked IMUL over MUL to get multi-operand version because of `imul r16, r16, sign-extended-imm8`. A sign-extended immediate is probably more useful than zero-extended, and makes more sense with IMUL than MUL. Also, IMUL sets flags in a way that makes sense for signed multiplication. See [this answer](http://stackoverflow.com/a/38254324/224132) where I went into more detail about how they differ for detecting overflow. – Peter Cordes Nov 17 '16 at 04:23
@PeterCordes: The use of a sign-extended 8-bit operand makes the "IMUL" terminology sensible; it does seem unfortunate that Intel didn't split the meanings of carry and overflow, however. – supercat Nov 17 '16 at 14:53
@PeterCordes: I'm curious why the sign-extended form is more useful than a non-extended form. Given an unsigned form, multiplying by -5 could be accomplished, regardless of register choice, by a MUL and a NEG. If one wanted to store AX*251 in BX, one could use an IMUL to compute AX*-5 and then add AL to BH, but that approach would only work with certain combinations of registers. – supercat Nov 17 '16 at 16:12

harold · Answer 2 · 2021-10-07T07:54:19.343

There's a "double-width" multiplication as well (single-operand mul or imul).

If you're asking "why is there no two-operand idiv that only gives the quotient", then I don't really know (I have a theory, but I'm not working for Intel) and I would like that to exist too..

It works out nicely when you want to do a modular multiplication with a modulo that isn't a power of two, you can do a mul and follow it up directly with a div and everything's already in the right place. That's a result, not a reason, and for the reason we'd have to ask Intel.. but here's a theory. Back in the 8086 age, there was only the double-width multiply (and it was the slow kind of iterative multiplication with an early-exit same as you'd do in software). Later some more flexible multiplications were added, but that never happened for division. Perhaps it was not as pressing - after all, divisions are relatively rare, while you often need multiplications by small constants, for example to index arrays of structs.

One form of the multi-operand `imul` is actually new with the 186, another one with the 386. The 286 did not introduce any new forms. — ecm, Oct 07 '21 at 07:49

score 4 · Answer 3 · answered Sep 25 '12 at 19:45

For addition and subtraction your overflow is a single bit handled by the carry flag. If you were to take two arbitrary N bit operands and multiply them you need 2*N bits to store the result, very simple, try it yourself 0xFF * 0xFF = 0xFE01. If you only used N bit sized registers the multiply instruction would be extremely limited. Division is the opposite of multiply divide 2*N bits you get N bits. If you bother with N bits * N bits = 2*N number of bits then you should also implement a 2*N number of bits / N number of bits = N number of bits. That is why it is there, unfortunately though the hardware does more than the languages, the languages should have known and done this as well, if I multiply two bytes the compiler should complain about precision if my result variable is smaller than 16 bits. At the same time any programmer that uses the add, subtract, multiply or divide operations should also be aware of overflow and using those languages use variables that are twice the width of the operands so that they dont overflow...

Often C compilers are smart enough to do the right thing if you write `foo = a * (uint64_t)b` on a 32-bit machine where `a` and `b` are only 32-bit types, and use a single 32-bit full-multiply instead of a full 64 x 64 => 64 bit multiply with the upper halves of both inputs being zero. It is annoying, though, and they do sometimes get this wrong (more often for division, when you can prove that a single 2N / N => N bit division won't overflow (which traps on many machines, including x86), but the compiler still calls the full division library function. e.g. `a*(__uint128_t)b / c` with c>=b). — Peter Cordes, Nov 18 '16 at 03:39

Peter Cordes · Answer 4 · 2021-10-06T19:27:47.440

There are two questions in one here. First, there's the question of double-width input or output, and you're ignoring the one-operand MUL/IMUL forms that do full widening multiplication, including the high half of the result: N * N => 2N bits, doing EDX:EAX = EAX * src. See other answers for why this is useful.

BMI2 even introduced a more flexible full-multiply instruction, MULX, which has three explicit operands (two outputs and one input) and only one implicit operand (second source = EDX).

Second, you give an example of using imul with an immediate operand, another thing that's unavailable for DIV/IDIV.

There is one obscure instruction which is actually an immediate-div, doing 8 bit / imm8 => 8bit quotient/remainder, rather than 16 / 8 => 8. It's called AAM, and isn't available in 64-bit mode. Assemblers default to dividing by 10 (for the intended use-case of BCD), but it's the same opcode with any imm8. Here's how to use DIV or AAM to turn a 0-99 integer into two ASCII digits, also pointing out many of the subtle differences between AAM and DIV r/m8.

Intel could have added immediate versions of IDIV at any time, but never did. My guess is that DIV / IDIV are slow enough (and rare enough) that the extra overhead of mov reg, imm32 is negligible, and that spending opcode space (and decoder transistors) on such an instruction was never deemed worth it.

More importantly, actual hardware division by a compile-time constant is usually only useful for code-size, not performance. Modular multiplicative inverses have been a well-known (by compiler writers) since the 90's. With compilers not even using division by constants, Intel was extremely unlikely to add an instruction for it in CPUs designed after this technique became known. e.g. clang compiles unsigned int div10(unsigned int a) { return a/10; } to

    mov     ecx, edi         # just to zero-extend to 64-bit
    mov     eax, 3435973837  # a sign-extended imm32 can't represent this constant, I guess.  clang uses imul r,r,imm for other cases.
    imul    rax, rcx         # 64-bit multiply instead of 32x32 => 64 in two separate regs
    shr     rax, 35          # extract part of the high-half result.
    ret

It takes a few more instructions for signed division, and sometimes some add/subtract fiddling with the results for less-simple divisors. See some examples on Godbolt. Even so, this is faster than hardware divide instructions, which are very slow, like 22-29 cycles latency for DIV r64 on Haswell, with bad throughput

If they were going to spend opcodes (and decoder transistors / power) on more instructions, a two-register form of IDIV with a single-width dividend might be useful for compilers.

I don't know much about how hardware dividers are implemented internally, so IDK if there are savings to be had from only doing N / N => N bit division instead of the usual 2N / N => N. In compiler output, almost all divisions are done after a CDQ or xor edx,edx. Division is variable-latency on many x86 microarchitectures, so if there's any speedup to be had when the dividend is really only N bits, presumably the hardware already looks for that. However, Skylake DIV/IDIV r32 are constant 26c latency (but 64-bit divisor is much slower and still has very variable latency).

Presumably a DIV r32, r32 instruction would still produce 2 outputs (quotient and remainder), I guess in the two input registers? So you'd often need extra MOV instructions to save your inputs. Or maybe it would take an immediate to select quotient or remainder to go into one destination, or use two separate opcodes for quotient / remainder?

At this point, they could add a VEX-coded version that works a bit like MULX, with three explicit operands. However, the intended use-case for MULX is allowing extended-precision multiplication to interleave with extended-precision add-with-carry, so a DIVX r64(quotient), r64(remainder), r/m64(divisor) (with implicit dividend in RDX?) would be significantly different (less useful for extended precision). They'd probably still make the implicit dividend be RDX:RAX. Or else maybe they wouldn't even call it DIVX, since that's already a trademark for a video codec / company

Why is the dividend 64 bits in x86 assembly?

4 Answers4

Linked