Getting the high part of a 32 bit multiplication on x86_64

Question

The x86_64 mul instruction can take two 32 bit integers and put the high and low 32 bits of the 64 bit result in RDX:RAX. However, gcc & clang can't seem to output that code. Instead, for the following source:

uint32_t mulhi32(uint32_t a, uint32_t b) {
    return (a * (uint64_t) b) >> 32;
}

they output:

    mov     esi, esi
    mov     edi, edi
    imul    rdi, rsi
    shr     rdi, 32
    mov     rax, rdi
    ret

which extends the two inputs to 64 bits, then does a 64*64 bit multiply producing a 128 bit result, then does a shift right on the lower 64 bits, and finally returns that. Madness!

The equivalent for 64 bits:

uint64_t mulhi64(uint64_t a, uint64_t b) {
    return (a * (unsigned __int128) b) >> 64;
}

produces:

    mov     rax, rdi
    mul     rsi
    mov     rax, rdx
    ret

Why don't gcc & clang do the equivalent thing for 32 bits?

It's not just GCC and clang... icc does it too. What that tells me is there is a microarchitectural reason to avoid single operand `mul` if at all possible. A quick look at Agner's tables shows that `MUL` can be up to (depending on CPU) three times more expensive. So the 32bit version should be able to complete in 5 cycles *or less* the 64bit version will require a minimum of 5 cycles due to register dependencies. — Mgetz, Apr 11 '22 at 14:03
Does this answer your question? [How to get single-operand imul from C code](https://stackoverflow.com/questions/36381376/how-to-get-single-operand-imul-from-c-code) — phuclv, Apr 11 '22 at 16:55
Not only GCC, Clang or ICC, MSVC and all other optimizing compilers for x86 do that too, because multi-operand `imul` is much more efficient — phuclv, Apr 11 '22 at 16:56
@phuclv: It's more efficient by a couple uops; having to zero-extend both inputs makes it about break-even, especially if a silly compiler defeats mov-elimination by zero-extending within the same register. (But yeah, not much worse). But after inlining, normally the compiler would know 32-bit values were already zero-extended. (Also, GCC is dumb here: `mov eax, esi` on one input would have saved the final mov after the shift.) — Peter Cordes, Apr 11 '22 at 17:06
@phuclv: Also not fully a duplicate; your answer there is about getting any one-operand form of `mul` or `imul`. This is about *efficiently* getting the high half of a 32-bit multiply, and the assumption that one-operand `mul r32` would be most efficient (3 uops on Intel, 2 on AMD: https://uops.info/table.html). — Peter Cordes, Apr 11 '22 at 17:10
I agree that the answer to the posted question has some of the answer (imul is faster than mul), but it's a different question and needs a different answer, as Peter Cordes pointed out. — Martin C. Martin, Apr 11 '22 at 18:08

Getting the high part of a 32 bit multiplication on x86_64

0 Answers0