x86 MUL operation at hardware level

Question

I understand that the x86 instruction for integer multiplication of two numbers (e.g. 64-bit) is MUL.

My question is: how is this operation generally implemented at the hardware level (for instance, on a modern Intel processor)? Also, is it executed in a single CPU cycle?

Hello! This doesn't appear to be a programming / software development question. Questions about the electrical design / implementation of CPUs will likely be closed as off-topic on this site, but may be suitable for https://electronics.stackexchange.com — Brian61354270, Mar 01 '23 at 16:15
The latency (of the `rax` result) is generally 3 cycles on high performance x86 cores or 5 on the low power x86 cores (Intel Atom and the E-cores in hybrid architectures), source: https://uops.info/table.html So not single-cycle, but fast enough that it has to involve a proper high-performance multiplier, not just a shift-and-add loop. — harold, Mar 01 '23 at 20:39
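For reference, the shift-and-add loop mentioned in the comment above can be sketched in C. This is only an illustration of the textbook algorithm, not a description of any real core; the function name `shift_add_mul` is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: multiply two 64-bit values
   with the textbook shift-and-add loop, keeping the low 64 bits of the
   product (the half that MUL leaves in rax). */
uint64_t shift_add_mul(uint64_t a, uint64_t b)
{
    uint64_t product = 0;
    while (b != 0) {
        if (b & 1)      /* lowest multiplier bit set: add the shifted multiplicand */
            product += a;
        a <<= 1;        /* the next partial product is worth twice as much */
        b >>= 1;        /* move on to the next multiplier bit */
    }
    return product;
}
```

The loop needs up to 64 iterations, each with an add that depends on the previous one. A 3 to 5 cycle latency leaves no room for that, so a fast multiplier has to form the partial products in parallel and sum them with a tree of adders instead.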
Normally you'd use `imul reg,reg`, unless you actually need the high half of the result. It's single-uop with 3 cycle latency on modern Intel and AMD. (Or more on E cores). But it is fully pipelined, able to start an independent multiply every clock cycle, on Intel for a while and AMD since Zen 1. (Bulldozer's multipliers weren't fully pipelined, with 64-bit operand-size being extra slow.) See https://uops.info/ and https://agner.org/optimize/ — Peter Cordes, Mar 02 '23 at 02:22
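The difference between needing the high half and not needing it can be seen from C. The sketch below assumes GCC or Clang on x86-64, since `unsigned __int128` is a compiler extension rather than ISO C; `mul_full`, `mul_low` and `u128_parts` are names invented for the example. With optimization enabled, such compilers would typically use a widening `mul` (or `mulx`) for the first function and a plain `imul reg,reg` for the second, but that is worth checking in the generated assembly rather than taking for granted.

```c
#include <stdint.h>

/* Hypothetical result type: the two 64-bit halves of a 128-bit product. */
typedef struct {
    uint64_t lo;  /* low half  (rax after a one-operand MUL) */
    uint64_t hi;  /* high half (rdx after a one-operand MUL) */
} u128_parts;

/* Full 64x64 -> 128-bit unsigned product.
   Assumes the GCC/Clang extension unsigned __int128. */
u128_parts mul_full(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    u128_parts r = { (uint64_t)p, (uint64_t)(p >> 64) };
    return r;
}

/* Low 64 bits only: the high half is never computed or stored. */
uint64_t mul_low(uint64_t a, uint64_t b)
{
    return a * b;
}
```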
@Brian hello, instruction latencies and the other scheduling details are absolutely programming-related questions. Still, there are so many x86 cores out there that the question cannot have a meaningful answer. — SK-logic, Mar 03 '23 at 09:15
And, just for comparison, a single-cycle 32-bit integer multiplication is indeed possible and many ARM cores implement it, using a Dadda multiplier with large lookup tables. — SK-logic, Mar 03 '23 at 09:18

