In various assembly languages, MUL (x86) / MULT (MIPS) performs multiplication, but it's a black box for the programmer. I'm interested in how a CPU actually accomplishes a multiplication, regardless of the architecture. Let's say I have two 16-bit values in my registers and I am the CPU, so I have to implement MUL using the other bit-fiddling instructions I have (and, or, xor, not, shl, shr, etc.). What should I do?
2 Answers
http://en.wikipedia.org/wiki/Multiplication_ALU on Wikipedia lists different methods for doing multiplication in a digital circuit.
When I worked on a college project adding SIMD instructions to a DEC Alpha-like processor in Verilog, we implemented a Wallace tree multiplier, primarily because it runs in a fixed number of cycles and is easy to pipeline.
Apparently, Dadda multipliers are (nearly?) universally used in real-life CPU ALUs, including modern x86. Like Wallace multipliers, they can be pipelined with fixed latency.
EDIT: You mentioned using the other bit-fiddling instructions. On modern processors, multiplication would not be microcoded like this; it'd be way too slow, and the processor would get slaughtered in benchmarks.
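That said, if you did want to implement it in software with those instructions, the classic approach is shift-and-add: walk the multiplier one bit at a time, and whenever the bit is set, add a correspondingly shifted copy of the multiplicand. A minimal sketch (the `mul16` name is mine; it assumes an `add` is available alongside the shift/mask operations, as it would be in any real instruction set):

```python
def mul16(a, b):
    """Multiply two 16-bit unsigned values using only shifts, masks,
    and adds -- the classic shift-and-add loop."""
    a &= 0xFFFF
    b &= 0xFFFF
    product = 0
    while b:
        if b & 1:            # low bit of the multiplier set?
            product += a     # add the (shifted) multiplicand
        a <<= 1              # next partial product has twice the weight
        b >>= 1              # consume one multiplier bit
    return product           # 16x16 multiply yields a 32-bit result

print(mul16(1234, 5678))     # 7006652
```

This takes up to 16 iterations for 16-bit operands, which is exactly why hardware uses parallel partial-product reduction (Wallace/Dadda trees) instead of a sequential loop.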


- I thought CPUs don't call their own instructions, for efficiency reasons. I just had no other way to express myself, since the lowest level I've ever worked at is asm. Thanks for the help! – George Mar 28 '09 at 03:08
- Sometimes they do. x86 is a complicated ISA and has some very strange instructions, which are translated into an internal micro-code program. Look at http://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg; you'll see a complex decode unit and a micro-code sequencer, which does this. – Michael Mar 28 '09 at 03:14
- It's even worse than that on modern CPUs: given out-of-order execution, branch prediction, hyperthreading, etc., along with microcode, it's almost fair to say that the x86 ISA runs in a virtual machine that's implemented in microcode and circuits. But it's almost never necessary to worry about it... – Jeff Shannon Apr 03 '09 at 05:08
- @JeffShannon: `imul eax, ecx` decodes to a single uop on typical x86 CPUs as far back as Pentium Pro; there is a single fully-pipelined execution unit for scalar integer multiplies. (On some AMD CPUs before Ryzen, it wasn't fully pipelined, e.g. Bulldozer-family can start one `imul` every other clock cycle.) https://agner.org/optimize/. SIMD integer / FP multipliers are also single uop for most vector multiply operations. But Intel does microcode their integer division instructions. There's still a hardware divider unit, but it takes multiple uops to do a `div`. – Peter Cordes Oct 25 '18 at 02:17
- TL:DR: only complex instructions like `call` (branching and pushing a return address) decode to multiple uops, or crazy stuff like `rep movsb` (memcpy). Frequently-used integer ALU instructions mostly decode to a single uop each. Out-of-order speculative execution doesn't change how the execution units are actually built; the OoO exec machinery exists to keep the execution units fed with work by finding ILP in a single thread. Yeah, it's pretty different from an in-order pipeline, but whether you microcode a multiply or have dedicated HW is mostly independent of in-order vs. out-of-order exec. – Peter Cordes Oct 25 '18 at 02:24
This page shows the logic gates for a 4×4 combinational multiplier. You can work up from there.
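The structure of such a combinational multiplier is easy to simulate in software: AND gates generate one partial-product bit for every (multiplicand bit, multiplier bit) pair, and full adders compress each column of equal weight, rippling carries into the next column. A sketch (my own gate-level simulation, not the linked page's exact circuit):

```python
def full_adder(a, b, cin):
    # one-bit full adder expressed with XOR, AND, OR gates
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def mul4x4(x, y):
    """4x4 combinational array multiplier: 16 AND gates form the
    partial-product bits; full adders compress each column."""
    cols = [[] for _ in range(8)]          # one bit-list per result weight
    for i in range(4):                     # multiplier bit y_i
        for j in range(4):                 # multiplicand bit x_j
            cols[i + j].append(((x >> j) & 1) & ((y >> i) & 1))
    result = 0
    for k in range(8):
        bits = cols[k]
        while len(bits) > 1:               # reduce the column to one bit
            a = bits.pop()
            b = bits.pop()
            c = bits.pop() if bits else 0  # acts as a half adder if only two
            s, cout = full_adder(a, b, c)
            bits.append(s)
            if k + 1 < 8:
                cols[k + 1].append(cout)   # carry goes to the next column up
        if bits:
            result |= bits[0] << k
    return result

print(mul4x4(13, 11))                      # 143
```

Each full adder conserves value (a + b + cin = s + 2*cout), so the reduction computes the exact sum of the partial products: the product. Wallace and Dadda multipliers do this same column compression, just with gate arrangements chosen to minimize depth or adder count.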
Here is somebody's lab write-up describing how to build a 16-bit multiplier from four 4-bit multipliers, each built with AND gates and full adders. It includes the full design, chip layout, and simulation waveforms.
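The idea of composing a wide multiplier from narrow multiplier blocks is just the schoolbook decomposition: split each operand into high and low halves and combine four narrow products with shifts and adds. A sketch of the principle (shown here as an 8×8 multiply built from four 4×4 blocks; `mul8x8` and `mul4` are my names, and `mul4` stands in for a 4-bit hardware multiplier, whose 4×4 product always fits in 8 bits):

```python
def mul8x8(x, y, mul4=lambda a, b: (a * b) & 0xFF):
    """Compose an 8x8 multiply from four 4x4 multiplies plus shifts/adds:
    (16*xh + xl) * (16*yh + yl)
        = 256*(xh*yh) + 16*(xh*yl + xl*yh) + xl*yl
    """
    xl, xh = x & 0xF, (x >> 4) & 0xF
    yl, yh = y & 0xF, (y >> 4) & 0xF
    return ((mul4(xh, yh) << 8)
            + (mul4(xh, yl) << 4)
            + (mul4(xl, yh) << 4)
            + mul4(xl, yl))

print(mul8x8(200, 150))   # 30000
```

Applying the same decomposition once more builds a 16-bit multiplier out of 8-bit ones, which is how a lab design can scale narrow gate-level blocks up to wider operands.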

- They are still available on the Wayback Machine. https://web.archive.org/web/20091212185618/http://www-unix.ecs.umass.edu/~smckenna/ https://web.archive.org/web/20200216192105/http://www2.elo.utfsm.cl:80/~lsb/elo211/aplicaciones/katz/chapter5/chapter05.doc5.html – Siim Liiser Mar 07 '23 at 06:33