Most CPUs will allow you to start with two operands, each the size of a register, and multiply them together to get a result that fills two registers.
For example, on x86 if you multiply two 32-bit numbers, you'll get the upper 32 bits of the result in EDX and the lower 32 bits of the result in EAX. If you multiply two 64-bit numbers, you get the results in RDX and RAX instead.
On other processors, other registers are used, but the same basic idea applies: one register times one register gives a result that fills two registers.
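To make that concrete, here's a minimal sketch of what the hardware gives you directly, assuming x86-64 and GCC/Clang extended asm (the function name is mine): a single MUL instruction fills RDX:RAX with the full 128-bit product.

```c
#include <stdint.h>

/* Sketch only: x86-64 with GCC/Clang extended asm.  One MUL instruction
 * produces the full 128-bit product: high half in RDX, low half in RAX. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t h, l;
    __asm__("mulq %3"              /* RDX:RAX = RAX * operand */
            : "=d"(h), "=a"(l)     /* %0 = RDX (high half), %1 = RAX (low half) */
            : "a"(a), "rm"(b)      /* %2: load a into RAX; %3: the multiplier */
            : "cc");               /* MUL sets the flags */
    *hi = h;
    *lo = l;
}
```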
C and C++ don't provide an easy way of taking advantage of that capability. When you operate on types smaller than int, the input operands are converted to int, the ints are multiplied, and the result is an int. If the inputs are at least as large as int, they're multiplied as the same type, and the result is the same type. Nothing accounts for the fact that the full result is twice as big as the inputs, even though virtually every processor on earth will produce a result twice as big as each input is individually.
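To be concrete about the narrower case: when a wider type exists, you can get the full product portably by converting first, and compilers generally do turn that into a single widening multiply. It's only at the register width itself that the language leaves you stranded. A quick sketch (the function name is mine):

```c
#include <stdint.h>

/* Portable whenever a wider type exists: widen one operand, then multiply.
 * For 32-bit inputs this yields the full 64-bit product, and compilers
 * typically emit a single 32x32->64 multiply for it. */
uint64_t mul32_full(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* For 64-bit inputs, standard C and C++ offer no wider integer type to
 * convert to, which is exactly the gap described above. */
```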
There are, of course, ways of dealing with that. The simplest is the basic factoring we learned in grade school: break each number into upper and lower halves and multiply the pieces individually: (a+b) * (c+d) = ac + ad + bc + bd, where a and c are the (suitably shifted) upper halves and b and d are the lower halves. Since each piece has only half as many significant bits, each of those multiplications can be done as a half-size operation producing a full-sized result, and each addition that combines them carries out at most a single bit. For example, if we wanted to do 64-bit multiplication on a 64-bit processor to get a 128-bit result, we'd break each 64-bit input up into 32-bit pieces. Then each multiplication would produce a 64-bit result. We'd then add the pieces together (with suitable bit-shifts) to get our final 128-bit result.
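Here's roughly what that looks like for the 64-bit case, as a sketch under those assumptions (the function and variable names are mine): split each operand into 32-bit halves, form the four partial products, then add them with the appropriate shifts while tracking the carries.

```c
#include <stdint.h>

/* Grade-school factoring: split each 64-bit operand into 32-bit halves,
 * do four 32x32->64 multiplications, then combine the partial products. */
void mul64_full(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Add the two middle products; the sum can carry out a single bit. */
    uint64_t mid = p1 + p2;
    uint64_t mid_carry = (mid < p1) ? 1 : 0;

    /* Low 64 bits: p0 plus the low half of the middle term, shifted up. */
    uint64_t low = p0 + (mid << 32);
    uint64_t low_carry = (low < p0) ? 1 : 0;

    /* High 64 bits: p3, the high half of the middle term, and the carries. */
    *hi = p3 + (mid >> 32) + (mid_carry << 32) + low_carry;
    *lo = low;
}
```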
But, as Peter pointed out, when we do that, compilers are not smart enough to recognize what we're trying to accomplish and turn that sequence of multiplications and additions back into a single multiplication producing a result twice as large as each input. Instead, they translate the expression fairly directly into a series of multiplications and additions, so it takes somewhere around four times as long as the single multiplication would have.