Most CPUs will allow you to start with two operands, each the size of a register, and multiply them together to get a result that fills two registers.
For example, on x86 if you multiply two 32-bit numbers, you'll get the upper 32 bits of the result in EDX and the lower 32 bits of the result in EAX. If you multiply two 64-bit numbers, you get the results in RDX and RAX instead.
On other processors, other registers are used, but the same basic idea applies: one register times one register gives a result that fills two registers.
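To make that concrete, here's a minimal sketch of what the hardware gives you directly, assuming x86-64 and GCC/Clang extended asm (the function name is mine): a single MUL instruction fills RDX:RAX with the full 128-bit product.

```c
#include <stdint.h>

/* Sketch only: x86-64 with GCC/Clang extended asm.  One MUL instruction
 * produces the full 128-bit product: high half in RDX, low half in RAX. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t h, l;
    __asm__("mulq %3"              /* RDX:RAX = RAX * operand */
            : "=d"(h), "=a"(l)     /* %0 = RDX (high half), %1 = RAX (low half) */
            : "a"(a), "rm"(b)      /* %2: load a into RAX; %3: the multiplier */
            : "cc");               /* MUL sets the flags */
    *hi = h;
    *lo = l;
}
```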
C and C++ don't provide an easy way of taking advantage of that capability. When you operate on types smaller than int, the input operands are converted to int, the ints are multiplied, and the result is an int. If the inputs are at least as large as int, they're multiplied as the same type, and the result is the same type. Nothing accounts for the fact that the full result is twice as big as the inputs, even though virtually every processor on earth will produce a result twice as big as each input is individually.
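To be concrete about the narrower case: when a wider type exists, you can get the full product portably by converting first, and compilers generally do turn that into a single widening multiply. It's only at the register width itself that the language leaves you stranded. A quick sketch (the function name is mine):

```c
#include <stdint.h>

/* Portable whenever a wider type exists: widen one operand, then multiply.
 * For 32-bit inputs this yields the full 64-bit product, and compilers
 * typically emit a single 32x32->64 multiply for it. */
uint64_t mul32_full(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* For 64-bit inputs, standard C and C++ offer no wider integer type to
 * convert to, which is exactly the gap described above. */
```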
There are, of course, ways of dealing with that. The simplest is the basic factoring we learned in grade school: break each number into upper and lower halves and multiply the pieces individually: (a+b) * (c+d) = ac + ad + bc + bd, where a and c are the (suitably shifted) upper halves and b and d are the lower halves. Since each piece has only half as many significant bits, each of those multiplications can be done as a half-size operation producing a full-sized result, and each addition that combines them carries out at most a single bit. For example, if we wanted to do 64-bit multiplication on a 64-bit processor to get a 128-bit result, we'd break each 64-bit input up into 32-bit pieces. Then each multiplication would produce a 64-bit result. We'd then add the pieces together (with suitable bit-shifts) to get our final 128-bit result.
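Here's roughly what that looks like for the 64-bit case, as a sketch under those assumptions (the function and variable names are mine): split each operand into 32-bit halves, form the four partial products, then add them with the appropriate shifts while tracking the carries.

```c
#include <stdint.h>

/* Grade-school factoring: split each 64-bit operand into 32-bit halves,
 * do four 32x32->64 multiplications, then combine the partial products. */
void mul64_full(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Add the two middle products; the sum can carry out a single bit. */
    uint64_t mid = p1 + p2;
    uint64_t mid_carry = (mid < p1) ? 1 : 0;

    /* Low 64 bits: p0 plus the low half of the middle term, shifted up. */
    uint64_t low = p0 + (mid << 32);
    uint64_t low_carry = (low < p0) ? 1 : 0;

    /* High 64 bits: p3, the high half of the middle term, and the carries. */
    *hi = p3 + (mid >> 32) + (mid_carry << 32) + low_carry;
    *lo = low;
}
```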
But, as Peter pointed out, when we do that, compilers are not smart enough to recognize what we're trying to accomplish and turn that sequence of multiplications and additions back into a single multiplication producing a result twice as large as each input. Instead, they translate the expression fairly directly into a series of multiplications and additions, so it takes somewhere around four times as long as the single multiplication would have.