43

In C++, say that:

uint64_t i;
uint64_t j;

then i * j will yield an uint64_t that has as value the lower part of the multiplication between i and j, i.e., (i * j) mod 2^64. Now, what if I wanted the higher part of the multiplication? I know that there exists an assembly instruction to do something like that when using 32 bit integers, but I am not familiar at all with assembly, so I was hoping for help.

What is the most efficient way to make something like:

uint64_t k = mulhi(i, j);
lehins
  • 9,642
  • 2
  • 35
  • 49
Matteo Monti
  • 8,362
  • 19
  • 68
  • 114
  • 2
    Reference: http://blogs.msdn.com/b/oldnewthing/archive/2014/12/08/10578956.aspx –  Mar 05 '15 at 01:26
  • GCC has `uint128_t` for this purpose. Visual Studio has no such option though. – Mooing Duck Mar 05 '15 at 01:26
  • @MooingDuck Looks like uint128_t doesn't exist under my environment (I am using Xcode under osx). Moreover, that will explicitly compute both the higher and lower part of that multiplication, which I would like to avoid. – Matteo Monti Mar 05 '15 at 01:49
  • @NickyC thanks for the reference! As I said, I have little to no experience with assembly. Could you provide a simple example code that will do what I need? Sorry, I should definitely study assembly once and for all! – Matteo Monti Mar 05 '15 at 01:52
  • 2
    @MatteoMonti It is not feasible to compute higher part without lower part because the carry from lower part propagates to the higher part. –  Mar 05 '15 at 02:00
  • 1
    @MatteoMonti It is not about assembly. I am just trying to show you the Maths. –  Mar 05 '15 at 02:02
  • If performance is not a big concern, try an arbitrary length integer class to get the result. – Neil Kirk Mar 05 '15 at 02:04
  • @NeilKirk performance is my main concern, actually... – Matteo Monti Mar 05 '15 at 02:36
  • So if I migrate to a platform where I have `uint128_t` that is likely to be the most efficient way to do what i need? – Matteo Monti Mar 05 '15 at 02:37
  • 1
    If performance is the real concern. You need to learn enough assembly to code this inline. On a 64 bit processor, there will (should? ) be instructions to multiply upper and lower 32 bit numbers. – Degustaf Mar 05 '15 at 04:27
  • 3
    http://stackoverflow.com/questions/25095741/how-can-i-multiply-64-bit-operands-and-get-128-bit-result-portably http://stackoverflow.com/questions/28766755/compute-the-doubleword-product-signed-of-two-words-given-the-lower-word-produc?lq=1 http://stackoverflow.com/questions/87771/how-can-i-multiply-two-64-bit-numbers-using-x86-assembly-language http://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit#comment45891221_28807341 – phuclv Mar 05 '15 at 04:38
  • 2
    There is `__int128` in [gcc](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html) as well as [llvm including Apple Clang](http://stackoverflow.com/a/16448877/995714). http://stackoverflow.com/questions/13187629/gcc-intrinsic-for-extended-division-multiplication – phuclv Mar 05 '15 at 04:46
  • 4
    Some more [high bits of long multiplication in Java?](http://stackoverflow.com/questions/18859207/high-bits-of-long-multiplication-in-java) [Computing high 64 bits of a 64x64 int product in C](http://stackoverflow.com/questions/1541426/computing-high-64-bits-of-a-64x64-int-product-in-c) [Reasonably portable way to get top 64-bits from 64x64 bit multiply?](http://stackoverflow.com/questions/26852435/reasonably-portable-way-to-get-top-64-bits-from-64x64-bit-multiply) [Pure high-bit multiplication in assembly?](http://stackoverflow.com/questions/10478253/pure-high-bit-multiplication-in-assembly) – phuclv Mar 05 '15 at 14:01
  • @LưuVĩnhPhúc thank you, I think I will just use 128 bits multiplication at this point. That sounds more performing than any other solution I could implement on my own, since I suppose that any possible optimization must have already been implemented by those that developed the compiler. – Matteo Monti Mar 05 '15 at 14:14
  • 1
    @phuclv: This question is not a duplicate of the one linked. That other question is focusing on 32 bit multiplications while this one is focusing on 64 bit multiplications. When people come to this question they follow the link (like I did) and have to go back to this question. I think it should be reopened (and maybe closed again with a better dup). – Arnaud Jun 19 '18 at 16:09
  • @Arnaud there should be no difference. Simply double every variable type and the problem is solved – phuclv Jun 19 '18 at 16:17
  • but yes, probably the other question doesn't have good enough generic answer – phuclv Jun 19 '18 at 16:24
  • @phuclv You cannot double the variable types as easily. You would need a 128 bit integer type. – Arnaud Jun 19 '18 at 17:37
  • @Arnaud no, you don't need that *just to take the higher part of a 64x64 multiplication*, for example just widen the assembly instructions in the other question and you're good to go. And did you see my other linked questions? – phuclv Jun 20 '18 at 00:53
  • 1
    @phuclv This is a C++ question, not an assembly question. Of course I can find a solution in assembly multiplying two 64 bit registers. The whole point is to know whether this is possible portably in C++. And if you are stuck in portable C++, the 32 bit question has a trivial answer (multiply two std::uint64_t) and the 64 bit question is difficult (because we don't have a std::uint128_t) – Arnaud Jun 20 '18 at 10:49
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/173467/discussion-between-arnaud-and-phuclv). – Arnaud Jun 20 '18 at 11:12
  • A better dupe is [Computing high 64 bits of a 64x64 int product in C](/q/1541426) - and that has [an answer](/a/1541458) that clearly shows how to derive good results for similar problems. – Toby Speight Jun 20 '18 at 14:26

5 Answers5

27

If you're using gcc and the version you have supports 128 bit numbers (try using __uint128_t) then performing the 128 multiply and extracting the upper 64 bits is likely to be the most efficient way of getting the result.

If your compiler doesn't support 128 bit numbers, then Yakk's answer is correct. However, it may be too brief for general consumption. In particular, an actual implementation has to be careful of overflowing 64 bit integers.

The simple and portable solution he proposes is to break each of a and b into 2 32-bit numbers and then multiply those 32 bit numbers using the 64 bit multiply operation. If we write:

uint64_t a_lo = (uint32_t)a;
uint64_t a_hi = a >> 32;
uint64_t b_lo = (uint32_t)b;
uint64_t b_hi = b >> 32;

then it is obvious that:

a = (a_hi << 32) + a_lo;
b = (b_hi << 32) + b_lo;

and:

a * b = ((a_hi << 32) + a_lo) * ((b_hi << 32) + b_lo)
      = ((a_hi * b_hi) << 64) +
        ((a_hi * b_lo) << 32) +
        ((b_hi * a_lo) << 32) +
          a_lo * b_lo

provided the calculation is performed using 128 bit (or greater) arithmetic.

But this problem requires that we perform all the calculcations using 64 bit arithmetic, so we have to worry about overflow.

Since a_hi, a_lo, b_hi, and b_lo are all unsigned 32 bit numbers, their product will fit in an unsigned 64 bit number without overflow. However, the intermediate results of the above calculation will not.

The following code will implement mulhi(a, b) when the mathemetics must be performed modulo 2^64:

uint64_t    a_lo = (uint32_t)a;
uint64_t    a_hi = a >> 32;
uint64_t    b_lo = (uint32_t)b;
uint64_t    b_hi = b >> 32;

uint64_t    a_x_b_hi =  a_hi * b_hi;
uint64_t    a_x_b_mid = a_hi * b_lo;
uint64_t    b_x_a_mid = b_hi * a_lo;
uint64_t    a_x_b_lo =  a_lo * b_lo;

uint64_t    carry_bit = ((uint64_t)(uint32_t)a_x_b_mid +
                         (uint64_t)(uint32_t)b_x_a_mid +
                         (a_x_b_lo >> 32) ) >> 32;

uint64_t    multhi = a_x_b_hi +
                     (a_x_b_mid >> 32) + (b_x_a_mid >> 32) +
                     carry_bit;

return multhi;
                                              

As Yakk points out, if you don't mind being off by +1 in the upper 64 bits, you can omit the calculation of the carry bit.

András Kovács
  • 29,931
  • 3
  • 53
  • 99
craigster0
  • 483
  • 3
  • 9
  • This doesn't seem to work. I've pasted the code in to an online compiler, set `a` and `b` to some numbers, but multhi is always `0`. – intrigued_66 Dec 11 '22 at 23:35
  • 1
    This does work (albeit slowly on 64-bit ISAs because portable ISO C sucks at exposing cool stuff most CPUs can do). The previous commenter was using it with small inputs such that the product fit in the low 64 bits, so the zero was the correct high half result. (For details, see [the question they asked](https://stackoverflow.com/questions/74765727/multiplying-two-64-bit-numbers-using-64-bit-registers-by-splitting-in-to-32-bit).) – Peter Cordes Dec 19 '22 at 20:52
17

TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction. No need to mess around with inline asm.


Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)


GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms. (Or in older versions, __uint128_t). GCC doesn't implement this type on 32-bit platforms, though.

This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)

MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (and specifically x86-64 and AArch64. The docs also mention IPF (IA-64) having _umul128 available, but I don't have MSVC for Itanium available. (Probably not relevant anyway.)

#define HAVE_FAST_mul64 1

#ifdef __SIZEOF_INT128__     // GNU C
 static inline
 uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int128 prod =  a * (unsigned __int128)b;
     return prod >> 64;
 }

#elif defined(_M_X64) || defined(_M_ARM64)     // MSVC
   // MSVC for x86-64 or AArch64
   // possibly also  || defined(_M_IA64) || defined(_WIN64)
   // but the docs only guarantee x86-64!  Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux

  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umulh
  #include <intrin.h>
  #define mulhi64 __umulh

#elif defined(_M_IA64) // || defined(_M_ARM)       // MSVC again
  // https://learn.microsoft.com/en-gb/cpp/intrinsics/umul128
  // incorrectly say that _umul128 is available for ARM
  // which would be weird because there's no single insn on AArch32
  #include <intrin.h>
  static inline
  uint64_t mulhi64(uint64_t a, uint64_t b) {
     unsigned __int64 HighProduct;
     (void)_umul128(a, b, &HighProduct);
     return HighProduct;
  }

#else

# undef HAVE_FAST_mul64
  uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
  // or you might want to define @craigster0's version here so it can inline.
#endif

For x86-64, AArch64, and PowerPC64 (and others), this compiles to one mul instruction, and a couple movs to deal with the calling convention (which should optimize away after this inlines). From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):

     # x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
     # MSVC makes basically the same function, but with different regs for x64 __fastcall
    mov     rax, rsi
    mul     rdi              # RDX:RAX = RAX * RDI
    mov     rax, rdx
    ret

(or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi to put the high-half in RAX directly. gcc is dumb and still uses an extra mov.)

For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):

test_var:
    umulh   x0, x0, x1
    ret

With a compile-time constant power of 2 multiplier, we usually get the expected right-shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).


Unfortunately current compilers don't optimize @craigster0's nice portable version. You get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions for x86-64. i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacks of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.

A full-multiply mul 64 instruction is 2 uops on Intel CPUs, but still only 3 cycle latency, same as imul r64,r64 which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.

Check it out on the Godbolt compiler explorer on the above link.

gcc does fully optimize this function when multiplying by 16, though: you get a single right shift, more efficient than with unsigned __int128 multiply.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Related: it seems I wrote an earlier version of this answer on a different question a couple years before this: [Getting the high half and low half of a full integer multiply](https://stackoverflow.com/a/35198488) – Peter Cordes Apr 22 '21 at 09:42
10

This is a unit-tested version that I came up with tonight that provides the full 128-bit product. On inspection it seems to be simpler than most of the other solutions online (in e.g. Botan library and other answers here) because it takes advantage of the how the MIDDLE PART does not overflow as explained in the code comments.

For context I wrote it for this github project: https://github.com/catid/fp61

//------------------------------------------------------------------------------
// Portability Macros

// Compiler-specific force inline keyword
#ifdef _MSC_VER
# define FP61_FORCE_INLINE inline __forceinline
#else
# define FP61_FORCE_INLINE inline __attribute__((always_inline))
#endif


//------------------------------------------------------------------------------
// Portable 64x64->128 Multiply
// CAT_MUL128: r{hi,lo} = x * y

// Returns low part of product, and high part is set in r_hi
FP61_FORCE_INLINE uint64_t Emulate64x64to128(
    uint64_t& r_hi,
    const uint64_t x,
    const uint64_t y)
{
    const uint64_t x0 = (uint32_t)x, x1 = x >> 32;
    const uint64_t y0 = (uint32_t)y, y1 = y >> 32;
    const uint64_t p11 = x1 * y1, p01 = x0 * y1;
    const uint64_t p10 = x1 * y0, p00 = x0 * y0;
    /*
        This is implementing schoolbook multiplication:

                x1 x0
        X       y1 y0
        -------------
                   00  LOW PART
        -------------
                00
             10 10     MIDDLE PART
        +       01
        -------------
             01 
        + 11 11        HIGH PART
        -------------
    */

    // 64-bit product + two 32-bit values
    const uint64_t middle = p10 + (p00 >> 32) + (uint32_t)p01;

    /*
        Proof that 64-bit products can accumulate two more 32-bit values
        without overflowing:

        Max 32-bit value is 2^32 - 1.
        PSum = (2^32-1) * (2^32-1) + (2^32-1) + (2^32-1)
             = 2^64 - 2^32 - 2^32 + 1 + 2^32 - 1 + 2^32 - 1
             = 2^64 - 1
        Therefore it cannot overflow regardless of input.
    */

    // 64-bit product + two 32-bit values
    r_hi = p11 + (middle >> 32) + (p01 >> 32);

    // Add LOW PART and lower half of MIDDLE PART
    return (middle << 32) | (uint32_t)p00;
}

#if defined(_MSC_VER) && defined(_WIN64)
// Visual Studio 64-bit

# include <intrin.h>
# pragma intrinsic(_umul128)
# define CAT_MUL128(r_hi, r_lo, x, y) \
    r_lo = _umul128(x, y, &(r_hi));

#elif defined(__SIZEOF_INT128__)
// Compiler supporting 128-bit values (GCC/Clang)

# define CAT_MUL128(r_hi, r_lo, x, y)                   \
    {                                                   \
        unsigned __int128 w = (unsigned __int128)x * y; \
        r_lo = (uint64_t)w;                             \
        r_hi = (uint64_t)(w >> 64);                     \
    }

#else
// Emulate 64x64->128-bit multiply with 64x64->64 operations

# define CAT_MUL128(r_hi, r_lo, x, y) \
    r_lo = Emulate64x64to128(r_hi, x, y);

#endif // End CAT_MUL128
catid
  • 116
  • 1
  • 3
  • (Your comments alternate between *above the code* and *below the code*.) – greybeard Aug 02 '18 at 20:52
  • 2
    I ported this to [C#](https://gist.github.com/cocowalla/6070a53445e872f2bb24304712a3e1d2), and it's faster than any other 64x64 function I've come across! – Cocowalla Apr 23 '19 at 09:17
  • Don't know if it matters, but this breaks on Aarch64. `_umul128` is not available. – jww Oct 18 '19 at 15:50
  • How about signed multiply? Does that just remember the sign, do unsigned multiply, and apply the sign? – NightElfik Sep 21 '21 at 04:06
4

Long multiplication should be ok performance.

Split a*b into (hia+loa)*(hib+lob). This gives 4 32 bit multiplies plus some shifts. Do them in 64 bits, and do the carries manually, and you'll get the high portion.

Note that an approximation of the high portion can be done with fewer multiplies -- accurate within 2^33 or so with 1 multiply, and within 1 with 3 multiplies.

I do not think there is a portable alternative.

Yakk - Adam Nevraumont
  • 262,606
  • 27
  • 330
  • 524
-3

Here's the asm for ARMv8 or Aarch64 version:

// High (p1) and low (p0) product
uint64_t p0, p1;
// multiplicand and multiplier
uint64_t a = ..., b = ...;

p0 = a*b; asm ("umulh %0,%1,%2" : "=r"(p1) : "r"(a), "r"(b));

And here's the asm for old DEC compilers:

p0 = a*b; p1 = asm("umulh %a0, %a1, %v0", a, b);

If you have x86's BMI2 and would like to use mulxq:

asm ("mulxq %3, %0, %1" : "=r"(p0), "=r"(p1) : "d"(a), "r"(b));

And the generic x86 multiply using mulq:

asm ("mulq %3" : "=a"(p0), "=d"(p1) : "a"(a), "g"(b) : "cc");
jww
  • 97,681
  • 90
  • 411
  • 885
  • 1
    This is GNU C inline asm, which means you could have used `unsigned __int128` instead, like my answer shows. What's the use-case for this? Do some GCC or clang versions fail to emit just a `umulh` for `(a * (unsigned __int128)b) >> 64`? Oh, I just had a look at my answer, and it shows AArch64 GCC emitting `umulh`. – Peter Cordes Oct 18 '19 at 21:29
  • @Peter - You did not answer OP's question. He wanted the asm for the operation; not a disassembly of C code. – jww Oct 18 '19 at 21:34
  • 2
    That's obviously not the case; the accepted answer is pure C++ with no mention of asm or inline asm. I would highly recommend against future readers ever using inline asm for this, especially on a 64-bit target, so showing how to wrap `umulh` with GNU C inline asm seems totally useless to me. Especially when my answer already shows that the instruction exists. – Peter Cordes Oct 18 '19 at 21:46
  • Whatever, @Peter. I try to answer the question that was asked. You are free to answer the question you wished was asked. – jww Oct 18 '19 at 21:49
  • 3
    I don't think I'm engaging in wishful thinking. I see we disagree over interpreting the question (and/or what might be useful to future readers). The question says "but I am not familiar at all with assembly, so I was hoping for help." They're avoiding an XY problem by asking how to get the high half in C++, with inline asm as an *option* they thing might be useful, not a requirement. It's not even tagged `[inline-assembly]`. – Peter Cordes Oct 18 '19 at 21:58
  • Anyway, I'm also free to downvote answers I think are poor advice for future readers, and have done so. Is there a compiler that accepts this but not `__int128`? (Other than ancient gcc). If so, say so in your answer and I'll upvote. – Peter Cordes Oct 18 '19 at 21:59
  • `"g"` includes immediate, which x86 mul doesn't support. https://godbolt.org/z/r0NeIi. Probably best to use `"r"` to stop clang from shooting itself in the foot and storing first if you give it the option of memory. – Peter Cordes Oct 20 '19 at 16:46
  • 2
    The OP clearly isn't asking specifically for assembly. They only say they know such assembly exists and ask for the best way to get the compiler to emit those instructions. Inline assembly hasn't been advisable where it can be avoided for well over a decade. – Tim Seguine Mar 12 '21 at 21:04