
For x64 I can use this:

{
    uint64_t hi, lo;
    // hi:lo = 64-bit x 64-bit multiply of c[0] and b[0]
    // (mulq multiplies RAX by the operand, leaving the high half in RDX and the low half in RAX)

    __asm__("mulq %3\n\t"
            : "=d" (hi),
              "=a" (lo)
            : "%a" (c[0]),
              "rm" (b[0])
            : "cc");

    a[0] += hi;
    a[1] += lo;
}

But I'd like to perform the same calculation portably, so that it also works on x86, for instance.

Peter Cordes
user3360486

3 Answers


As I understand the question, you want a portable pure C implementation of 64-bit multiplication, with output to a 128-bit value stored as two 64-bit values. If so, this article purports to have what you need. The code there is written for C++, but it doesn't take much to turn it into C:

#include <stdint.h>

/* 64-bit x 64-bit -> 128-bit multiply, done in 32-bit halves so that
   no intermediate product or sum can overflow a uint64_t. */
void mult64to128(uint64_t op1, uint64_t op2, uint64_t *hi, uint64_t *lo)
{
    uint64_t u1 = (op1 & 0xffffffff);
    uint64_t v1 = (op2 & 0xffffffff);
    uint64_t t = (u1 * v1);
    uint64_t w3 = (t & 0xffffffff);
    uint64_t k = (t >> 32);

    op1 >>= 32;
    t = (op1 * v1) + k;
    k = (t & 0xffffffff);
    uint64_t w1 = (t >> 32);

    op2 >>= 32;
    t = (u1 * op2) + k;
    k = (t >> 32);

    *hi = (op1 * op2) + w1 + k;
    *lo = (t << 32) + w3;
}
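
For reference, a minimal driver might look like the sketch below (the driver is mine, not part of the linked article, and it assumes mult64to128 from above is in the same file):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t hi, lo;

    /* (1 << 63) * 10 = 5 * 2^64, so hi = 5 and lo = 0 */
    mult64to128((uint64_t)1 << 63, 10, &hi, &lo);
    printf("0x%016" PRIX64 "%016" PRIX64 "\n", hi, lo);
    return 0;
}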
David Heffernan
  • This uses 4 multiplications. You could reduce it to 3 by using Karatsuba's trick. Not sure how much it would increase speed. – Thomas Ahle Jan 24 '23 at 17:58
  • @ThomasAhle: It wouldn't; modern CPUs typically have pipelined multipliers, so you can get 1 result per 1 or 2 clocks. Having to do a lot more adds is worse, especially when you have to work around C's lack of an add-with-carry primitive operation so 128-bit additions cost even more than they would in asm. – Peter Cordes Jan 29 '23 at 14:58
  • You mean it'd be slow because we'd have to worry about overflow when adding two 64-bit numbers? Though sometimes you know the numbers are less than 2^60, say, so you can split into 30-bit chunks and have room to spare for carries. – Thomas Ahle Feb 12 '23 at 23:14
  • @ThomasAhle: didn't see your reply since you didn't @ notify me. It'd be slower because you have significantly more total operations. Multiply isn't that slow compared to addition, and often a more important consideration is total number of micro-uops. (Usually one per instruction). Especially with out-of-order exec being able to overlap execution with surrounding code that hopefully doesn't bottleneck on multiply throughput. If actually compiling this C for a 64-bit CPU, the asm for just this will have a lot of non-multiply instructions to split up 64-bit integer regs to 32-bit halves... – Peter Cordes Jul 29 '23 at 09:59
  • Working with 30-bit or 60-bit chunks can be a good technique if you're limited to pure C. Although probably not needed here since none of these 64-bit additions need to generate carry-out or take carry-in. – Peter Cordes Jul 29 '23 at 10:01

Since you have gcc as a tag, note that you can just use gcc's 128-bit integer type:

typedef unsigned __int128 uint128_t;
// ...
uint64_t x, y;
// ...
uint128_t result = (uint128_t)x * y;
uint64_t lo = result;
uint64_t hi = result >> 64;
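
If you want that as a drop-in helper, a small wrapper might look like this (the wrapper and its name are mine, not part of the answer; it is guarded on __SIZEOF_INT128__, which GCC and Clang define when unsigned __int128 is available):

#include <stdint.h>

#ifdef __SIZEOF_INT128__
/* Returns the low 64 bits of x * y and stores the high 64 bits in *hi. */
static inline uint64_t mul64x64to128(uint64_t x, uint64_t y, uint64_t *hi)
{
    unsigned __int128 product = (unsigned __int128)x * y;
    *hi = (uint64_t)(product >> 64);
    return (uint64_t)product;
}
#endif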
  • [Getting the high part of 64 bit integer multiplication](//stackoverflow.com/a/50958815) shows an ifdeffed version of that with MSVC's `__umulh` intrinsic. MSVC also has a `_umul128`. – Peter Cordes Oct 19 '19 at 00:17

The accepted solution isn't really the best solution, in my opinion.

  • It is confusing to read.
  • It has some funky carry handling.
  • It doesn't take advantage of the fact that 64-bit arithmetic may be available.
  • It displeases ARMv6, the God of Absolutely Ridiculous Multiplies. Whoever uses UMAAL shall not lag but have eternal 64-bit to 128-bit multiplies in 4 instructions.

Joking aside, it is much better to optimize for ARMv6 than any other platform because it will have the most benefit. x86 needs a complicated routine and it would be a dead end optimization.

The best way I have found (and used in xxHash3) selects among multiple implementations with macros. It is a tiny bit slower than mult64to128 on x86 (by 1-2 instructions), but a lot faster on ARMv6:

#include <stdint.h>
#ifdef _MSC_VER
#  include <intrin.h>
#endif

/* Prevents a partial vectorization from GCC. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
  __attribute__((__target__("no-sse")))
#endif
static uint64_t multiply64to128(uint64_t lhs, uint64_t rhs, uint64_t *high)
{
    /*
     * GCC and Clang usually provide __uint128_t on 64-bit targets,
     * although Clang also defines it on WASM despite having to use
     * builtins for most purposes - including multiplication.
     */
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
    __uint128_t product = (__uint128_t)lhs * (__uint128_t)rhs;
    *high = (uint64_t)(product >> 64);
    return (uint64_t)(product & 0xFFFFFFFFFFFFFFFF);

    /* Use the _umul128 intrinsic on MSVC x64 to hint for mulq. */
#elif defined(_MSC_VER) && defined(_M_X64)
#   pragma intrinsic(_umul128)
    /* This intentionally has the same signature. */
    return _umul128(lhs, rhs, high);

#else
    /*
     * Fast yet simple grade school multiply that avoids
     * 64-bit carries with the properties of multiplying by 11
     * and takes advantage of UMAAL on ARMv6 to only need 4
     * calculations.
     */

    /* First calculate all of the cross products. */
    uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* Now add the products together. These will never overflow. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t upper = (hi_lo >> 32) + (cross >> 32)        + hi_hi;

    *high = upper;
    return (cross << 32) | (lo_lo & 0xFFFFFFFF);
#endif /* portable */
}
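
As a quick sanity check (the test driver below is mine, not part of the answer, and assumes it is appended to the same file as multiply64to128), the function can be compared against products whose halves are known by hand; this exercises whichever branch of the #if was actually compiled:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    uint64_t hi, lo;

    /* (2^32) * (2^32) = 2^64, so high = 1 and low = 0 */
    lo = multiply64to128((uint64_t)1 << 32, (uint64_t)1 << 32, &hi);
    assert(hi == 1 && lo == 0);

    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1, so high = 2^64 - 2 and low = 1 */
    lo = multiply64to128(UINT64_MAX, UINT64_MAX, &hi);
    assert(hi == UINT64_MAX - 1 && lo == 1);

    puts("multiply64to128: basic checks passed");
    return 0;
}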

On ARMv6, you can't get much better than this, at least on Clang:

multiply64to128:
        push    {r4, r5, r11, lr}
        umull   r12, r5, r2, r0
        umull   r2, r4, r2, r1
        umaal   r2, r5, r3, r0
        umaal   r4, r5, r3, r1
        ldr     r0, [sp, #16]
        mov     r1, r2
        strd    r4, r5, [r0]
        mov     r0, r12
        pop     {r4, r5, r11, pc}

The accepted solution generates a bunch of add and adc instructions, as well as an extra umull in Clang due to an instcombine bug.

I further explain the portable method in the link I posted.

EasyasPi
    `__attribute__((__target__("no-sse")))` will probably stop it from inlining into functions with different target options, which might defeat constant propagation as well as adding call/ret overhead (especially in the nasty stack-args calling convention most 32-bit code uses). But that's only for 32-bit x86 so it's probably not going to hurt many use-cases. – Peter Cordes Oct 14 '19 at 17:49
  • That is true. However, from testing on a Sandy Bridge, the shuffles completely bottleneck the algorithm. – EasyasPi Oct 14 '19 at 18:59
  • Have you reported the missed-optimization bug to gcc's bugzilla? I was just pointing out that the workaround isn't perfect, but IDK if there's a way to do anything cheaper without using `-fno-tree-vectorize` for the whole file. Your attribute might well be the best choice if `-O3 -march=native` shoots itself in the foot that badly. – Peter Cordes Oct 15 '19 at 01:42