
For x64 I can use this:

{
    uint64_t hi, lo;
    // hi:lo = 64-bit x 64-bit multiply of c[0] and b[0]
    // (mulq multiplies RAX by the operand, leaving the high half in RDX and the low half in RAX)

    __asm__("mulq %3\n\t"
            : "=d" (hi),
              "=a" (lo)
            : "%a" (c[0]),
              "rm" (b[0])
            : "cc");

    a[0] += hi;
    a[1] += lo;
}

But I'd like to perform the same calculation portably, so that it also works on x86, for instance.

Peter Cordes
user3360486

3 Answers


As I understand the question, you want a portable pure C implementation of 64-bit multiplication, with output to a 128-bit value stored as two 64-bit values. If so, this article purports to have what you need. The code there is written for C++, but it doesn't take much to turn it into C:

#include <stdint.h>

/* 64-bit x 64-bit -> 128-bit multiply, done in 32-bit halves so that
   no intermediate product or sum can overflow a uint64_t. */
void mult64to128(uint64_t op1, uint64_t op2, uint64_t *hi, uint64_t *lo)
{
    uint64_t u1 = (op1 & 0xffffffff);
    uint64_t v1 = (op2 & 0xffffffff);
    uint64_t t = (u1 * v1);
    uint64_t w3 = (t & 0xffffffff);
    uint64_t k = (t >> 32);

    op1 >>= 32;
    t = (op1 * v1) + k;
    k = (t & 0xffffffff);
    uint64_t w1 = (t >> 32);

    op2 >>= 32;
    t = (u1 * op2) + k;
    k = (t >> 32);

    *hi = (op1 * op2) + w1 + k;
    *lo = (t << 32) + w3;
}
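
For reference, a minimal driver might look like the sketch below (the driver is mine, not part of the linked article, and it assumes mult64to128 from above is in the same file):

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t hi, lo;

    /* (1 << 63) * 10 = 5 * 2^64, so hi = 5 and lo = 0 */
    mult64to128((uint64_t)1 << 63, 10, &hi, &lo);
    printf("0x%016" PRIX64 "%016" PRIX64 "\n", hi, lo);
    return 0;
}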
David Heffernan
  • This uses 4 multiplications. You could reduce it to 3 by using Karatsuba's trick. Not sure how much it would increase speed. – Thomas Ahle Jan 24 '23 at 17:58
  • @ThomasAhle: It wouldn't; modern CPUs typically have pipelined multipliers, so you can get 1 result per 1 or 2 clocks. Having to do a lot more adds is worse, especially when you have to work around C's lack of an add-with-carry primitive operation so 128-bit additions cost even more than they would in asm. – Peter Cordes Jan 29 '23 at 14:58
  • You mean it'd be slow because we'd have to worry about overflow when adding two 64-bit numbers? Though sometimes you know the numbers are less than 2^60, say, so you can split into 30-bit chunks and have room to spare for carries. – Thomas Ahle Feb 12 '23 at 23:14
  • @ThomasAhle: didn't see your reply since you didn't @ notify me. It'd be slower because you have significantly more total operations. Multiply isn't that slow compared to addition, and often a more important consideration is total number of micro-uops. (Usually one per instruction). Especially with out-of-order exec being able to overlap execution with surrounding code that hopefully doesn't bottleneck on multiply throughput. If actually compiling this C for a 64-bit CPU, the asm for just this will have a lot of non-multiply instructions to split up 64-bit integer regs to 32-bit halves... – Peter Cordes Jul 29 '23 at 09:59
  • Working with 30-bit or 60-bit chunks can be a good technique if you're limited to pure C. Although probably not needed here since none of these 64-bit additions need to generate carry-out or take carry-in. – Peter Cordes Jul 29 '23 at 10:01

Since you have gcc as a tag, note that you can just use gcc's 128-bit integer type:

typedef unsigned __int128 uint128_t;
// ...
uint64_t x, y;
// ...
uint128_t result = (uint128_t)x * y;
uint64_t lo = result;
uint64_t hi = result >> 64;
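
If you want that as a drop-in helper, a small wrapper might look like this (the wrapper and its name are mine, not part of the answer; it is guarded on __SIZEOF_INT128__, which GCC and Clang define when unsigned __int128 is available):

#include <stdint.h>

#ifdef __SIZEOF_INT128__
/* Returns the low 64 bits of x * y and stores the high 64 bits in *hi. */
static inline uint64_t mul64x64to128(uint64_t x, uint64_t y, uint64_t *hi)
{
    unsigned __int128 product = (unsigned __int128)x * y;
    *hi = (uint64_t)(product >> 64);
    return (uint64_t)product;
}
#endif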
  • [Getting the high part of 64 bit integer multiplication](//stackoverflow.com/a/50958815) shows an ifdeffed version of that with MSVC's `__umulh` intrinsic. MSVC also has a `_umul128`. – Peter Cordes Oct 19 '19 at 00:17

The accepted solution isn't really the best solution, in my opinion.

  • It is confusing to read.
  • It has some funky carry handling.
  • It doesn't take advantage of the fact that 64-bit arithmetic may be available.
  • It displeases ARMv6, the God of Absolutely Ridiculous Multiplies. Whoever uses UMAAL shall not lag but have eternal 64-bit to 128-bit multiplies in 4 instructions.

Joking aside, it is much better to optimize for ARMv6 than any other platform because it will have the most benefit. x86 needs a complicated routine and it would be a dead end optimization.

The best way I have found (and used in xxHash3) selects among multiple implementations with macros. It is a tiny bit slower than mult64to128 on x86 (by 1-2 instructions), but a lot faster on ARMv6:

#include <stdint.h>
#ifdef _MSC_VER
#  include <intrin.h>
#endif

/* Prevents a partial vectorization from GCC. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
  __attribute__((__target__("no-sse")))
#endif
static uint64_t multiply64to128(uint64_t lhs, uint64_t rhs, uint64_t *high)
{
    /*
     * GCC and Clang usually provide __uint128_t on 64-bit targets,
     * although Clang also defines it on WASM despite having to use
     * builtins for most purposes - including multiplication.
     */
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
    __uint128_t product = (__uint128_t)lhs * (__uint128_t)rhs;
    *high = (uint64_t)(product >> 64);
    return (uint64_t)(product & 0xFFFFFFFFFFFFFFFF);

    /* Use the _umul128 intrinsic on MSVC x64 to hint for mulq. */
#elif defined(_MSC_VER) && defined(_M_X64)
#   pragma intrinsic(_umul128)
    /* This intentionally has the same signature. */
    return _umul128(lhs, rhs, high);

#else
    /*
     * Fast yet simple grade school multiply that avoids
     * 64-bit carries with the properties of multiplying by 11
     * and takes advantage of UMAAL on ARMv6 to only need 4
     * calculations.
     */

    /* First calculate all of the cross products. */
    uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* Now add the products together. These will never overflow. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
    uint64_t upper = (hi_lo >> 32) + (cross >> 32)        + hi_hi;

    *high = upper;
    return (cross << 32) | (lo_lo & 0xFFFFFFFF);
#endif /* portable */
}
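
As a quick sanity check (the test driver below is mine, not part of the answer, and assumes it is appended to the same file as multiply64to128), the function can be compared against products whose halves are known by hand; this exercises whichever branch of the #if was actually compiled:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    uint64_t hi, lo;

    /* (2^32) * (2^32) = 2^64, so high = 1 and low = 0 */
    lo = multiply64to128((uint64_t)1 << 32, (uint64_t)1 << 32, &hi);
    assert(hi == 1 && lo == 0);

    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1, so high = 2^64 - 2 and low = 1 */
    lo = multiply64to128(UINT64_MAX, UINT64_MAX, &hi);
    assert(hi == UINT64_MAX - 1 && lo == 1);

    puts("multiply64to128: basic checks passed");
    return 0;
}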

On ARMv6, you can't get much better than this, at least on Clang:

multiply64to128:
        push    {r4, r5, r11, lr}
        umull   r12, r5, r2, r0
        umull   r2, r4, r2, r1
        umaal   r2, r5, r3, r0
        umaal   r4, r5, r3, r1
        ldr     r0, [sp, #16]
        mov     r1, r2
        strd    r4, r5, [r0]
        mov     r0, r12
        pop     {r4, r5, r11, pc}

The accepted solution generates a bunch of add and adc instructions, as well as an extra umull in Clang due to an instcombine bug.

I further explain the portable method in the link I posted.

EasyasPi
    `__attribute__((__target__("no-sse")))` will probably stop it from inlining into functions with different target options, which might defeat constant propagation as well as adding call/ret overhead (especially in the nasty stack-args calling convention most 32-bit code uses). But that's only for 32-bit x86 so it's probably not going to hurt many use-cases. – Peter Cordes Oct 14 '19 at 17:49
  • That is true. However, from testing on a Sandy Bridge, the shuffles completely bottleneck the algorithm. – EasyasPi Oct 14 '19 at 18:59
  • Have you reported the missed-optimization bug to gcc's bugzilla? I was just pointing out that the workaround isn't perfect, but IDK if there's a way to do anything cheaper without using `-fno-tree-vectorize` for the whole file. Your attribute might well be the best choice if `-O3 -march=native` shoots itself in the foot that badly. – Peter Cordes Oct 15 '19 at 01:42