Fastest way to multiply two 64-bit ints to 128-bit then >> to 64-bit?

Question

I need to multiply two signed 64-bit integers a and b together, then shift the (128-bit) result to a signed 64-bit integer. What's the fastest way to do that?

My 64-bit integers actually represent fixed-point numbers with fmt fractional bits. fmt is chosen so that a * b >> fmt should not overflow, for instance abs(a) < 64<<fmt and abs(b) < 2<<fmt with fmt==56 would never overflow in 64-bits as the final result would be < 128<<fmt and therefore fit in an int64.

The reason I want to do that is to quickly and precisely evaluate quintic polynomials of the form ((((c5*x + c4)*x + c3)*x + c2)*x + c1)*x + c0 in fixed point format, with every number a signed 64-bit fixed-point number with fmt fractional bits. I'm looking for the most efficient way to achieve that.

Your statement of the question suggests you may have already tried an implementation. If so, can you post your code? — ryyker, Jul 27 '15 at 12:28
I suspect the fastest way to do that is just to do that (assuming you have an existing int128 implementation you can take advantage of). — Oliver Charlesworth, Jul 27 '15 at 12:32
@ryyker I have not, I've only tried the same thing with int32, double and __float128, never with int64, so I've never had to deal with an int128 result. — Michel Rouzic, Jul 27 '15 at 13:07
@Oliver Charlesworth This is meant to be portable code, I'm not aware of a int128 implementation that would be widely available. I figured something that doesn't require an int128 type would be feasible, since after all whatever the compiler might do is something I can do without an int128 type, right? I think my requiring a shift to get an int64 result might allow for some clever tricks. — Michel Rouzic, Jul 27 '15 at 13:08
It'll be helpful to have some info of the ISA. It's usually a lot easier to write it non-portable. — user3528438, Jul 27 '15 at 15:28
@user3528438 alright, it's modern PCs in general, so mostly x86_64. I could have it done a non-portable way with a portable fallback if that's necessary. — Michel Rouzic, Jul 27 '15 at 17:57
See the suggestion here of using SSE4, http://stackoverflow.com/questions/17863411/sse-multiplication-of-2-64-bit-integers — Jens Munk, Jul 27 '15 at 19:25
[Efficient computation of the high order bits of a multiplication](http://stackoverflow.com/q/1396942/995714), [Reasonably portable way to get top 64-bits from 64x64 bit multiply?](http://stackoverflow.com/q/26852435/995714), [Getting the high part of 64 bit integer multiplication](http://stackoverflow.com/q/28868367/995714) — phuclv, Apr 26 '16 at 16:17

njuffa · Accepted Answer · 2016-04-26T15:57:04.557

As a commenter on the question pointed out, this is most easily accomplished efficiently by machine-dependent code, rather than by portable code. The asker states that the main platform is x86_64, and that has a built-in instruction for performing 64 ✕ 64 → 128 bit multiplication. This is easily accessed using a small piece of inline assembly. Note that details of inline assembly may differ somewhat with compiler, the code below was built with the Intel C/C++ compiler.

#include <stdint.h>

/* compute mul_wide (a, b) >> s, for s in [0,63] */
int64_t mulshift (int64_t a, int64_t b, int s)
{
    int64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"          // rax = a
        "movl  %3, %%ecx;\n\t"          // ecx = s
        "imulq %2;\n\t"                 // rdx:rax = a * b
        "shrdq %%cl, %%rdx, %%rax;\n\t" // rax = int64_t (rdx:rax >> s)
        "movq  %%rax, %0;\n\t"          // res = rax
        : "=rm" (res)
        : "rm"(a), "rm"(b), "rm"(s)
        : "%rax", "%rdx", "%ecx");
    return res;
}

A portable C99 equivalent to the above code is shown below. I have tested this extensively against the inline assembly version and no mismatches were found.

void umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint64_t)(uint32_t)a;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = (uint64_t)(uint32_t)b;
    uint64_t b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    uint32_t cy = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);

    *lo = p0 + (p1 << 32) + (p2 << 32);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + cy;
}

void mul64wide (int64_t a, int64_t b, int64_t *hi, int64_t *lo)
{
    umul64wide ((uint64_t)a, (uint64_t)b, (uint64_t *)hi, (uint64_t *)lo);
    if (a < 0LL) *hi -= b;
    if (b < 0LL) *hi -= a;
}

/* compute mul_wide (a, b) >> s, for s in [0,63] */
int64_t mulshift (int64_t a, int64_t b, int s)
{
    int64_t res;
    int64_t hi, lo;
    mul64wide (a, b, &hi, &lo);
    if (s) {
        res = ((uint64_t)hi << (64 - s)) | ((uint64_t)lo >> s);
    } else {
        res = lo;
    }
    return res;
}

Was about to make an implementation by combining 32x32->64 bit multipliers. Didn't have the imulq instruction. Verified your solution - it works as anticipated — Jens Munk, Jul 27 '15 at 21:10
Awesome thank you! Now I just need a portable fallback (for the still somewhat necessary 32-bit builds or eventual other platforms) to go along with it. — Michel Rouzic, Jul 28 '15 at 20:46
Let me see what I can do in terms of portable fallback code. Shouldn't be too hard. — njuffa, Jul 28 '15 at 22:57
Rather than using inline assembly, try this: #include uint64_t multophalf_intrinsic(uint64_t a, uint64_t b) { unsigned long long hi = 0; _mulx_u64(a, b, &hi); return hi; } — jorgbrown, Mar 07 '18 at 23:03

Fastest way to multiply two 64-bit ints to 128-bit then >> to 64-bit?

1 Answers1

Linked