4

I need to multiply two signed 64-bit integers a and b together, then shift the (128-bit) result to a signed 64-bit integer. What's the fastest way to do that?

My 64-bit integers actually represent fixed-point numbers with fmt fractional bits. fmt is chosen so that a * b >> fmt should not overflow, for instance abs(a) < 64<<fmt and abs(b) < 2<<fmt with fmt==56 would never overflow in 64-bits as the final result would be < 128<<fmt and therefore fit in an int64.

The reason I want to do that is to quickly and precisely evaluate quintic polynomials of the form ((((c5*x + c4)*x + c3)*x + c2)*x + c1)*x + c0 in fixed point format, with every number a signed 64-bit fixed-point number with fmt fractional bits. I'm looking for the most efficient way to achieve that.

Michel Rouzic
  • 1,013
  • 1
  • 9
  • 22
  • 2
    Your statement of the question suggests you may have already tried an implementation. If so, can you post your code? – ryyker Jul 27 '15 at 12:28
  • I suspect the fastest way to do that is just to do that (assuming you have an existing int128 implementation you can take advantage of). – Oliver Charlesworth Jul 27 '15 at 12:32
  • @ryyker I have not, I've only tried the same thing with int32, double and __float128, never with int64, so I've never had to deal with an int128 result. – Michel Rouzic Jul 27 '15 at 13:07
  • @Oliver Charlesworth This is meant to be portable code, I'm not aware of a int128 implementation that would be widely available. I figured something that doesn't require an int128 type would be feasible, since after all whatever the compiler might do is something I can do without an int128 type, right? I think my requiring a shift to get an int64 result might allow for some clever tricks. – Michel Rouzic Jul 27 '15 at 13:08
  • 1
    It'll be helpful to have some info of the ISA. It's usually a lot easier to write it non-portable. – user3528438 Jul 27 '15 at 15:28
  • @user3528438 alright, it's modern PCs in general, so mostly x86_64. I could have it done a non-portable way with a portable fallback if that's necessary. – Michel Rouzic Jul 27 '15 at 17:57
  • See the suggestion here of using SSE4, http://stackoverflow.com/questions/17863411/sse-multiplication-of-2-64-bit-integers – Jens Munk Jul 27 '15 at 19:25
  • [Efficient computation of the high order bits of a multiplication](http://stackoverflow.com/q/1396942/995714), [Reasonably portable way to get top 64-bits from 64x64 bit multiply?](http://stackoverflow.com/q/26852435/995714), [Getting the high part of 64 bit integer multiplication](http://stackoverflow.com/q/28868367/995714) – phuclv Apr 26 '16 at 16:17

1 Answers1

10

As a commenter on the question pointed out, this is most easily accomplished efficiently by machine-dependent code, rather than by portable code. The asker states that the main platform is x86_64, and that has a built-in instruction for performing 64 ✕ 64 → 128 bit multiplication. This is easily accessed using a small piece of inline assembly. Note that details of inline assembly may differ somewhat with compiler, the code below was built with the Intel C/C++ compiler.

#include <stdint.h>

/* compute mul_wide (a, b) >> s, for s in [0,63] */
int64_t mulshift (int64_t a, int64_t b, int s)
{
    int64_t res;
    __asm__ (
        "movq  %1, %%rax;\n\t"          // rax = a
        "movl  %3, %%ecx;\n\t"          // ecx = s
        "imulq %2;\n\t"                 // rdx:rax = a * b
        "shrdq %%cl, %%rdx, %%rax;\n\t" // rax = int64_t (rdx:rax >> s)
        "movq  %%rax, %0;\n\t"          // res = rax
        : "=rm" (res)
        : "rm"(a), "rm"(b), "rm"(s)
        : "%rax", "%rdx", "%ecx");
    return res;
}

A portable C99 equivalent to the above code is shown below. I have tested this extensively against the inline assembly version and no mismatches were found.

void umul64wide (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint64_t)(uint32_t)a;
    uint64_t a_hi = a >> 32;
    uint64_t b_lo = (uint64_t)(uint32_t)b;
    uint64_t b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    uint32_t cy = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);

    *lo = p0 + (p1 << 32) + (p2 << 32);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + cy;
}

void mul64wide (int64_t a, int64_t b, int64_t *hi, int64_t *lo)
{
    umul64wide ((uint64_t)a, (uint64_t)b, (uint64_t *)hi, (uint64_t *)lo);
    if (a < 0LL) *hi -= b;
    if (b < 0LL) *hi -= a;
}

/* compute mul_wide (a, b) >> s, for s in [0,63] */
int64_t mulshift (int64_t a, int64_t b, int s)
{
    int64_t res;
    int64_t hi, lo;
    mul64wide (a, b, &hi, &lo);
    if (s) {
        res = ((uint64_t)hi << (64 - s)) | ((uint64_t)lo >> s);
    } else {
        res = lo;
    }
    return res;
}
njuffa
  • 23,970
  • 4
  • 78
  • 130
  • Was about to make an implementation by combining 32x32->64 bit multipliers. Didn't have the imulq instruction. Verified your solution - it works as anticipated – Jens Munk Jul 27 '15 at 21:10
  • Awesome thank you! Now I just need a portable fallback (for the still somewhat necessary 32-bit builds or eventual other platforms) to go along with it. – Michel Rouzic Jul 28 '15 at 20:46
  • 1
    Let me see what I can do in terms of portable fallback code. Shouldn't be too hard. – njuffa Jul 28 '15 at 22:57
  • 3
    Rather than using inline assembly, try this: #include uint64_t multophalf_intrinsic(uint64_t a, uint64_t b) { unsigned long long hi = 0; _mulx_u64(a, b, &hi); return hi; } – jorgbrown Mar 07 '18 at 23:03