3

I'm looking for a fast method to compute  (ab) modulo n  (in the mathematical sense) for a, b, n of type uint64_t. I could live with preconditions such as n != 0, or even a < n && b < n.

Notice that the C expression (a*b)%n won't cut it, because the product is truncated to 64 bits. I'm looking for (uint64_t)(((uint128_t)a*b)%n) except that I do not have a uint128_t (that I know, in Visual C++).

I'm in for a Visual C++ (preferably) or GCC/clang intrinsic making best use of the underlying hardware available on x86-64 platforms; or, if that can't be done, for a portable inline function.
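(For reference, GCC and clang do provide the needed 128-bit type on x86-64, so there the expression from the question works as-is; the function name below is mine:)

```c
#include <stdint.h>

// Exactly the expression from the question, using GCC/clang's
// unsigned __int128 (not available in Visual C++):
uint64_t mulmod_u128(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)(((unsigned __int128)a * b) % n);
}
```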

Peter Cordes
fgrieu
  • Use a big-number library? It might even have fast modulo. – this Dec 13 '13 at 17:10
  • @self: This does not lead to a fast solution, I'm afraid. I know GMP, and love it for a quickly done program under MinGW32. But when performance really counts and small arguments, it may not be fast enough for my taste. I admit I did not check. – fgrieu Dec 13 '13 at 17:26
  • @self: These libraries are virtually always optimized for very large numbers (and sometimes smaller-than-word-sized integers too). Their algorithms, however, can probably be repurposed for this special case with some loop unrolling and manual register allocation. –  Dec 13 '13 at 17:27
  • Seems to be a duplicate of this one? http://stackoverflow.com/questions/12168348/ways-to-do-modulo-multiplication-with-primitive-types – Nikolay Dec 13 '13 at 17:28
  • @Nikolay: looks very much like the second example [there](http://stackoverflow.com/a/12168589/903600) does exactly what I want! – fgrieu Dec 13 '13 at 17:34
  • @fgrieu: note that the example you linked to uses inline assembler, which MSVC doesn't support for 64-bit targets. However, it should be easy enough to make it a stand-alone assembler function that's linked in. Or you could compile that one C function with inline assembler using MinGW and link that object file in to your MSVC project. – Michael Burr Dec 13 '13 at 18:22
  • Is there a reason why `((a%n)*(b%n))%n` won't do? – Guido Dec 13 '13 at 19:22
  • @Guido: `((a%n)*(b%n))%n` won't do when `n` is large. Example: `a=b=0x123456789` and `n=0x234567891`. `a%n` and `b%n` are `0x123456789`. `(a%n)*(b%n)` is `0x4b66dc326fb98751`, truncated from `0x14b66dc326fb98751`. `((a%n)*(b%n))%n` is `0x11b3c4d59`, instead of the mathematically correct `0x141cebed5`. – fgrieu Dec 14 '13 at 05:46
  • It's easy to do multi-precision multiply/add/subtract. Hard to do multi-precision divide. There are a few tricks, but they don't help much. In fact, it's sometimes fastest to do successive approximation, if the multiply is fast enough. – Hot Licks Jan 09 '14 at 23:07
  • possible duplicate of [Most accurate way to do a combined multiply-and-divide operation in 64-bit?](http://stackoverflow.com/questions/8733178/most-accurate-way-to-do-a-combined-multiply-and-divide-operation-in-64-bit) – phuclv Aug 02 '15 at 17:50
  • @Peter Cordes; sorry, that was a silly move in the joy of the instant, thanks for fixing my eclipse of the mind. [I'll remove that comment within 24h]. – fgrieu Jan 10 '21 at 10:14

5 Answers

5

Ok, how about this (not tested); arguments are in rcx, rdx and r8 per the Windows x64 calling convention:

modmul:
; rcx = a
; rdx = b
; r8 = n
mov rax, rdx
mul rcx
div r8
mov rax, rdx
ret

The precondition is that a * b / n <= ~0ULL, otherwise there will be a divide error. That's a slightly less strict condition than a < n && b < n: one of them can be bigger than n as long as the other is small enough.
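As a side note, that precondition can be checked ahead of time: the `div` faults exactly when the quotient doesn't fit in 64 bits, i.e. when the high 64 bits of the product reach n. A sketch (the helper name is mine; shown with GCC/clang's unsigned __int128 for illustration, since MSVC lacks it):

```c
#include <stdint.h>

// Returns nonzero when `div` would raise a divide error for (a*b) / n,
// i.e. when the quotient does not fit in 64 bits (or n == 0).
// a*b >= n * 2**64 holds exactly when the high half of the product is >= n.
static int div_would_fault(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t hi = (uint64_t)(((unsigned __int128)a * b) >> 64);
    return n == 0 || hi >= n;
}
```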

Unfortunately it has to be assembled and linked in separately, because MSVC doesn't support inline asm for 64-bit targets.

It's also still slow; the real problem is that 64-bit div, which can take nearly a hundred cycles (seriously: up to 90 cycles on Nehalem, for example).

harold
  • Might be worth mentioning that `div` has a 128-bit dividend, not 64-bit as your answer appears to indicate. – Ben Voigt Dec 13 '13 at 18:19
  • It might also be worth using inline assembler, so that the compiler may inline this trivial code and save on function call overhead. –  Dec 13 '13 at 18:47
  • @vlad the visual c++ x86_64 compiler doesn't support inline assembly, as the answer notes. – Ben Voigt Jan 09 '14 at 23:09
  • @BenVoigt: Then I am glad I don't have to use it :) –  Jan 12 '14 at 01:04
2

You could do it the old-fashioned way with shift/add/subtract. The code below assumes a < n and n < 2**63 (so things don't overflow):

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t rv = 0;
    while (b) {
        if (b&1)
            if ((rv += a) >= n) rv -= n;
        if ((a += a) >= n) a -= n;
        b >>= 1; }
    return rv;
}

You could use while (a && b) as the loop condition instead, to short-circuit things if it's likely that a will be a factor of n. It will be slightly slower (more comparisons, though likely correctly predicted branches) if a is not a factor of n.

If you really, absolutely, need that last bit (allowing n up to 2**64 - 1), you can use:

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t rv = 0;
    while (b) {
        if (b&1) {
            rv += a;
            if (rv < a || rv >= n) rv -= n; }
        uint64_t t = a;
        a += a;
        if (a < t || a >= n) a -= n;
        b >>= 1; }
    return rv;
}

Alternately, just use GCC inline assembly to access the underlying x64 instructions:

inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t rv;
    asm ("mul %3" : "=d"(rv), "=a"(a) : "1"(a), "r"(b));
    asm ("div %4" : "=d"(rv), "=a"(a) : "0"(rv), "1"(a), "r"(n));
    return rv;
}

The 64-bit div instruction is really slow, however, so the loop might actually be faster. You'd need to profile to be sure.

Chris Dodd
  • You've got overflow problems (they don't create undefined behavior for integral types, but they will mess up the comparison). You need to use the technique shown in my answer to circumvent that (and yes, I realized that the multiplication can be done in my same loop). – Ben Voigt Dec 14 '13 at 06:00
  • True, but fgrieu emphasized in the comments that the case when n >= 2**63 is the one of most concern. – Ben Voigt Dec 15 '13 at 15:54
  • @BenVoigt: ok, if you absolutely need that last bit, you can manually check for overflow, making the code just a little bit slower. – Chris Dodd Jan 09 '14 at 22:59
  • Could not `a += a;` mathematically overflow if `a >= 2**63`? Or is that compensated by the `if (a < t` part? – chux - Reinstate Monica Jun 16 '16 at 20:07
  • @chux: exactly -- overflow will wrap with unsigned numbers, so `a < t` implies that `a += a` overflowed, and the actual value for the sum is `a + 2**64`, but that is less than `2*n`, so after subtracting `n` we'll underflow (reversing the effect of the overflow) and be back in the valid range of a uint64_t. – Chris Dodd Jun 16 '16 at 20:20
  • Thanks. Idea: If one wanted to avoid overflow (maybe because using `int`), code could instead use `type_of_n nhalf = n/2; /* one time calculation */ ..... if (a >= nhalf) a -= n-a; else a += a;`. – chux - Reinstate Monica Jun 16 '16 at 20:36
  • if you use a signed type, you lose a bit to the sign, as well as losing predictable overflow behaviour. You still need an extra (hard to predict) branch to avoid the overflow, so it doesn't save you anything. – Chris Dodd Jun 16 '16 at 20:40
2

7 years later, I got a solution working in Visual Studio 2019

#include <stdint.h>
#include <intrin.h>
#pragma intrinsic(_umul128)
#pragma intrinsic(_udiv128)

// compute (a*b)%n with 128-bit intermediary result
// assumes n>0  and  a*b < n * 2**64 (always the case when a<=n || b<=n )
inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
  uint64_t r, s = _umul128(a, b, &r);
  (void)_udiv128(r, s, n, &r);
  return r;
}

// compute (a*b)%n with 128-bit intermediary result
// assumes n>0, works including if a*b >= n * 2**64
inline uint64_t mulmod1(uint64_t a, uint64_t b, uint64_t n) {
  uint64_t r, s = _umul128(a % n, b, &r);
  (void)_udiv128(r, s, n, &r);
  return r;
}
fgrieu
1

This intrinsic is named _umul128.

typedef unsigned long long BIG;

// handles only the "hard" case when high bit of n is set
BIG shl_mod( BIG v, BIG n, int by )
{
    if (v >= n) v -= n;      // one subtract suffices: v < 2n when n has its high bit set
    while (by--) {
        if (v >= n - v)
            v -= n - v;      // v = 2v - n, computed without overflow
        else
            v <<= 1;         // v = 2v, which stays below n here
    }
    return v;
}

Now you can use shl_mod(A, n, 64), where A is the high 64 bits of the 128-bit product.
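Putting the pieces together, per the comment thread: with A and B the high and low 64-bit halves of the product, (A * 2**64 + B) % n is shl_mod(A, n, 64) plus B reduced mod n. A sketch for the high-bit-of-n-set case (the helper names are mine; the 64x64 -> 128 multiply is done portably via 32-bit halves for illustration, standing in for _umul128):

```c
#include <stdint.h>

typedef unsigned long long BIG;

// Doubling loop from the answer (>= so that 2v == n reduces to 0).
static BIG shl_mod_sketch(BIG v, BIG n, int by) {
    if (v >= n) v -= n;              // one subtract suffices when n >= 2**63
    while (by--) {
        if (v >= n - v)
            v -= n - v;              // v = 2v - n, without overflow
        else
            v <<= 1;                 // v = 2v
    }
    return v;
}

// Portable 64x64 -> 128 multiply via 32-bit halves (stand-in for _umul128).
static void mul_64x64_sketch(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    *lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

// (a*b) % n for n with its high bit set: A*2**64 mod n via shl_mod,
// then add B mod n with the usual overflow-aware correction.
static uint64_t mulmod_highbit_sketch(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t A, B;
    mul_64x64_sketch(a, b, &A, &B);
    uint64_t r = shl_mod_sketch(A, n, 64);
    if (B >= n) B -= n;              // n >= 2**63, so one subtract is enough
    uint64_t s = r + B;
    if (s < r || s >= n) s -= n;     // wraparound implies the true sum was >= n
    return s;
}
```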

Ben Voigt
  • But unfortunately, this performs only the product. – fgrieu Dec 13 '13 at 17:23
  • That catches me every time... most MSDN links the version can be removed to make a link to latest, for this it can't. Now fixed. – Ben Voigt Dec 13 '13 at 17:23
  • @fgrieu: Correct, but the complaint in your question was about truncation of the product. This avoids that. Now you need to perform `(A * (2**64) + B) % n` (where `A` and `B` are the results of the multiply, not your original `a` and `b`), which simplifies to `((A % n) * (2**64 % n)) % n + B % n`... and there are well-known methods for this. Do you use the same `n` repeatedly? In that case, precompute `2**64 % n` – Ben Voigt Dec 13 '13 at 17:25
  • When both high and low bits of `n` are set, I know no fast method for `(A * (2**64) + B) % n` – fgrieu Dec 13 '13 at 17:31
  • Did not get your hint. When both high and low bits of `n` are set, `2**64 % n` is `0U-n`, and `2**32 % n` is `1ULL<<32`. Neither helps so much. Ah, or perhaps you suggest to pre-compute `2**128 / n` and do [Barrett reduction](http://en.wikipedia.org/wiki/Barrett_reduction)? That's a tad complex. – fgrieu Dec 13 '13 at 17:52
  • Right, the 2**32 was useless. I was going to suggest the `-n`, but I see you've already got that. – Ben Voigt Dec 13 '13 at 18:11
  • @fgrieu: Can you see anything wrong with the function I just added to my answer? – Ben Voigt Dec 13 '13 at 18:17
  • It looks to me that the code you posted correctly computes `(2**64 * v)%n` when the high bit of `n` is set, which indeed is the hardest case in the computation of `(A * (2**64) + B) % n`, in which we want to compute `shl_mod(A, n, 64)`. My only issue is with the execution time. That extends to other `n`, e.g. with the next-to-highest bit set. – fgrieu Dec 14 '13 at 05:56
  • It would be interesting to see optimized code generated for that by your favorite compiler. I think that `v > (n-v)` will be done by actually calculating `v - (n-v)` and checking the overflow bit... therefore not recalculating the difference on the next line. If the loop is unrolled, there could be only about 4 instructions per iteration, and these are all low strength operations, likely to retire several per cycle, so it could be only a few percent slower than the DIV instruction used by Harold's assembly solution. – Ben Voigt Dec 14 '13 at 05:59
  • [Clang-3.8 -O3 makes near-ideal code](https://godbolt.org/g/D1t3tp) (which as you suggest, branches on the result of a `sub`.) The serial dependencies limit loop throughput to one iteration per 5 clocks (on Intel Haswell), according to IACA. I only see 4 cycles of latency (since neither `mov` is on the critical path, but IACA thinks `mov rdx, rdi` is) , but even so, 4c * 64 is *far* worse than a `div r64` (32-96 cycles latency on Haswell, similar on Skylake. We should probably expect the time to be towards the upper end of that range if the quotient has most of its bits set.) – Peter Cordes Jun 15 '16 at 02:25
  • So if you're confident the `div` won't fault (e.g. quotient overflows a 64bit reg), hardware `div` is a much better bet. BTW, Broadwell/Skylake have 1c latency on `cmov`, but even 3c * 64 is not a win. – Peter Cordes Jun 15 '16 at 02:27
  • If you can interleave maybe four of these calculations in the same loop, then you can certainly come out ahead, because multiple dependency chains in parallel creates instruction-level parallelism. The out-of-order execution window isn't big enough to start working on a 2nd loop before the first loop is mostly done (this is limited more by schedule size, not ROB size), so you do need the instructions to be interleaved in the code. Hyperthreading will also help with this, but not with `div`, since both threads will compete for the single divider. – Peter Cordes Jun 15 '16 at 02:32
-1

Having no inline assembly kind of sucks. Anyway, the function call overhead is actually extremely small: parameters are passed in volatile registers and no cleanup is needed.

I don't have an assembler, and x64 targets don't support __asm, so I had no choice but to "assemble" my function from opcodes myself.

Obviously it depends on <windows.h> (for VirtualProtect). I'm using mpir (gmp) as a reference to show the function produces correct results.


#include "stdafx.h"
#include <windows.h>  // VirtualProtect, DWORD
#include <stdint.h>
#include <stdio.h>
#include <gmp.h>      // via mpir

// mulmod64(a, b, m) == (a * b) % m
typedef uint64_t(__cdecl *mulmod64_fnptr_t)(uint64_t a, uint64_t b, uint64_t m);

uint8_t mulmod64_opcodes[] = {
    0x48, 0x89, 0xC8, // mov rax, rcx
    0x48, 0xF7, 0xE2, // mul rdx
    0x4C, 0x89, 0xC1, // mov rcx, r8
    0x48, 0xF7, 0xF1, // div rcx
    0x48, 0x89, 0xD0, // mov rax,rdx
    0xC3              // ret
};

mulmod64_fnptr_t mulmod64_fnptr;

void init() {
    DWORD dwOldProtect;
    VirtualProtect(
        &mulmod64_opcodes,
        sizeof(mulmod64_opcodes),
        PAGE_EXECUTE_READWRITE,
        &dwOldProtect);
    // NOTE: reinterpret byte array as a function pointer
    mulmod64_fnptr = (mulmod64_fnptr_t)(void*)mulmod64_opcodes;
}

int main() {
    init();

    uint64_t a64 = 2139018971924123ull;
    uint64_t b64 = 1239485798578921ull;
    uint64_t m64 = 8975489368910167ull;

    // reference code
    mpz_t a, b, c, m, r;
    mpz_inits(a, b, c, m, r, NULL);
    mpz_set_ui(a, a64);
    mpz_set_ui(b, b64);
    mpz_set_ui(m, m64);
    mpz_mul(c, a, b);
    mpz_mod(r, c, m);

    gmp_printf("(%Zd * %Zd) mod %Zd = %Zd\n", a, b, m, r);

    // using mulmod64
    uint64_t r64 = mulmod64_fnptr(a64, b64, m64);
    printf("(%llu * %llu) mod %llu = %llu\n", a64, b64, m64, r64);
    return 0;
}

yonil
  • That's ridiculous; just write the function in a `.asm` source file that you assemble and link normally. Then you don't have to call it through a non-const function pointer (which increases the overhead), or change memory protection at run-time. – Peter Cordes Jun 15 '16 at 01:41
  • You could actually make your function pointer `const` and initialize it statically with a cast. The linker should be able to handle that relocation. However, I don't see a way around doing a `VirtualProtect` call at run-time. This is unavoidable if you're going to include machine-code in an array. (which should be `const`...) **There's just no benefit to this approach, because it's easy to link asm sources with C sources.** – Peter Cordes Jun 15 '16 at 01:44