Ways to do modulo multiplication with primitive types

Question

Is there a way to build e.g. (853467 * 21660421200929) % 100000000000007 without BigInteger libraries (note that each number fits into a 64 bit integer but the multiplication result does not)?

This solution seems inefficient:

int64_t mulmod(int64_t a, int64_t b, int64_t m) {
    if (b < a)
        std::swap(a, b);
    int64_t res = 0;
    for (int64_t i = 0; i < a; i++) {
        res += b;
        res %= m;
    }
    return res;
}

For one thing, I'd recommend getting rid of the Microsoft extensions and using `int64_t`. — chris, Aug 28 '12 at 22:29
It looks like in this case you could cheat because you don't care about anything greater than parameter `__int64 m` (or `uint64_t` for those that favor it) hence you could deal only with 64-bit types. — MartyE, Aug 28 '12 at 22:29
Have you read about the [Montgomery reduction](http://en.wikipedia.org/wiki/Montgomery_reduction) algorithm? — ildjarn, Aug 28 '12 at 22:31
@ildjarn, Montgomery reduction is unlikely to be a win in this case, since the product needs to be converted to Montgomery representation and back. — Brett Hale, Aug 28 '12 at 22:52
@Brett : I make no claims that it's the best approach, it's just the only one whose name I could remember off the top of my head. :-P — ildjarn, Aug 28 '12 at 22:55
GCC, Clang, and Intel ICC support `__int128`. I thought I remembered a similar name for MSVC, but struggle to find it now. — Mooing Duck, May 03 '21 at 20:09
VS has [`_mul128`](https://learn.microsoft.com/en-us/cpp/intrinsics/mul128?view=msvc-160) and [`_div128`](https://learn.microsoft.com/en-us/cpp/intrinsics/div128?view=msvc-160) I guess. Also [these](https://learn.microsoft.com/en-us/windows/win32/winprog/large-integer-functions) — Mooing Duck, May 03 '21 at 20:11

score 25 · Accepted Answer · edited Aug 29 '12 at 19:50

25

You should use Russian Peasant multiplication. It uses repeated doubling to compute all the values (b*2^i)%m, and adds them in if the ith bit of a is set.

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    int64_t res = 0;
    while (a != 0) {
        if (a & 1) res = (res + b) % m;
        a >>= 1;
        b = (b << 1) % m;
    }
    return res;
}

It improves upon your algorithm because it takes O(log(a)) time, not O(a) time.

Caveats: unsigned, and works only if m is 63 bits or less.

edited Aug 29 '12 at 19:50

Christian Ammer

7,464
6
51
108

answered Aug 29 '12 at 04:42

Keith Randall

22,985
2
35
54

Should `res` be declared `uint64_t`? – Thomas Weller Oct 18 '21 at 20:15

score 24 · Answer 2 · edited May 23 '17 at 12:25

24

Keith Randall's answer is good, but as he said, a caveat is that it works only if m is 63 bits or less.

Here is a modification which has two advantages:

It works even if m is 64 bits.
It doesn't need to use the modulo operation, which can be expensive on some processors.

(Note that the res -= m and temp_b -= m lines rely on 64-bit unsigned integer overflow in order to give the expected results. This should be fine since unsigned integer overflow is well-defined in C and C++. For this reason it's important to use unsigned integer types.)

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t res = 0;
    uint64_t temp_b;

    /* Only needed if b may be >= m */
    if (b >= m) {
        if (m > UINT64_MAX / 2u)
            b -= m;
        else
            b %= m;
    }

    while (a != 0) {
        if (a & 1) {
            /* Add b to res, modulo m, without overflow */
            if (b >= m - res) /* Equiv to if (res + b >= m), without overflow */
                res -= m;
            res += b;
        }
        a >>= 1;

        /* Double b, modulo m */
        temp_b = b;
        if (b >= m - b)       /* Equiv to if (2 * b >= m), without overflow */
            temp_b -= m;
        b += temp_b;
    }
    return res;
}

edited May 23 '17 at 12:25

Community

1
1

answered Sep 08 '13 at 04:02

Craig McQueen

41,871
30
130
181

4

I like this, since it handles full 64 bit values. If you test whether `b < a` at the top, and if so, swap a and b, it can significantly speed up the time since it's more likely the while loop can exit early. – hatchet - done with SOverflow Dec 21 '15 at 16:04
Why can't you do `b %= m` if `m>UINT64_MAX / 2u`? Does modulo operation magically becomes unstable? – Adrian Feb 22 '19 at 17:52
You can definitely do `b %= m`. But, modulo operations can be slow (depending on the processor) so are worth avoiding if possible. So, `if (m > UINT64_MAX / 2u) b -= m;` is a possible optimisation to avoid a modulo operation in the case that `m` is large, so the modulo can be simplified down to a simple subtraction. – Craig McQueen Feb 23 '19 at 20:12
Might be a tad late with this comment, but the `res += b` and `b += temp_b` do also overflow, which is not mentioned in your answer, even though the `-=` operations are specifically mentioned. Not from a C++ background, so I'm not sure if this is standard behaviour, but maybe add that in your answer? – Bram Sep 15 '20 at 21:40

Benjamin · Answer 3 · 2012-08-28T22:53:43.623

5

Both methods work for me. The first one is the same as yours, but I changed your numbers to excplicit ULL. Second one uses assembler notation, which should work faster. There are also algorithms used in cryptography (RSA and RSA based cryptography mostly I guess), like already mentioned Montgomery reduction as well, but I think it will take time to implement them.

#include <algorithm>
#include <iostream>

__uint64_t mulmod1(__uint64_t a, __uint64_t b, __uint64_t m) {
  if (b < a)
    std::swap(a, b);
  __uint64_t res = 0;
  for (__uint64_t i = 0; i < a; i++) {
    res += b;
    res %= m;
  }
  return res;
}

__uint64_t mulmod2(__uint64_t a, __uint64_t b, __uint64_t m) {
  __uint64_t r;
  __asm__
  ( "mulq %2\n\t"
      "divq %3"
      : "=&d" (r), "+%a" (a)
      : "rm" (b), "rm" (m)
      : "cc"
  );
  return r;
}

int main() {
  using namespace std;
  __uint64_t a = 853467ULL;
  __uint64_t b = 21660421200929ULL;
  __uint64_t c = 100000000000007ULL;

  cout << mulmod1(a, b, c) << endl;
  cout << mulmod2(a, b, c) << endl;
  return 0;
}

edited Aug 28 '12 at 22:53

answered Aug 28 '12 at 22:45

Benjamin

3,217
2
27
42

I don't know about inline assembler, does it also use a loop? – Christian Ammer Aug 28 '12 at 22:55
1

@ChristianAmmer no and it doesn't need one. It uses a double-width multiplication and division is always double-width. It's only in high level languages that the high part of the multiplication suddenly gets lost. – harold Aug 28 '12 at 23:04
1

This is OK for the example, but if `(a * b > (2^64 - 1) * c)` it will fail. But I assume the OP means that the implied quotient is a 64-bit value as well. – Brett Hale Aug 28 '12 at 23:21
About loop: I don't know how assembler calculates the multiplication,I mean under the hoos, but on C++ there is no need for a loop, because due to %uint64 we know, that the result is 64bits at most. @Brett 3 numers are 64bit. – Benjamin Aug 28 '12 at 23:28
@Benjamin - I'm just pointing out that the assembly implementation isn't general. Try: `a=8534670000000000000`, `b=216604212009290`. Both are 64-bit, but `divq` will result in an exception. – Brett Hale Aug 28 '12 at 23:38
Sure it isn't general. Try a=b=2^64-2 c=2^64-1 in every of the cases above (I mean other users answers). In this case assembler is the only hope, but I cannot do it much better. Anyway, in your example some pre-counting would work (dividing(%) a and b before multiplication would do the job) – Benjamin Aug 28 '12 at 23:48
`mulmod1()` is O(a) which could be unusably slow (the algorithm in the question). `mulmod2()` is not cross-platform. It's fine to provide an assembly function for optimisation, but it's always good to have a reasonably well-performing cross-platform (i.e. pure C/C++) fallback. – Craig McQueen Jan 12 '14 at 23:56
Minor: "changed your numbers to excplicit ULL", Hmm, looks like numbers are 64-bit integer instead. – chux - Reinstate Monica Nov 13 '14 at 00:55

Aki Suihkonen · Answer 4 · 2014-02-20T12:38:05.297

An improvement to the repeating doubling algorithm is to check how many bits at once can be calculated without an overflow. An early exit check can be done for both arguments -- speeding up the (unlikely?) event of N not being prime.

e.g. 100000000000007 == 0x00005af3107a4007, which allows 16 (or 17) bits to be calculated per each iteration. The actual number of iterations will be 3 with the example.

// just a conceptual routine
int get_leading_zeroes(uint64_t n)
{
   int a=0;
   while ((n & 0x8000000000000000) == 0) { a++; n<<=1; }
   return a;
}

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
     uint64_t result = 0;
     int N = get_leading_zeroes(n);
     uint64_t mask = (1<<N) - 1;
     a %= n;
     b %= n;  // Make sure all values are originally in the proper range?
     // n is not necessarily a prime -- so both a & b can end up being zero
     while (a>0 && b>0)
     {
         result = (result + (b & mask) * a) % n;  // no overflow
         b>>=N;
         a = (a << N) % n;
     }
     return result;
}

+1, Nice speed improvement, but would recommend a check for `n==0` to prevent an infinite loop in `get_leading_zeroes()`. — chux - Reinstate Monica, Nov 13 '14 at 00:58

score 3 · Answer 5 · answered Aug 28 '12 at 22:30

3

You could try something that breaks the multiplication up into additions:

// compute (a * b) % m:

unsigned int multmod(unsigned int a, unsigned int b, unsigned int m)
{
    unsigned int result = 0;

    a %= m;
    b %= m;

    while (b)
    {
        if (b % 2 != 0)
        {
            result = (result + a) % m;
        }

        a = (a * 2) % m;
        b /= 2;
    }

    return result;
}

answered Aug 28 '12 at 22:30

Kerrek SB

464,522
92
875
1,084

+1 for a working solution, I have to think about it to fully understand, but it works because `(a * b) == (a * 2) * (b / 2)`, right? – Christian Ammer Aug 28 '12 at 22:49
1

It will actually fail for some inputs. `a * 2` can overflow and be reduced incorrectly if `m` is bigger than `1 << 63` (or `1 << 31` if ints are 32 bit). – harold Aug 28 '12 at 23:12
You can in fact reduce by `(~0ULL/m)` in each step. Eg. for 100000000000007, you can use 131072 (`1<<17`) instead of 2. That also explains harold's comment; for such large `m` the stepsize becomes 1 and you make no progress. – MSalters Aug 28 '12 at 23:27
@harold: You're right: The first factor must not have its top bit set. I believe that's the only limitation on the function argument values of the present algorithm, though. – Kerrek SB Aug 29 '12 at 07:30

score 2 · Answer 6 · answered May 23 '18 at 04:15

a * b % m equals a * b - (a * b / m) * m

Use floating point arithmetic to approximate a * b / m. The approximation leaves a value small enough for normal 64 bit integer operations, for m up to 63 bits.

This method is limited by the significand of a double, which is usually 52 bits.

uint64_t mod_mul_52(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t c = (double)a * b / m - 1;
    uint64_t d = a * b - c * m;

    return d % m;
}

This method is limited by the significand of a long double, which is usually 64 bits or larger. The integer arithmetic is limited to 63 bits.

uint64_t mod_mul_63(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t c = (long double)a * b / m - 1;
    uint64_t d = a * b - c * m;

    return d % m;
}

These methods require that a and b be less than m. To handle arbitrary a and b, add these lines before c is computed.

a = a % m;
b = b % m;

In both methods, the final % operation could be made conditional.

return d >= m ? d % m : d;

score 1 · Answer 7 · answered Aug 28 '12 at 22:46

1

I can suggest an improvement for your algorithm.

You actually calculate a * b iteratively by adding each time b, doing modulo after each iteration. It's better to add each time b * x, whereas x is determined so that b * x won't overflow.

int64_t mulmod(int64_t a, int64_t b, int64_t m)
{
    a %= m;
    b %= m;

    int64_t x = 1;
    int64_t bx = b;

    while (x < a)
    {
        int64_t bb = bx * 2;
        if (bb <= bx)
            break; // overflow

        x *= 2;
        bx = bb;
    }

    int64_t ans = 0;

    for (; x < a; a -= x)
        ans = (ans + bx) % m;

    return (ans + a*b) % m;
}

answered Aug 28 '12 at 22:46

valdo

12,632
2
37
67

1

Can't you use `x=(1<<63-m)/b` ? That's rounded down, so `b*x <= 1<<63 - m`, and it doesn't take a loop to calculate that. Doesn't change big-O, as the number of for-loop iterations decreases by <50%. – MSalters Aug 28 '12 at 23:39
@MSalters: doesn't that introduce a division, which is potentially expensive? – Craig McQueen Nov 06 '13 at 00:17
@CraigMcQueen: Yes, but only one, and there's already a modulo in a loop. – MSalters Nov 06 '13 at 01:04
That's not improvement -- I don't know the O() for the last loop, but taking just 3 "random" numbers the loop ran over 300000 iterations. – Aki Suihkonen Feb 20 '14 at 07:48

Ways to do modulo multiplication with primitive types

7 Answers7

Linked