18

Is there a way to build e.g. (853467 * 21660421200929) % 100000000000007 without BigInteger libraries (note that each number fits into a 64 bit integer but the multiplication result does not)?

This solution seems inefficient:

int64_t mulmod(int64_t a, int64_t b, int64_t m) {
    if (b < a)
        std::swap(a, b);
    int64_t res = 0;
    for (int64_t i = 0; i < a; i++) {
        res += b;
        res %= m;
    }
    return res;
}
Christian Ammer
  • 7,464
  • 6
  • 51
  • 108
  • For one thing, I'd recommend getting rid of the Microsoft extensions and using `int64_t`. – chris Aug 28 '12 at 22:29
  • It looks like in this case you could cheat because you don't care about anything greater than parameter `__int64 m` (or `uint64_t` for those that favor it) hence you could deal only with 64-bit types. – MartyE Aug 28 '12 at 22:29
  • 1
    Have you read about the [Montgomery reduction](http://en.wikipedia.org/wiki/Montgomery_reduction) algorithm? – ildjarn Aug 28 '12 at 22:31
  • @ildjarn: No, didn't know about it, thanks for the link! – Christian Ammer Aug 28 '12 at 22:34
  • @ildjarn, Montgomery reduction is unlikely to be a win in this case, since the product needs to be converted to Montgomery representation and back. – Brett Hale Aug 28 '12 at 22:52
  • @Brett : I make no claims that it's the best approach, it's just the only one whose name I could remember off the top of my head. :-P – ildjarn Aug 28 '12 at 22:55
  • 2
    Funny, this would be trivial in x64 assembly. – harold Aug 28 '12 at 23:01
  • GCC, Clang, and Intel ICC support `__int128`. I thought I remembered a similar name for MSVC, but struggle to find it now. – Mooing Duck May 03 '21 at 20:09
  • VS has [`_mul128`](https://learn.microsoft.com/en-us/cpp/intrinsics/mul128?view=msvc-160) and [`_div128`](https://learn.microsoft.com/en-us/cpp/intrinsics/div128?view=msvc-160) I guess. Also [these](https://learn.microsoft.com/en-us/windows/win32/winprog/large-integer-functions) – Mooing Duck May 03 '21 at 20:11

7 Answers7

25

You should use Russian Peasant multiplication. It uses repeated doubling to compute all the values (b*2^i)%m, and adds them in if the ith bit of a is set.

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    int64_t res = 0;
    while (a != 0) {
        if (a & 1) res = (res + b) % m;
        a >>= 1;
        b = (b << 1) % m;
    }
    return res;
}

It improves upon your algorithm because it takes O(log(a)) time, not O(a) time.

Caveats: unsigned, and works only if m is 63 bits or less.

Christian Ammer
  • 7,464
  • 6
  • 51
  • 108
Keith Randall
  • 22,985
  • 2
  • 35
  • 54
24

Keith Randall's answer is good, but as he said, a caveat is that it works only if m is 63 bits or less.

Here is a modification which has two advantages:

  1. It works even if m is 64 bits.
  2. It doesn't need to use the modulo operation, which can be expensive on some processors.

(Note that the res -= m and temp_b -= m lines rely on 64-bit unsigned integer overflow in order to give the expected results. This should be fine since unsigned integer overflow is well-defined in C and C++. For this reason it's important to use unsigned integer types.)

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t res = 0;
    uint64_t temp_b;

    /* Only needed if b may be >= m */
    if (b >= m) {
        if (m > UINT64_MAX / 2u)
            b -= m;
        else
            b %= m;
    }

    while (a != 0) {
        if (a & 1) {
            /* Add b to res, modulo m, without overflow */
            if (b >= m - res) /* Equiv to if (res + b >= m), without overflow */
                res -= m;
            res += b;
        }
        a >>= 1;

        /* Double b, modulo m */
        temp_b = b;
        if (b >= m - b)       /* Equiv to if (2 * b >= m), without overflow */
            temp_b -= m;
        b += temp_b;
    }
    return res;
}
Community
  • 1
  • 1
Craig McQueen
  • 41,871
  • 30
  • 130
  • 181
  • 4
    I like this, since it handles full 64 bit values. If you test whether `b < a` at the top, and if so, swap a and b, it can significantly speed up the time since it's more likely the while loop can exit early. – hatchet - done with SOverflow Dec 21 '15 at 16:04
  • Why can't you do `b %= m` if `m>UINT64_MAX / 2u`? Does modulo operation magically becomes unstable? – Adrian Feb 22 '19 at 17:52
  • You can definitely do `b %= m`. But, modulo operations can be slow (depending on the processor) so are worth avoiding if possible. So, `if (m > UINT64_MAX / 2u) b -= m;` is a possible optimisation to avoid a modulo operation in the case that `m` is large, so the modulo can be simplified down to a simple subtraction. – Craig McQueen Feb 23 '19 at 20:12
  • Might be a tad late with this comment, but the `res += b` and `b += temp_b` do also overflow, which is not mentioned in your answer, even though the `-=` operations are specifically mentioned. Not from a C++ background, so I'm not sure if this is standard behaviour, but maybe add that in your answer? – Bram Sep 15 '20 at 21:40
5

Both methods work for me. The first one is the same as yours, but I changed your numbers to excplicit ULL. Second one uses assembler notation, which should work faster. There are also algorithms used in cryptography (RSA and RSA based cryptography mostly I guess), like already mentioned Montgomery reduction as well, but I think it will take time to implement them.

#include <algorithm>
#include <iostream>

__uint64_t mulmod1(__uint64_t a, __uint64_t b, __uint64_t m) {
  if (b < a)
    std::swap(a, b);
  __uint64_t res = 0;
  for (__uint64_t i = 0; i < a; i++) {
    res += b;
    res %= m;
  }
  return res;
}

__uint64_t mulmod2(__uint64_t a, __uint64_t b, __uint64_t m) {
  __uint64_t r;
  __asm__
  ( "mulq %2\n\t"
      "divq %3"
      : "=&d" (r), "+%a" (a)
      : "rm" (b), "rm" (m)
      : "cc"
  );
  return r;
}

int main() {
  using namespace std;
  __uint64_t a = 853467ULL;
  __uint64_t b = 21660421200929ULL;
  __uint64_t c = 100000000000007ULL;

  cout << mulmod1(a, b, c) << endl;
  cout << mulmod2(a, b, c) << endl;
  return 0;
}
Benjamin
  • 3,217
  • 2
  • 27
  • 42
  • I don't know about inline assembler, does it also use a loop? – Christian Ammer Aug 28 '12 at 22:55
  • 1
    @ChristianAmmer no and it doesn't need one. It uses a double-width multiplication and division is always double-width. It's only in high level languages that the high part of the multiplication suddenly gets lost. – harold Aug 28 '12 at 23:04
  • 1
    This is OK for the example, but if `(a * b > (2^64 - 1) * c)` it will fail. But I assume the OP means that the implied quotient is a 64-bit value as well. – Brett Hale Aug 28 '12 at 23:21
  • About loop: I don't know how assembler calculates the multiplication,I mean under the hoos, but on C++ there is no need for a loop, because due to %uint64 we know, that the result is 64bits at most. @Brett 3 numers are 64bit. – Benjamin Aug 28 '12 at 23:28
  • @Benjamin - I'm just pointing out that the assembly implementation isn't general. Try: `a=8534670000000000000`, `b=216604212009290`. Both are 64-bit, but `divq` will result in an exception. – Brett Hale Aug 28 '12 at 23:38
  • Sure it isn't general. Try a=b=2^64-2 c=2^64-1 in every of the cases above (I mean other users answers). In this case assembler is the only hope, but I cannot do it much better. Anyway, in your example some pre-counting would work (dividing(%) a and b before multiplication would do the job) – Benjamin Aug 28 '12 at 23:48
  • `mulmod1()` is O(a) which could be unusably slow (the algorithm in the question). `mulmod2()` is not cross-platform. It's fine to provide an assembly function for optimisation, but it's always good to have a reasonably well-performing cross-platform (i.e. pure C/C++) fallback. – Craig McQueen Jan 12 '14 at 23:56
  • Minor: "changed your numbers to excplicit ULL", Hmm, looks like numbers are 64-bit integer instead. – chux - Reinstate Monica Nov 13 '14 at 00:55
5

An improvement to the repeating doubling algorithm is to check how many bits at once can be calculated without an overflow. An early exit check can be done for both arguments -- speeding up the (unlikely?) event of N not being prime.

e.g. 100000000000007 == 0x00005af3107a4007, which allows 16 (or 17) bits to be calculated per each iteration. The actual number of iterations will be 3 with the example.

// just a conceptual routine
int get_leading_zeroes(uint64_t n)
{
   int a=0;
   while ((n & 0x8000000000000000) == 0) { a++; n<<=1; }
   return a;
}

uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
     uint64_t result = 0;
     int N = get_leading_zeroes(n);
     uint64_t mask = (1<<N) - 1;
     a %= n;
     b %= n;  // Make sure all values are originally in the proper range?
     // n is not necessarily a prime -- so both a & b can end up being zero
     while (a>0 && b>0)
     {
         result = (result + (b & mask) * a) % n;  // no overflow
         b>>=N;
         a = (a << N) % n;
     }
     return result;
}
Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
3

You could try something that breaks the multiplication up into additions:

// compute (a * b) % m:

unsigned int multmod(unsigned int a, unsigned int b, unsigned int m)
{
    unsigned int result = 0;

    a %= m;
    b %= m;

    while (b)
    {
        if (b % 2 != 0)
        {
            result = (result + a) % m;
        }

        a = (a * 2) % m;
        b /= 2;
    }

    return result;
}
Kerrek SB
  • 464,522
  • 92
  • 875
  • 1,084
  • +1 for a working solution, I have to think about it to fully understand, but it works because `(a * b) == (a * 2) * (b / 2)`, right? – Christian Ammer Aug 28 '12 at 22:49
  • 1
    It will actually fail for some inputs. `a * 2` can overflow and be reduced incorrectly if `m` is bigger than `1 << 63` (or `1 << 31` if ints are 32 bit). – harold Aug 28 '12 at 23:12
  • You can in fact reduce by `(~0ULL/m)` in each step. Eg. for 100000000000007, you can use 131072 (`1<<17`) instead of 2. That also explains harold's comment; for such large `m` the stepsize becomes 1 and you make no progress. – MSalters Aug 28 '12 at 23:27
  • @harold: You're right: The first factor must not have its top bit set. I believe that's the only limitation on the function argument values of the present algorithm, though. – Kerrek SB Aug 29 '12 at 07:30
2

a * b % m equals a * b - (a * b / m) * m

Use floating point arithmetic to approximate a * b / m. The approximation leaves a value small enough for normal 64 bit integer operations, for m up to 63 bits.

This method is limited by the significand of a double, which is usually 52 bits.

uint64_t mod_mul_52(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t c = (double)a * b / m - 1;
    uint64_t d = a * b - c * m;

    return d % m;
}

This method is limited by the significand of a long double, which is usually 64 bits or larger. The integer arithmetic is limited to 63 bits.

uint64_t mod_mul_63(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t c = (long double)a * b / m - 1;
    uint64_t d = a * b - c * m;

    return d % m;
}

These methods require that a and b be less than m. To handle arbitrary a and b, add these lines before c is computed.

a = a % m;
b = b % m;

In both methods, the final % operation could be made conditional.

return d >= m ? d % m : d;
drawnonward
  • 53,459
  • 16
  • 107
  • 112
1

I can suggest an improvement for your algorithm.

You actually calculate a * b iteratively by adding each time b, doing modulo after each iteration. It's better to add each time b * x, whereas x is determined so that b * x won't overflow.

int64_t mulmod(int64_t a, int64_t b, int64_t m)
{
    a %= m;
    b %= m;

    int64_t x = 1;
    int64_t bx = b;

    while (x < a)
    {
        int64_t bb = bx * 2;
        if (bb <= bx)
            break; // overflow

        x *= 2;
        bx = bb;
    }

    int64_t ans = 0;

    for (; x < a; a -= x)
        ans = (ans + bx) % m;

    return (ans + a*b) % m;
}
valdo
  • 12,632
  • 2
  • 37
  • 67
  • 1
    Can't you use `x=(1<<63-m)/b` ? That's rounded down, so `b*x <= 1<<63 - m`, and it doesn't take a loop to calculate that. Doesn't change big-O, as the number of for-loop iterations decreases by <50%. – MSalters Aug 28 '12 at 23:39
  • @MSalters: doesn't that introduce a division, which is potentially expensive? – Craig McQueen Nov 06 '13 at 00:17
  • @CraigMcQueen: Yes, but only one, and there's already a modulo in a loop. – MSalters Nov 06 '13 at 01:04
  • That's not improvement -- I don't know the O() for the last loop, but taking just 3 "random" numbers the loop ran over 300000 iterations. – Aki Suihkonen Feb 20 '14 at 07:48