How to calculate (A*B)%C?

Question

Can someone help me how to calculate (A*B)%C, where 1<=A,B,C<=10^18 in C++, without big-num, just a mathematical approach.

I swear this is a duplicate of a bunch of things. They're tricky to find though... — Mysticial, Jun 24 '13 at 19:11
i was about to try to answer and then i saw that @Mysticial was here, lol i'm out — Stephan, Jun 24 '13 at 19:13
http://stackoverflow.com/questions/10076011/overflow-aa-mod-n, http://stackoverflow.com/questions/14857702/specific-modular-multiplication-algorithm, http://stackoverflow.com/questions/14858476/multiply-two-overflowing-integers-modulo-a-third, I'm having trouble finding the one I'm looking for... :( — Mysticial, Jun 24 '13 at 19:14
Well i'm sure you can try storing the numbers in arrays as binary and then working with it as if with shift operators. — Alexandru Barbarosie, Jun 24 '13 at 19:15
Yes, it is possible, but there is no built-in support for this in the C++ language. You have to build it yourself. — John Dibling, Jun 24 '13 at 19:15
@Mysticial Isn't [this](http://stackoverflow.com/a/14859713/256138) what you're looking for? — rubenvb, Jun 24 '13 at 19:16
@devnull Whether this is homework and the OP wants to cheat is besides the point. IMO, this question is pretty well written. It's clear, concise, and useful to others. (even if it may be a dupe - which in that case, we find that dupe and close it appropriately) — Mysticial, Jun 24 '13 at 19:26
@Mystical but it doesn't show any effort on part of the OP, does it? — devnull, Jun 24 '13 at 19:29
@devnull The goal of SO is to be an archive of QAs that are helpful to others. If the Q happens to lack effort, then so be it if it manages to be helpful. This question is different from the "usual" lack of effort questions because it isn't too localized (it's even got a gazillion dupes). Just run down the list of the top questions on SO. Many of them show just as little effort. But they have thousands of votes because they are helpful - IOW accomplishing the goal of SO. — Mysticial, Jun 24 '13 at 19:34

Ben Voigt · Accepted Answer · 2013-06-24T21:26:06.513

6

Off the top of my head (not extensively tested)

typedef unsigned long long BIG;
BIG mod_multiply( BIG A, BIG B, BIG C )
{
    BIG mod_product = 0;
    A %= C;

    while (A) {
        B %= C;
        if (A & 1) mod_product = (mod_product + B) % C;
        A >>= 1;
        B <<= 1;
    }

    return mod_product;
}

This has complexity O(log A) iterations. You can probably replace most of the % with a conditional subtraction, for a bit more performance.

typedef unsigned long long BIG;
BIG mod_multiply( BIG A, BIG B, BIG C )
{
    BIG mod_product = 0;
    // A %= C; may or may not help performance
    B %= C;

    while (A) {
        if (A & 1) {
            mod_product += B;
            if (mod_product > C) mod_product -= C;
        }
        A >>= 1;
        B <<= 1;
        if (B > C) B -= C;
    }

    return mod_product;
}

This version has only one long integer modulo -- it may even be faster than the large-chunk method, depending on how your processor implements integer modulo.

Live demo: https://ideone.com/1pTldb -- same result as Yakk's.

edited Jun 24 '13 at 21:26

answered Jun 24 '13 at 19:47

Ben Voigt

277,958
43
419
720

1

`if (mod_product > C) mod_product -= C;` won't make it faster I'd bet -- replacing `%` with a branch is not a win: `BIG tmp[]={mod_product, mod_product-C}; mod_product = tmp[mod_product>=C];` is equivalent, but under most modern processors/compilers, be much faster. Similarly killing the `A&1` branch would be good. – Yakk - Adam Nevraumont Jun 24 '13 at 20:56
@Yakk: It entirely depends on the processor. Some have a conditional mov instruction which is extremely efficient. On others, your array lookup could be better. But 64-bit `%` is ridiculously slow on virtually all systems, in many cases slower than a pipeline flush. On processors where an array lookup is faster, don't you think the optimizer would already know about that? – Ben Voigt Jun 24 '13 at 21:06
Furthermore, if the optimizer isn't up to the task of eliminating branches, I have other tricks I'd use before a lookup table. – Ben Voigt Jun 24 '13 at 21:09
I haven't met many compilers that will eliminate branches like that at all well... I didn't think of the slowness of `%` on 64 bit. – Yakk - Adam Nevraumont Jun 24 '13 at 21:10
@Yakk: Integer division and modulus are slow instructions. Even on processors with uniform cycle count for every instruction, these are usually an exception. – Ben Voigt Jun 24 '13 at 21:12
Here's what Intel says about `%`: "The latency and throughput of IDIV in Enhanced Intel Core microarchitecture varies with operand sizes and with the number of significant digits of the quotient of the division. If the quotient is zero, the minimum latency can be 13 cycles, and the minimum throughput can be 5 cycles. Latency and throughput of IDIV increases with the number of significant digit of the quotient. The latency and throughput of IDIV with 64-bit operand are significantly slower than those with 32-bit operand. Latency of DIV is similar to IDIV." – Ben Voigt Jun 24 '13 at 21:21

score 0 · Answer 2 · edited Sep 07 '20 at 12:56

An implementation of this stack overflow answer prior:

#include <stdint.h>
#include <tuple>
#include <iostream>

typedef std::tuple< uint32_t, uint32_t > split_t;
split_t split( uint64_t a )
{
  static const uint32_t mask = -1;
  auto retval = std::make_tuple( mask&a, ( a >> 32 ) );
  // std::cout << "(" << std::get<0>(retval) << "," << std::get<1>(retval) << ")\n";
  return retval;
}

typedef std::tuple< uint64_t, uint64_t, uint64_t, uint64_t > cross_t;
template<typename Lambda>
cross_t cross( split_t lhs, split_t rhs, Lambda&& op )
{
  return std::make_tuple( 
    op(std::get<0>(lhs), std::get<0>(rhs)),
    op(std::get<1>(lhs), std::get<0>(rhs)),
    op(std::get<0>(lhs), std::get<1>(rhs)),
    op(std::get<1>(lhs), std::get<1>(rhs))
  );
}

// c must have high bit unset:
uint64_t a_times_2_k_mod_c( uint64_t a, unsigned k, uint64_t c )
{
  a %= c;
  for (unsigned i = 0; i < k; ++i)
  {
    a <<= 1;
    a %= c;
  }
  return a;
}

// c must have about 2 high bits unset:
uint64_t a_times_b_mod_c( uint64_t a, uint64_t b, uint64_t c )
{
  // ensure a and b are < c:
  a %= c;
  b %= c;
  
  auto Z = cross( split(a), split(b), [](uint32_t lhs, uint32_t rhs)->uint64_t {
    return (uint64_t)lhs * (uint64_t)rhs;
  } );
  
  uint64_t to_the_0;
  uint64_t to_the_32_a;
  uint64_t to_the_32_b;
  uint64_t to_the_64;
  std::tie( to_the_0, to_the_32_a, to_the_32_b, to_the_64 ) = Z;
  
  // std::cout << to_the_0 << "+ 2^32 *(" << to_the_32_a << "+" << to_the_32_b << ") + 2^64 * " << to_the_64 << "\n";
  
  // this line is the one that requires 2 high bits in c to be clear
  // if you just add 2 of them then do a %c, then add the third and do
  // a %c, you can relax the requirement to "one high bit must be unset":
  return
    (to_the_0
    + a_times_2_k_mod_c(to_the_32_a+to_the_32_b, 32, c) // + will not overflow!
    + a_times_2_k_mod_c(to_the_64, 64, c) )
  %c;
}

int main()
{
  uint64_t retval = a_times_b_mod_c( 19010000000000000000, 1011000000000000, 1231231231231211 );
  std::cout << retval << "\n";
}

The idea here is to split your 64-bit integer into a pair of 32-bit integers, which are safe to multiply in 64-bit land.

We express a*b as (a_high * 2^32 + a_low) * (b_high * 2^32 + b_low), do the 4-fold multiplication (keeping track of the 2³² factors without storing them in our bits), then note that doing a * 2^k % c can be done via a series of k repeats of this pattern: ((a*2 %c) *2%c).... So we can take this 3 to 4 element polynomial of 64-bit integers in 2³² and reduce it without having to worry about things.

The expensive part is the a_times_2_k_mod_c function (the only loop).

You can make it go many times faster if you know that c has more than one high bit clear.

You could instead replace the a %= c with subtraction a -= (a>=c)*c;

Doing both isn't all that practical.

Live example

Also, your single call to `a_times_2_k_mod_c(to_the_64, 64, c)` does pretty much the same work as my entire function. I fail to see any advantage to your splitting approach... or pulling in a bunch of newfangled C++ templates to do it. — Ben Voigt, Jun 24 '13 at 21:25
@BenVoigt It trades subtractions/additions for 4 multiplications, unless I counted wrong? Quite possibly I did. Yours is much cleaner -- I just ported someone's post to an earlier question to an implementation, so it is pretty crufty. — Yakk - Adam Nevraumont, Jun 24 '13 at 21:47
Yes but any benefit you might have had you lose by running the shift loop 96 times instead of (at most) 64. Perhaps `a_times_2_k_mod_c(to_the_32_a+to_the_32_b+a_times_2_k_mod_c(to_the_64, 32, c), 32, c)` would help. — Ben Voigt, Jun 24 '13 at 22:07

How to calculate (A*B)%C?

2 Answers2

Linked