How to perform modular multiplication (compute the square) without overflow:
As long as the modulus is at least one bit less than the maximum, the solution is to split the numbers in low-bit and high-bit halves, then perform the arithmetic piecemeal, kind of like grade-school multiplication where you multiply by the one’s digit, then shift the sum and multiply by the ten’s digit, and so on, except that the “digits” are the size of the square root of the maximum number that can be represented in the integer datatype.
Consider the example of calculating 56 * 37 modulo 100 using 8-bit arithmetic, so no intermediate total may be 256 or greater. We start by representing a = 56 = 3 * 16 + 8 and b = 37 = 2 * 16 + 5, (note that 16 is the square root of 256) so:
a1 = 8
a2 = 3
b1 = 5
b2 = 2
Then the four intermediate products with their shifts are:
p11 = 8 * 5 = 40
p12 = 8 * 2 = 16 > 32 > 64 > 128 (28) > 56
p21 = 3 * 5 = 15 > 30 > 60 > 120 (20) > 40
p22 = 3 * 2 = 6 > 12 > 24 > 48 > 96 > 192 (92) > 184 (84) > 168 (68) > 136 (36)
We’re using binary arithmetic, so each number doubles as it is shifted, taking it modulo 100 as we go. The product of two low-half numbers is not shifted, the product of a low-half and high-half number is shifted 4 times (since log2 16 = 4), and the product of two high-half numbers is shifted 8 times. Then the intermediate products are summed, again removing m each time an intermediate sum exceeds m:
s = 40 + 56 = 96
s = 96 + 40 = 136 (36)
s = 36 + 36 = 72
And that’s the final answer: 56 * 37 = 2072, which is 72 (mod 100).
If m is within one bit of the maximum for the integer data type, things get messier; the basic answer is to split into three parts, compute the intermediate products, and recombine.
See my blog for code in Scheme, and also a contributed solution in C that uses a somewhat different algorithm.