The modulus has already been explained, nevertheless, let's recapitulate.
To find the remainder of k
modulo 2^n-1
, write
k = a + 2^n*b, 0 <= a < 2^n
Then
k = a + ((2^n-1) + 1) * b
= (a + b) + (2^n-1)*b
≡ (a + b) (mod 2^n-1)
If a + b >= 2^n
, repeat until the remainder is less than 2^n
, and if that leads you to a + b = 2^n-1
, replace that with 0. Each "shift right by n
and add to the last n
bits" moves the first set bit right by n
or n-1
places (unless k < 2^(2*n-1)
, when the first set bit after the shift-and-add may be the 2^n
bit). So if the width of the type is large compared to n
, this will need many shifts - consider a 128-bit type and n = 3
, for large k
you will need over 40 shifts. To reduce the number of shifts required, you can exploit the fact that
2^(m*n) - 1 = (2^n - 1) * (2^((m-1)*n) + 2^((m-2)*n) + ... + 2^(2*n) + 2^n + 1),
of which we will only use that 2^n - 1
divides 2^(m*n) - 1
for all m > 0
. Then you shift by multiples of n
that are roughly half the maximal bit-length the value can have at that step. For the above example of a 128-bit type and the remainder modulo 7 (2^3 - 1
), the closest multiples of 3 to 128/2 are 63 and 66, first shift by 63 bits
r_1 = (k & (2^63 - 1)) + (k >> 63) // r_1 < 2^63 + 2^(128-63) < 2^66
to get a number with at most 66 bits, then shift by 66/2 = 33 bits
r_2 = (r_1 & (2^33 - 1)) + (r_1 >> 33) // r_2 < 2^33 + 2^(66-33) = 2^34
to reach at most 34 bits. Next shift by 18 bits, then 9, 6, 3
r_3 = (r_2 & (2^18 - 1)) + (r_2 >> 18) // r_3 < 2^18 + 2^(34-18) < 2^19
r_4 = (r_3 & (2^9 - 1)) + (r_3 >> 9) // r_4 < 2^9 + 2^(19-9) < 2^11
r_5 = (r_4 & (2^6 - 1)) + (r_4 >> 6) // r_5 < 2^6 + 2^(11-6) < 2^7
r_6 = (r_5 & (2^3 - 1)) + (r_5 >> 3) // r_6 < 2^3 + 2^(7-3) < 2^5
r_7 = (r_6 & (2^3 - 1)) + (r_6 >> 3) // r_7 < 2^3 + 2^(5-3) < 2^4
Now a single subtraction if r_7 >= 2^3 - 1
suffices. To calculate k % (2^n -1)
in a b-bit type, O(log2 (b/n)) shifts are needed.
The quotient is obtained similarly, again we write
k = a + 2^n*b, 0 <= a < 2^n
= a + ((2^n-1) + 1)*b
= (2^n-1)*b + (a+b),
so k/(2^n-1) = b + (a+b)/(2^n-1)
, and we continue while a+b > 2^n-1
. Here we unfortunately cannot reduce the work by shifting and masking about half the width, so the method is only efficient when n
is not much smaller than the width of the type.
Code for the fast cases where n
is not too small:
unsigned long long modulus_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL;
while(k > mask) {
k = (k & mask) + (k >> n);
}
return k == mask ? 0 : k;
}
unsigned long long quotient_2n1(unsigned n, unsigned long long k) {
unsigned long long mask = (1ULL << n) - 1ULL, quotient = 0;
while(k > mask) {
quotient += k >> n;
k = (k & mask) + (k >> n);
}
return k == mask ? quotient + 1 : quotient;
}
For the special case where n
is half the width of the type, the loop runs at most twice, so if branches are expensive, it may be better to unroll the loop and unconditionally execute the loop body twice.