[Edit2] below for performance notes
An attempt with only 1 if
condition.
This approach is O(log2(sizeof unsigned)). Run time would increase by 1 set of ands/shifts/add rather than twice the time with a loop approach should code use uint64_t
.
unsigned mod31(uint32_t x) {
#define m31 (31lu)
#define m3131 ((m31 << 5) | m31)
#define m31313131 ((m3131 << 10) | m3131)
static const uint32_t mask1 = (m31 << 0) | (m31 << 10) | (m31 << 20) | (m31 << 30);
static const uint32_t mask2 = (m31 << 5) | (m31 << 15) | (m31 << 25);
uint32_t a = x & mask1;
uint32_t b = x & mask2;
x = a + (b >> 5);
// x = xx 0000x xxxxx 0000x xxxxx 0000x xxxxx
a = x & m31313131;
b = x & (m31313131 << 20);
x = a + (b >> 20);
// x = 00 00000 00000 000xx xxxxx 000xx xxxxx
a = x & m3131;
b = x & (m3131 << 10);
x = a + (b >> 10);
// x = 00 00000 00000 00000 00000 00xxx xxxxx
a = x & m31;
b = x & (m31 << 5);
x = a + (b >> 5);
// x = 00 00000 00000 00000 00000 0000x xxxxx
return x >= 31 ? x-31 : x;
}
[Edit]
The first addition method sums the individual 7 groups of five bit in parallel. Subsequent additions bring the 7 group into 4, then 2, then 1. This final 7-bit sum then proceeds to add its upper half (2-bits) to its lower half(5-bits). Code then uses one test to perform the final "mod".
This method scales for wider unsigned
up to at least uint165_t
log2(31+1)*(31+2). Pass that, a little more code is needed.
See @rici for some good optimizations. Still recommend using uint32_t
vs. unsigned
and 31UL
in shifts like 31U << 15
as an unsigned 31U
may only be 16 bits long. (16 bit int
popular in embedded world in 2014).
[Edit2]
Besides letting the compiler use its optimizer, 2 additional techniques sped performance. These are more minor parlor tricks that yielded a modest improvement. Keep in mind YMMV and this is for a 32-bit unsigned
.
Using a table look-up for the last modulo
improved 10-20%. Using unsigned t
table rather than unsigned char t
helped a bit too. It turned out that table length, as first expected needed to be 2*31, only needed 31+5.
Using a local variable rather than always calling the function parameter surprisingly helped. Likely a weakness in my gcc compiler.
Found non-branching solutions, not shown, to replace x >= 31 ? x-31 : x
. but their coding complexity was greater and performance was slower.
All-in-all, a fun exercise.
unsigned mod31quik(unsigned xx) {
#define mask (31u | (31u << 10) | (31u << 20) | (31u << 30))
unsigned x = (xx & mask) + ((xx >> 5) & mask);
x += x >> 20;
x += x >> 10;
x = (x & 31u) + ((x >> 5) & 31u);
static const unsigned char t[31 * 2 /* 36 */] = { 0, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
return t[x];
}