I am trying to find the most efficient way to compute modulo 255 of an 32-bit unsigned integer. My primary focus is to find an algorithm that works well across x86 and ARM platforms with an eye towards applicability beyond that. To first order, I am trying to avoid memory operations (which could be expensive), so I am looking for bit-twiddly approaches while avoiding tables. I am also trying to avoid potentially expensive operations such as branches and multiplies, and minimize the number of operations and registers used.
The ISO-C99 code below captures the eight variants I tried so far. It includes a framework for exhaustive test. I bolted onto this some crude execution time measurement which seems to work well enough to get a first performance impression. On the few platforms I tried (all with fast integer multiplies) the variants WARREN_MUL_SHR_2
, WARREN_MUL_SHR_1
, and DIGIT_SUM_CARRY_OUT_1
seem to be the most performant. My experiments show that the x86, ARM, PowerPC and MIPS compilers I tried at Compiler Explorer all make very good use of platform-specific features such as three-input LEA
, byte-expansion instructions, multiply-accumulate, and instruction predication.
The variant NAIVE_USING_DIV
uses an integer division, back-multiply with the divisor followed by subtraction. This is the baseline case. Modern compilers know how to efficiently implement the unsigned integer division by 255 (via multiplication) and will use a discrete replacement for the backmultiply where appropriate. To compute modulo base-1
one can sum base
digits, then fold the result. For example 3334 mod 9: sum 3+3+3+4 = 13, fold 1+3 = 4. If the result after folding is base-1
, we need to generate 0 instead. DIGIT_SUM_THEN_FOLD
uses this method.
A. Cockburn, "Efficient implementation of the OSI transport protocol checksum algorithm using 8/16-bit arithmetic", ACM SIGCOMM Computer Communication Review, Vol. 17, No. 3, July/Aug. 1987, pp. 13-20
showed a different way of adding digits modulo base-1
efficiently in the context of a checksum computation modulo 255. Compute a byte-wise sum of the digits, and after each addition, add any carry-out from the addition as well. So this would be an ADD a, b
, ADC a, 0
sequence. Writing out the addition chain for this using base 256
digits it becomes clear that the computation is basically a multiply with 0x0101 ... 0101
. The result will be in the most significant digit position, except that one needs to capture the carry-out from the addition in that position separately. This method only works when a base
digit comprises 2k bits. Here we have k=3
. I tried three different ways of remapping a result of base-1
to 0, resulting in variants DIGIT_SUM_CARRY_OUT_1
, DIGIT_SUM_CARRY_OUT_2
, DIGIT_SUM_CARRY_OUT_3
.
An intriguing approach to computing modulo-63 efficiently was demonstrated by Joe Keane in the newsgroup comp.lang.c on 1995/07/09. While thread participant Peter L. Montgomery proved the algorithm correct, unfortunately Mr. Keane did not respond to requests to explain its derivation. This algorithm is also reproduced in H. Warren's Hacker's Delight 2nd ed. I was able to extend it, in purely mechanical fashion, to modulo-127 and modulo-255. This is the (appropriately named) KEANE_MAGIC variant. Update: Since I originally posted this question, I have worked out that Keane's approach is basically a clever fixed-point implementation of the following: return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0));
. This makes it a close relative of the next variant.
Henry S. Warren, Hacker's Delight 2nd ed., p. 272 shows a "multiply-shift-right" algorithm, presumably devised by the author themself, that is based on the mathematical property that n mod 2k-1 = floor (2k / 2k-1 * n) mod 2k. Fixed point computation is used to multiply with the factor 2k / 2k-1. I constructed two variants of this that differ in how they handle the mapping of a preliminary result of base-1
to 0. These are variants WARREN_MUL_SHR_1
and WARREN_MUL_SHR_2
.
Are there algorithms for modulo-255 computation that are even more efficient than the three top contenders I have identified so far, in particular for platforms with slow integer multiplies? An efficient modification of Keane's multiplication-free algorithm for the summing of four base 256
digits would seem to be of particular interest in this context.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define NAIVE_USING_DIV (1)
#define DIGIT_SUM_THEN_FOLD (2)
#define DIGIT_SUM_CARRY_OUT_1 (3)
#define DIGIT_SUM_CARRY_OUT_2 (4)
#define DIGIT_SUM_CARRY_OUT_3 (5)
#define KEANE_MAGIC (6) // Joe Keane, comp.lang.c, 1995/07/09
#define WARREN_MUL_SHR_1 (7) // Hacker's Delight, 2nd ed., p. 272
#define WARREN_MUL_SHR_2 (8) // Hacker's Delight, 2nd ed., p. 272
#define VARIANT (WARREN_MUL_SHR_2)
uint32_t mod255 (uint32_t x)
{
#if VARIANT == NAIVE_USING_DIV
return x - 255 * (x / 255);
#elif VARIANT == DIGIT_SUM_THEN_FOLD
x = (x & 0xffff) + (x >> 16);
x = (x & 0xff) + (x >> 8);
x = (x & 0xff) + (x >> 8) + 1;
x = (x & 0xff) + (x >> 8) - 1;
return x;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_1
uint32_t t;
t = 0x01010101 * x;
t = (t >> 24) + (t < x);
if (t == 255) t = 0;
return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_2
uint32_t t;
t = 0x01010101 * x;
t = (t >> 24) + (t < x) + 1;
t = (t & 0xff) + (t >> 8) - 1;
return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_3
uint32_t t;
t = 0x01010101 * x;
t = (t >> 24) + (t < x);
t = t & ((t - 255) >> 8);
return t;
#elif VARIANT == KEANE_MAGIC
x = (((x >> 16) + x) >> 14) + (x << 2);
x = ((x >> 8) + x + 2) & 0x3ff;
x = (x - (x >> 8)) >> 2;
return x;
#elif VARIANT == WARREN_MUL_SHR_1
x = (0x01010101 * x + (x >> 8)) >> 24;
x = x & ((x - 255) >> 8);
return x;
#elif VARIANT == WARREN_MUL_SHR_2
x = (0x01010101 * x + (x >> 8)) >> 24;
if (x == 255) x = 0;
return x;
#else
#error unknown VARIANT
#endif
}
uint32_t ref_mod255 (uint32_t x)
{
volatile uint32_t t = x;
t = t % 255;
return t;
}
// timing with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
int main (void)
{
double start, stop;
uint32_t res, ref, x = 0;
printf ("Testing VARIANT = %d\n", VARIANT);
start = second();
do {
res = mod255 (x);
ref = ref_mod255 (x);
if (res != ref) {
printf ("error @ %08x: res=%08x ref=%08x\n", x, res, ref);
return EXIT_FAILURE;
}
x++;
} while (x);
stop = second();
printf ("test passed\n");
printf ("elapsed = %.6f seconds\n", stop - start);
return EXIT_SUCCESS;
}