21

I am trying to find the most efficient way to compute modulo 255 of a 32-bit unsigned integer. My primary focus is to find an algorithm that works well across x86 and ARM platforms with an eye towards applicability beyond that. To first order, I am trying to avoid memory operations (which could be expensive), so I am looking for bit-twiddly approaches while avoiding tables. I am also trying to avoid potentially expensive operations such as branches and multiplies, and to minimize the number of operations and registers used.

The ISO-C99 code below captures the eight variants I have tried so far. It includes a framework for exhaustive testing. I bolted onto this some crude execution-time measurement which seems to work well enough to get a first performance impression. On the few platforms I tried (all with fast integer multiplies) the variants WARREN_MUL_SHR_2, WARREN_MUL_SHR_1, and DIGIT_SUM_CARRY_OUT_1 seem to be the most performant. My experiments show that the x86, ARM, PowerPC and MIPS compilers I tried at Compiler Explorer all make very good use of platform-specific features such as three-input LEA, byte-expansion instructions, multiply-accumulate, and instruction predication.

The variant NAIVE_USING_DIV uses an integer division and a back-multiply with the divisor, followed by a subtraction. This is the baseline case. Modern compilers know how to efficiently implement the unsigned integer division by 255 (via multiplication) and will use a discrete replacement for the back-multiply where appropriate. To compute modulo base−1 (that is, one less than the base), one can sum the base digits, then fold the result. For example, 3334 mod 9: sum 3+3+3+4 = 13, fold 1+3 = 4. If the result after folding is base−1, we need to generate 0 instead. DIGIT_SUM_THEN_FOLD uses this method.
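
Purely to illustrate the idea (this is not one of the benchmarked variants, and the function name is mine), the base-10 analogue for modulo 9 could be written as:

/* illustration only: n mod 9 via repeated base-10 digit sums,
   mirroring the 3334 -> 3+3+3+4 = 13 -> 1+3 = 4 example above */
static unsigned mod9_digit_sum (unsigned n)
{
    while (n > 9) {
        unsigned s = 0;
        while (n) {              /* sum the decimal digits */
            s += n % 10;
            n /= 10;
        }
        n = s;                   /* fold and repeat until a single digit remains */
    }
    return (n == 9) ? 0 : n;     /* remap a final digit sum of 9 to 0 */
}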

A. Cockburn, "Efficient implementation of the OSI transport protocol checksum algorithm using 8/16-bit arithmetic", ACM SIGCOMM Computer Communication Review, Vol. 17, No. 3, July/Aug. 1987, pp. 13-20

showed a different way of adding digits modulo base−1 efficiently in the context of a checksum computation modulo 255. Compute a byte-wise sum of the digits, and after each addition, add in any carry-out from that addition as well. So this would be an ADD a, b; ADC a, 0 sequence. Writing out the addition chain for this using base-256 digits, it becomes clear that the computation is basically a multiply with 0x0101 ... 0101. The result will be in the most significant digit position, except that one needs to capture the carry-out from the addition in that position separately. This method only works when a base digit comprises 2^k bits; here we have k = 3. I tried three different ways of remapping a result of base−1 to 0, resulting in variants DIGIT_SUM_CARRY_OUT_1, DIGIT_SUM_CARRY_OUT_2, DIGIT_SUM_CARRY_OUT_3.
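
Modeled in plain C, with a 32-bit accumulator standing in for the carry flag, that chain looks roughly like this (illustration only; the multiply-based forms below are what I actually benchmark, and the function name is mine):

/* sketch of the byte-wise ADD a,b / ADC a,0 chain; not a benchmarked variant */
static uint32_t mod255_end_around_carry (uint32_t x)
{
    uint32_t a = 0;
    for (int i = 0; i < 32; i += 8) {
        a += (x >> i) & 0xff;        /* ADD a, digit                        */
        a = (a & 0xff) + (a >> 8);   /* ADC a, 0: fold the carry-out back in */
    }
    /* a is congruent to x mod 255 and lies in [0, 256] */
    return (a >= 255) ? (a - 255) : a;
}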

An intriguing approach to computing modulo-63 efficiently was demonstrated by Joe Keane in the newsgroup comp.lang.c on 1995/07/09. While thread participant Peter L. Montgomery proved the algorithm correct, unfortunately Mr. Keane did not respond to requests to explain its derivation. This algorithm is also reproduced in H. Warren's Hacker's Delight 2nd ed. I was able to extend it, in purely mechanical fashion, to modulo-127 and modulo-255. This is the (appropriately named) KEANE_MAGIC variant. Update: Since I originally posted this question, I have worked out that Keane's approach is basically a clever fixed-point implementation of return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0)); (spelled out as a function below). This makes it a close relative of the next variant.
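
For reference, here is that floating-point formulation as a small stand-alone function (the function name is mine; it is shown only to document the relationship, not as a candidate variant):

#include <math.h>

/* floating-point formulation of which KEANE_MAGIC is (as argued above)
   a fixed-point implementation; for reference only */
static uint32_t mod255_float_ref (uint32_t x)
{
    return (uint32_t)(fmod (x * 256.0 / 255.0 + 0.5, 256.0) * (255.0 / 256.0));
}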

Henry S. Warren, Hacker's Delight 2nd ed., p. 272 shows a "multiply-shift-right" algorithm, presumably devised by the author himself, that is based on the mathematical property that n mod (2^k − 1) = floor ((2^k / (2^k − 1)) * n) mod 2^k. For example, with k = 8 and n = 1000: floor ((256 / 255) * 1000) = 1003, and 1003 mod 256 = 235 = 1000 mod 255. Fixed-point computation is used to multiply with the factor 2^k / (2^k − 1). I constructed two variants of this that differ in how they handle the mapping of a preliminary result of base−1 to 0. These are variants WARREN_MUL_SHR_1 and WARREN_MUL_SHR_2.

Are there algorithms for modulo-255 computation that are even more efficient than the three top contenders I have identified so far, in particular for platforms with slow integer multiplies? An efficient modification of Keane's multiplication-free algorithm for summing four base-256 digits would seem to be of particular interest in this context.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NAIVE_USING_DIV       (1)
#define DIGIT_SUM_THEN_FOLD   (2)
#define DIGIT_SUM_CARRY_OUT_1 (3)
#define DIGIT_SUM_CARRY_OUT_2 (4)
#define DIGIT_SUM_CARRY_OUT_3 (5)
#define KEANE_MAGIC           (6)  // Joe Keane, comp.lang.c, 1995/07/09
#define WARREN_MUL_SHR_1      (7)  // Hacker's Delight, 2nd ed., p. 272
#define WARREN_MUL_SHR_2      (8)  // Hacker's Delight, 2nd ed., p. 272

#define VARIANT (WARREN_MUL_SHR_2)

uint32_t mod255 (uint32_t x)
{
#if VARIANT == NAIVE_USING_DIV
    return x - 255 * (x / 255);
#elif VARIANT == DIGIT_SUM_THEN_FOLD
    x = (x & 0xffff) + (x >> 16);
    x = (x & 0xff) + (x >> 8);
    x = (x & 0xff) + (x >> 8) + 1;
    x = (x & 0xff) + (x >> 8) - 1;
    return x;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_1
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    if (t == 255) t = 0;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_2
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x) + 1;
    t = (t & 0xff) + (t >> 8) - 1;
    return t;
#elif VARIANT == DIGIT_SUM_CARRY_OUT_3
    uint32_t t;
    t = 0x01010101 * x;
    t = (t >> 24) + (t < x);
    t = t & ((t - 255) >> 8);
    return t;
#elif VARIANT == KEANE_MAGIC
    x = (((x >> 16) + x) >> 14) + (x << 2);
    x = ((x >> 8) + x + 2) & 0x3ff;
    x = (x - (x >> 8)) >> 2;
    return x;
#elif VARIANT == WARREN_MUL_SHR_1
    x = (0x01010101 * x + (x >> 8)) >> 24;
    x = x & ((x - 255) >> 8);
    return x;
#elif VARIANT == WARREN_MUL_SHR_2
    x = (0x01010101 * x + (x >> 8)) >> 24;
    if (x == 255) x = 0;
    return x;
#else
#error unknown VARIANT
#endif
}

uint32_t ref_mod255 (uint32_t x)
{
    volatile uint32_t t = x;
    t = t % 255;
    return t;
}

// timing with microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start, stop;
    uint32_t res, ref, x = 0;

    printf ("Testing VARIANT = %d\n", VARIANT);
    start = second();
    do {
        res = mod255 (x);
        ref = ref_mod255 (x);
        if (res != ref) {
            printf ("error @ %08x: res=%08x ref=%08x\n", x, res, ref);
            return EXIT_FAILURE;
        }        
        x++;
    } while (x);
    stop = second();
    printf ("test passed\n");
    printf ("elapsed = %.6f seconds\n", stop - start);
    return EXIT_SUCCESS;
}
njuffa
  • Is your ultimate goal a C implementation that compiles to nice asm (with which compilers?), or are you considering hand-written asm if one of these algorithms (or another one) runs into missed optimizations, or inability to access add-width-carry from C? Since you mention multiply-like additions, have you considered widening multiply, if that gives you the carry out you want in the low bit of the high half? Most ISAs can do that, although ARM Cortex-M0 doesn't have widening multiply. – Peter Cordes Jun 21 '21 at 20:52
  • 1
    @PeterCordes I am looking for algorithms mapped to C code that compiles to efficient machine code (using clang and gcc, for the most part; I think the Intel compiler is currently in the process of transitioning to clang?). I definitely don't want to use assembly language as I pretty much stopped using it some 15 years ago. Looking through the code generated for my eight variants it seemed to me that the compilers made available by Compiler Explorer are doing a very good job on this code. – njuffa Jun 21 '21 at 21:03
  • 1
    Not sure I can justify how it works ... but `x = ((x + x / 255u) & 0xFFu);` appears to always give the correct answer and, on my local machine (clang-cl/Visual Studio, 64-bit Windows-10) consistently out-performs all your other methods. One division, one addition and a trivial mask. – Adrian Mole Jun 22 '21 at 00:39
  • 2
    @AdrianMole Letting `x = 255 q + r` with `0 ≤ r < 255`, the formula works out to `(255 q + r + q) mod 256 = (256 q + r) mod 256 = r`, as desired. – David Eisenstat Jun 22 '21 at 01:11
  • @DavidEisenstat I know that the mask is a *very* cheap way of taking `x % 256` but what I can't (yet) *formally* prove is that `x % n` is the same as `(x + x/n) % (n + 1)` (which is what my trick does). Your comment helps a lot. Thanks. Bed-time in blighty but I'll maybe dream about a more formal answer? – Adrian Mole Jun 22 '21 at 01:40
  • @AdrianMole Thanks, that works fine. I would appreciate if you could write it up as an answer. If it also works for other divisors of the form 2**k-1, that would be a bonus. As for performance: On the platform I am writing this comment on (older x86), it is identical in speed to `WARREN_MUL_SHR_1` and a tad slower than `WARREN_MUL_SHR_2`. Excellent. – njuffa Jun 22 '21 at 02:07
  • @AdrianMole I confirmed that on an Intel Xeon W-2133 with MSVS 2019 `/Ox /arch:AVX2` your solution is faster than any of the variants in my question. – njuffa Jun 22 '21 at 04:10
  • @njuffa: Note that latency and throughput are separate things on modern OoO exec CPUs. Storing to a volatile "sink" variable will test throughput (with one slot of front-end bandwidth used up by the store, unlike with a Benchmark::DoNotOptimize to just materialize a variable in a register). To test latency, one can do `x = mod255(x) + 0xabcdef` in a loop. (Random constant to stop the compiler from seeing through the operations). I guess future readers might be interested in either, and maybe also ability to auto-vectorize, so IDK if there's any need to state which you want in the question. – Peter Cordes Jun 22 '21 at 16:47
  • @PeterCordes I am well aware of the difference between latency and throughput. And yes, I am interested in throughput and measuring that here (there are clearly better ways to do that; which is why I stated I used a **crude** way here). Our experiences may differ, but throughput is largely what matters in modern processing according to mine. – njuffa Jun 22 '21 at 19:14
  • In loops over lots of data, yes throughput (often front-end) is often a bottleneck. (And yeah I know you know the difference, part of that comment was copy/pasted from a comment on an answer that said "faster" without details). Branch mispredicts and serial dependencies exist in some code. Compilers often seem to favour latency over throughput when it only costs an extra uop, e.g. multiply with 2x LEA instructions (1c each) instead of one `imul` (3 cycles with `-march=sandybridge`). e.g. clang changed its threshold from one instruction to two in clang3.7 to https://godbolt.org/z/bbEfdGGeY. – Peter Cordes Jun 22 '21 at 19:23
  • If we assume compiler devs aren't crazy and did benchmark these changes, it seems likely that latency matters outside of high-throughput loops. Or maybe it's just that compilers aren't good at noticing loop-carried dependency chains and optimizing for latency instead of throughput, and latency is the safer bet? Outside of high-throughput loops, I guess it's not rare for CPUs to encounter enough serial dependencies to limit ILP to less than their pipeline width. But fewer uops takes less space in the RS & ROB, giving them a larger out-of-order window, if branches are correctly predicted. – Peter Cordes Jun 22 '21 at 19:26
  • I'll leave this link here for the future. https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/ – Sopel Jun 24 '21 at 16:01

5 Answers

13

For arbitrary unsigned integers, x and n, evaluating the modulo expression x % n involves (conceptually, at least) three operations: division, multiplication and subtraction:

quotient = x / n;
product = quotient * n;
modulus = x - product;

However, when n is a power of 2 (n = 2^p), the modulo can be determined much more rapidly, simply by masking out all but the lower p bits.

On most CPUs, addition, subtraction and bit-masking are very 'cheap' (rapid) operations, multiplication is more 'expensive' and division is very expensive – but note that most optimizing compilers will convert division by a compile-time constant into a multiplication (by a different constant) and a bit-shift (vide infra).

Thus, if we can convert our modulo 255 into a modulo 256, without too much overhead, we can likely speed up the process. We can do just this by noting that x % n is equivalent to (x + x / n) % (n + 1). Thus, our conceptual operations are now: division, addition and masking.

In the specific case of masking the lower 8 bits, x86/x64-based CPUs (and others?) will likely be able to perform a further optimization, as they can access 8-bit versions of (most) registers.

Here's what the clang-cl compiler generates for a naïve modulo 255 function (argument passed in ecx and returned in eax):

unsigned Naive255(unsigned x)
{
    return x % 255;
}
    mov     edx, ecx
    mov     eax, 2155905153 ;
    imul    rax, rdx        ; Replacing the IDIV with IMUL and SHR
    shr     rax, 39         ;
    mov     edx, eax
    shl     edx, 8
    sub     eax, edx
    add     eax, ecx

And here's the (clearly faster) code generated using the 'trick' described above:

unsigned Trick255(unsigned x)
{
    return (x + x / 255) & 0xFF;
}
    mov     eax, ecx
    mov     edx, 2155905153
    imul    rdx, rax
    shr     rdx, 39
    add     edx, ecx
    movzx   eax, dl         ; Faster than an explicit AND mask?

Testing this code on a Windows-10 (64-bit) platform (Intel® Core™ i7-8550U CPU) shows that it significantly (but not hugely) out-performs the other algorithms presented in the question.


The answer given by David Eisenstat explains how/why this equivalence is valid.
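
As an aside, since the identity (x + x / n) % (n + 1) == x % n holds for any positive n, the same trick should work for other divisors of the form 2^p - 1, where n + 1 is again a power of two. For example (a sketch justified only by the identity above, not benchmarked):

unsigned Trick127(unsigned x)
{
    return (x + x / 127) & 0x7F;   // n = 127, n + 1 = 128, so the final modulo is a mask
}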

Adrian Mole
  • 2
    Now that you point it out, compilers should learn this peephole / special-case trick for `unsigned % 255` so they do it for you when you write `x % 255` like a normal person. Stuff like this is, after all, the point of an ahead-of-time compiler that can spend time looking for optimization like this. Once the dust settles on this Q&A, the best options (for each ISA and/or microarchitecture) should get reported on https://bugs.llvm.org/ and https://gcc.gnu.org/bugzilla/ as missed-optimization bugs. – Peter Cordes Jun 22 '21 at 13:48
  • Re: testing: fortunately computers are fast enough to brute-force test literally every `uint32_t` input with `Naive255(x) == Trick255(x)`. – Peter Cordes Jun 22 '21 at 13:51
  • 1
    And yes, `movzx eax, dl` is at least as good as `and edx, 0xff`, even if we were ok with leaving the result in EDX instead of EAX. On some CPUs, it's one ALU uop for any execution port (like AND) to copy + zext, but on Intel IvyBridge and later, mov-elimination works on 8->32-bit zero extending mov, as long as it's between different registers. So it has 0 latency, one uop for the front-end but zero for the back-end. [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342). (AMD's mov-elim only handles `mov eax,edx` or `mov rax,rdx`) – Peter Cordes Jun 22 '21 at 13:54
  • 1
    We're talking about compilers that can already turn a whole loop into a `popcnt`, or a `sum += i` loop into the closed-form Gauss's formula (clang anyway)! And use a shift or AND for power-of-2 divisors. Special-casing 255 to use a canned sequence of operations like `(x + x / 255) & 0xFF` as if you'd written that in the source is seriously not a big deal, and can be done in a target-independent optimization pass. – Peter Cordes Jun 22 '21 at 14:01
  • 2
    Or more relevant, recent GCC versions already have special-case peepholes for stuff like `x % 7 == 0` to do it more efficiently than `return x % 7`: https://godbolt.org/z/6qco9rjPc, new in GCC9. (Using [this trick](https://stackoverflow.com/a/40062768/224132)). Related: https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/ – Peter Cordes Jun 22 '21 at 14:02
  • @Peter And it can likely be extended to other cases. Not sure about something like `u % 254` but certainly other 2^p - 1 modulo ops. – Adrian Mole Jun 22 '21 at 14:03
  • Ok, great, it's easy to check if `n + 1` is a power of 2, so that should be useful for compilers to look for. (Also, found the Q&A I was actually looking for: [Fast divisibility tests (by 2,3,4,5,.., 16)?](https://stackoverflow.com/a/49264279) has some new work, I think what GCC is using, including the `ror` by 2 for `x%12 == 0`) – Peter Cordes Jun 22 '21 at 14:05
9

Here’s my sense of how the fastest answers work. I don’t know yet whether Keane can be improved or easily generalized.

Given an integer x ≥ 0, let q = ⌊x/255⌋ (in C, q = x / 255;) and r = x − 255 q (in C, r = x % 255;) so that q ≥ 0 and 0 ≤ r < 255 are integers and x = 255 q + r.

Adrian Mole’s method

This method evaluates (x + ⌊x/255⌋) mod 2^8 (in C, (x + x / 255) & 0xff), which equals (255 q + r + q) mod 2^8 = (2^8 q + r) mod 2^8 = r.

Henry S. Warren’s method

Note that x + ⌊x/255⌋ = ⌊x + x/255⌋ = ⌊(2^8/255) x⌋, where the first step follows from x being an integer. This method uses the multiplier (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32) instead of 2^8/255, which is the sum of the infinite series 2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + …. Since the approximation is slightly under, this method must detect the residue 2^8 − 1 = 255.

Joe Keane’s method

The intuition for this method is to compute y = (2^8/255) x mod 2^8, which equals (2^8/255) (255 q + r) mod 2^8 = (2^8 q + (2^8/255) r) mod 2^8 = (2^8/255) r, and return y − y/2^8, which equals r.

Since these formulas don’t use the fact that ⌊(2^8/255) r⌋ = r, Keane can switch from 2^8 to 2^10 for two guard bits. Ideally, these would always be zero, but due to fixed-point truncation and an approximation for 2^10/255, they’re not. Keane adds 2 to switch from truncation to rounding, which also avoids the special case in Warren.

This method sort of uses the multiplier 2^2 (2^0 + 2^−8 + 2^−16 + 2^−24 + 2^−32 + 2^−40) = 2^2 (2^0 + 2^−16 + 2^−32) (2^0 + 2^−8). The C statement x = (((x >> 16) + x) >> 14) + (x << 2); computes x′ = ⌊2^2 (2^0 + 2^−16 + 2^−32) x⌋ mod 2^32. Then ((x >> 8) + x) & 0x3ff is x′′ = ⌊(2^0 + 2^−8) x′⌋ mod 2^10.

I don’t have time right now to do the error analysis formally. Informally, the error interval of the first computation has width < 1; the second, width < 2 + 2^−8; the third, width < ((2 − 2^−8) + 1)/2^2 < 1, which allows correct rounding.

Regarding improvements, the 2^−40 term of the approximation seems not necessary (?), but we might as well have it unless we can drop the 2^−32 term. Dropping 2^−32 pushes the approximation quality out of spec.

David Eisenstat
9

Guess you're probably not looking for solutions that require fast 64-bit multiplication, but for the record:

return (x * 0x101010101010102ULL) >> 56;
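
The constant is 2^8/(2^8 − 1) = 256/255 in 8.56 fixed point, rounded up (as noted in the comments below), i.e. ceil(2^64 / 255). One way to derive it, assuming a compiler that provides unsigned __int128 (a sketch, with a helper name of my choosing):

#include <stdint.h>

/* ceil(2^64 / 255); the division is not exact (2^64 mod 255 = 1), so adding 1 rounds up */
static uint64_t mod255_magic (void)
{
    return (uint64_t)(((unsigned __int128)1 << 64) / 255) + 1;  /* 0x0101010101010102 */
}
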
David Eisenstat
  • how did you get that number? Does an equivalent exist for all other power-of-two minus one values? – Noah Feb 27 '22 at 06:50
  • 1
    @Noah It's (2^8)/(2^8-1) in 8.56 fixed point, rounded up. I can't reproduce the error analysis in my other answers off the top of my head, but there should be an equivalent for other exponents. [Lemire](https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/) has more details and generalizes to other divisors. – David Eisenstat Feb 27 '22 at 13:02
7

This method (improved slightly since the previous edit) mashes up Warren and Keane. On my laptop, it’s faster than Keane but not as fast as a 64-bit multiply and shift. It avoids multiplication but benefits from a single rotate instruction. Unlike the original version, it’s probably OK on RISC-V.

Like Warren, this method approximates ⌊(256/255) x mod 256⌋ in 8.24 fixed point. Mod 256, each byte b contributes a term (256/255) b, which is approximately b.bbb base 256. The original version of this method just sums all four byte rotations. (I’ll get to the revised version in a moment.) This sum always underestimates the real value, but by less than 4 units in the last place. By adding 4/2^24 before truncating, we guarantee the right answer as in Keane.

The revised version saves work by relaxing the approximation quality. We write (256/255) x = (257/256) (65536/65535) x, evaluate (65536/65535) x in 16.16 fixed point (i.e., add x to its 16-bit rotation), and then multiply by 257/256 and mod by 256 into 8.24 fixed point. The first multiplication has error less than 2 units in the last place of 16.16, and the second is exact (!). The sum underestimates by less than (2/2^16) (257/256), so a constant term of 514/2^24 suffices to fix the truncation. It’s also possible to use a greater value in case a different immediate operand is more efficient.

uint32_t mod255(uint32_t x) {
  x += (x << 16) | (x >> 16);
  return ((x << 8) + x + 514) >> 24;
}
David Eisenstat
  • Were you testing throughput or latency? (e.g. `x = mod255(x) + 0xabcdef` in a loop for latency, `tmp = mod255(x)` / `Benchmark::DoNotOptimize(tmp)` and `x` for throughput) What compiler/CPU? https://godbolt.org/z/Wohrenbd8 has compiler output for a few versions, showing that x86-64 benefits from BMI2 `rorx` to copy-and-rotate in one instruction, instead of needing to `mov` / `ror`. – Peter Cordes Jun 22 '21 at 16:38
  • @Peter on mobile, so hard to read Godbolt, but I wasn't terribly concerned with x64 performance given the better alternatives on that platform. – David Eisenstat Jun 22 '21 at 17:35
  • 2
    @DavidEisenstat Excellent! It is faster than `KEANE_MAGIC` on my x86, and just 5 instructions on 32-bit ARM (both clang and gcc). This should also be very efficient on GPUs which provide an efficient byte rotate via a byte-shuffle instruction (I'll have to see whether the compiler can generate it in this case). I assume this is related to digit summing? Looking forward to the explanation (it looks related to digit summing). – njuffa Jun 22 '21 at 19:35
  • On old GPUs this does indeed map to `PRMT` for the rotates but on modern GPUs the CUDA compiler gets this down to five instructions by using the `LEA.HI` instruction. I like it. – njuffa Jun 22 '21 at 21:28
  • @njuffa Explained and revised. Let me know if the new version is a regression on anything you're testing on. – David Eisenstat Jun 22 '21 at 22:12
  • 2
    @DavidEisenstat From what I can tell right now, the revised version performs greater-or-equal than the previous ROL-based one on ARM and x86. It maps to only four instructions for my Turing-architecture GPU. – njuffa Jun 22 '21 at 22:24
  • Note that RISC-V only has rotate in extension-B (which is still experimental). https://godbolt.org/z/7vz6YoPrx shows clang for RV32 with/without `-menable-experimental-extensions -march=rv32imafdcb0p93` (rv32i + a bunch of extensions including b revision 0.93) – Peter Cordes Jun 22 '21 at 23:25
  • @PeterCordes I would have thought the modern way in processor architectures is to provide a funnel shifter, with bit-rotates as a trivial subset of its functionality. Or is RISC-V targeted at low-end designs only, where a three-input operation would be prohibitive? – njuffa Jun 23 '21 at 18:48
  • @njuffa: IDK why RISC-V left out rotates. Perhaps RISC-purity gone wild, like "let's leave out rotate because people can always shift and OR if they want it". But yes I think it was originally aimed at making very compact implementations possible, with rv32i not even including a multiplier apparently. (https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions) Real-world usage of RISC-V is primarily in microcontroller type things, but there are other applications. Patterson is not a fan of CPU-style short vector SIMD, hence only scalar FPU extensions being ratified so far, not SIMD. – Peter Cordes Jun 23 '21 at 18:56
  • (But there is an ARM-SVE style vector extension proposal to let software take advantage of whatever vector width the HW has, with masked loads and stuff. Looks pretty nice. As well as a separate packed-SIMD extension proposal, for traditional x86 SSE style fixed vector width) – Peter Cordes Jun 23 '21 at 18:56
  • @PeterCordes A funnel-shift instruction is very much pure RISC in my book, as it can replace separate SLL, SRL, SRA, ROL, ROR instructions. I'll stop my OT thoughts here. – njuffa Jun 23 '21 at 19:09
  • 1
    @DavidEisenstat All three answers are helpful/useful, so I had to pick one to attach the bonus to. I chose the one with the largest immediate utility. – njuffa Jun 24 '21 at 20:46
2

If we were to have a builtin, intrinsic, or method that is optimised to a single addc instruction, one could use 32-bit arithmetic in the following way:

uint32_t carry = 0;
// sum up top and bottom 16 bits while generating carry out
x = __builtin_addc(x, x<<16, carry, &carry);
x &= 0xffff0000;
// store the previous carry to bit 0 while adding
// bits 16:23 over bits 24:31, and producing one more carry
x = __builtin_addc(x, x << 8, carry, &carry);  
x = __builtin_addc(x, x >> 24, carry, &carry);  
x &= 0x0000ffff;   // actually 0x1ff is enough
// final correction for 0<=x<=257, i.e. min(x,x-255)
x = x < x-255 ? x : x - 255;  

On Arm64, at least, the regular add instruction can take the form add r0, r1, r2, LSL 16; masking with an immediate, or clearing consecutive bits, is a single instruction: bfi r0, wzr, #start_bit, #length.

For parallel calculation one can't use that widening multiplication efficiently. Instead one can divide and conquer while calculating carries -- starting with 16 uint32_t elements interpreted as 16+16 uint16_t elements, then moving to uint8_t arithmetic, one can calculate one result in slightly less than one instruction.

auto a0 = vld2q_u16(ptr);     // split input to top16+bot16 bits
auto a1 = vld2q_u16(ptr + 8); // load more inputs
auto b0 = vaddq_u16(a0.val[0], a0.val[1]);
auto b1 = vaddq_u16(a1.val[0], a1.val[1]);
auto c0 = vcltq_u16(b0, a0.val[1]); // 8 carries
auto c1 = vcltq_u16(b1, a1.val[1]); // 8 more carries
b0 = vsubq_u16(b0, c0);             // subtracting the all-ones mask adds the carry back in
b1 = vsubq_u16(b1, c1);
auto d = vuzpq_u8(vreinterpretq_u8_u16(b0), vreinterpretq_u8_u16(b1)); // de-interleave low/high bytes
auto result = vaddq_u8(d.val[0], d.val[1]);
auto carry = vcltq_u8(result, d.val[1]);
result = vsubq_u8(result, carry);
auto is_255 = vceqq_u8(result, vdupq_n_u8(255));
result = vbicq_u8(result, is_255);  // remap 255 to 0
Aki Suihkonen
  • I don't generally have an intrinsic for single-instruction `addc` available to me (on x86 there is `_addcarry_u64` which might be applicable) , but I used my traditional emulation for it to confirm that the code at the top of this answer works correctly. `#define ADDCcc(a,b,cy,t0,t1) (t0=(b)+cy, t1=(a), cy=t0 – njuffa Jun 23 '21 at 16:08
  • 1
    Clang supports `uint32_t __builtin_addc(uint32_t a, uint32_t b, uint32_t c_in, uint32_t *c_out)` for 32-bit addition, however the generated code is not promising. – Aki Suihkonen Jun 24 '21 at 07:16