
Consider the following union:

union Uint16Vect {
    uint16_t _comps[4];
    uint64_t _all;
};

Is there a fast algorithm for determining whether each component equals 1 modulo 12 or not?

A naive implementation is:

Uint16Vect F(const Uint16Vect a) {
    Uint16Vect r;
    for (int8_t k = 0; k < 4; k++) {
        r._comps[k] = (a._comps[k] % 12 == 1) ? 1 : 0;
    }
    return r;
}
Serge Rogatch

  • Is that slow ?? –  Feb 16 '19 at 17:42
  • Yes, it's currently the slowest part of my multithreaded program with AVX2. – Serge Rogatch Feb 16 '19 at 17:44
  • Useless use of ternary -- the `==` operator already yields `1` for true and `0` for false, when widened to an integral type. – Ben Voigt Feb 16 '19 at 18:33
  • According to my calculations, if you multiply your 16-bit number by **43691** (0xAAAB) and shift down by **19**, you get the same result as dividing by 12. If you multiply this by 12 and subtract that from your original number, should give you your "mod12". The intermediate results fit in 32 bits, so you probably can do 4 of them in 128-bit registers. I don't know whether you can do 8 at a time using AVX, etc. but should be easy to find out (seems like you have experience there.) The real question is whether this saves you any time though... I think it should, if it's vectorized. – yzt Feb 16 '19 at 19:34
  • If you're fine with it not working with 65533 then this is a solution for 8x16 in __m128i. It's not what you want, but someone may craft something from it. https://godbolt.org/z/lNwfl1, relies partially on https://gcc.godbolt.org/z/yqc5u2, the rest is a hacky way to make it work without having proper 32 bit multiplication (the division trick only works with 2x wide integers). So there is probably a good way to do it with 16b*16b=32b instructions with sse to get 4x16b in __m128i, or using AVX for more (but you need unsigned mul to do it properly, which I can't find), but I don't have time now. – Sopel Feb 17 '19 at 00:15
  • @BenVoigt any modern optimizer will optimize out the redundant "convert 0 to 0 and 1 to 1" operation. – Raymond Chen Feb 17 '19 at 00:22
  • @yzt: I think you'll find that most compilers will do that optimisation for you, as long as the dividend is a constant. – rici Feb 17 '19 at 01:18
  • As mentioned, we can multiply by 0xAAAB, shift down 19 bits, multiply by 12, subtract from the number, and compare to one, and GCC 8.2 generates such code. We can also use `(x&3) == 1 && 0xAAAB <= (uint16_t) ((unsigned) x * 0xAAAB)`, which simplifies the arithmetic a bit but introduces an extra comparison. These should be implementable in AVX code, and further simplifications may be possible. – Eric Postpischil Feb 17 '19 at 02:45
  • @RaymondChen: Yes it will, but it doesn't save programmers from having to read the extra code, think it's pointless, wonder what the trick is, and finally decide it's just totally redundant. When the optimizer doesn't care, the code should be written for maximum readability (and maybe even when the optimizer does care, if it isn't a performance-critical part of the code) – Ben Voigt Feb 17 '19 at 05:32
  • @Ben there are those who argue the explicit `bool-expression ? 1 : 0` is more readable because it doesn't rely on a pun (namely that the result of comparison operators are not actually booleans but just integers 0 or 1 exactly). In C++ this is an implicit type conversion. In other languages, you'll get a type mismatch, or the result might even be `-1`. – Raymond Chen Feb 17 '19 at 15:54

5 Answers


Compilers will optimize division by a constant into multiplication by a fixed-point reciprocal. For example, `x / 12` will be optimized to `(x * 43691) >> 19`:

bool h(uint16_t x)
{
    return x % 12 == 1;
}
h(unsigned short):
        movzx   eax, di
        imul    eax, eax, 43691 ; = 0xFFFF*8/12 + 1
        shr     eax, 19
        lea     eax, [rax+rax*2]
        sal     eax, 2
        sub     edi, eax
        cmp     di, 1
        sete    al
        ret

Because there are multiplication instructions in SSE/AVX, this can easily be vectorized. Besides, `x = (x % 12 == 1) ? 1 : 0;` can be simplified to `x = (x % 12 == 1)` and then transformed to `x = ((x - 1) % 12 == 0)`, which avoids loading a vector of 1s from the constant pool for the comparison. You can use the vector extension so that gcc automatically generates the code for you:

typedef uint16_t ymm32x2 __attribute__((vector_size(32)));
ymm32x2 mod12(ymm32x2 x)
{
    return !!((x - 1) % 12);
}

Below is the output from gcc

mod12(unsigned short __vector(16)):
        vpcmpeqd    ymm3, ymm3, ymm3  ; ymm3 = -1
        vpaddw      ymm0, ymm0, ymm3
        vpmulhuw    ymm1, ymm0, YMMWORD PTR .LC0[rip] ; multiply with 43691
        vpsrlw      ymm2, ymm1, 3
        vpsllw      ymm1, ymm2, 1
        vpaddw      ymm1, ymm1, ymm2
        vpsllw      ymm1, ymm1, 2
        vpcmpeqw    ymm0, ymm0, ymm1
        vpandn      ymm0, ymm0, ymm3
        ret

Clang and ICC don't support `!!` on vector types, so you need to change it to `(x - 1) % 12 == 0`. Unfortunately it seems that compilers don't support `__attribute__((vector_size(8)))` to emit MMX instructions. But nowadays you should use SSE or AVX anyway.

The output for `x % 12 == 1` is shorter, as you can see in the same Godbolt link above, but it needs a vector of 1s to compare against, which may or may not be better. It's also possible that the compiler can't optimize as well as hand-written code, so you can try vectorizing the code manually using intrinsics. Check which one works faster in your case.
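
If you go the intrinsics route, a minimal AVX2 sketch of the multiplicative-inverse idea might look like this (the helper name and the 0-or-1-per-lane output convention are illustrative choices, not taken from the compiler output above):

#include <immintrin.h>

// Sketch only: returns 1 in each 16-bit lane where x % 12 == 1, else 0.
__m256i mod12_eq1_epu16(__m256i x)
{
    const __m256i one    = _mm256_set1_epi16(1);
    const __m256i magic  = _mm256_set1_epi16((short)0xAAAB); // 43691 = ceil(2^19 / 12)
    const __m256i twelve = _mm256_set1_epi16(12);

    __m256i xm1 = _mm256_sub_epi16(x, one);                             // x - 1
    __m256i hi  = _mm256_mulhi_epu16(xm1, magic);                       // ((x-1) * 43691) >> 16
    __m256i q   = _mm256_srli_epi16(hi, 3);                             // >> 19 == (x-1) / 12 for 16-bit inputs
    __m256i r   = _mm256_sub_epi16(xm1, _mm256_mullo_epi16(q, twelve)); // (x-1) % 12
    __m256i eq  = _mm256_cmpeq_epi16(r, _mm256_setzero_si256());        // 0xFFFF where the remainder is 0
    return _mm256_and_si256(eq, one);                                   // 1 where x % 12 == 1, else 0
}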

A better way is `((x * 43691) & 0x7ffff) < 43691`, or `x * 357913942 < 357913942` with 32-bit wrap-around arithmetic, as mentioned in nwellnhof's answer. Both test divisibility by 12 (so apply them to `x - 1` here), and both should also be easy to vectorize.


Alternatively for a small input range like this you can use a lookup table. The basic version needs a 65536-element array

#define S1(x) ((x) + 0) % 12 == 1, ((x) + 1) % 12 == 1, ((x) + 2) % 12 == 1, ((x) + 3) % 12 == 1, \
              ((x) + 4) % 12 == 1, ((x) + 5) % 12 == 1, ((x) + 6) % 12 == 1, ((x) + 7) % 12 == 1
#define S2(x) S1((x + 0)*8), S1((x + 1)*8), S1((x + 2)*8), S1((x + 3)*8), \
              S1((x + 4)*8), S1((x + 5)*8), S1((x + 6)*8), S1((x + 7)*8)
#define S3(x) S2((x + 0)*8), S2((x + 1)*8), S2((x + 2)*8), S2((x + 3)*8), \
              S2((x + 4)*8), S2((x + 5)*8), S2((x + 6)*8), S2((x + 7)*8)
#define S4(x) S3((x + 0)*8), S3((x + 1)*8), S3((x + 2)*8), S3((x + 3)*8), \
              S3((x + 4)*8), S3((x + 5)*8), S3((x + 6)*8), S3((x + 7)*8)

bool mod12e1[65536] = {
    S4(0U),  S4(8U),  S4(16U), S4(24U), S4(32U),  S4(40U),  S4(48U),  S4(56U),
    S4(64U), S4(72U), S4(80U), S4(88U), S4(96U),  S4(104U), S4(112U), S4(120U)
};

To use it, just replace `x % 12 == 1` with `mod12e1[x]`. This can of course be vectorized.

But since the result is only 1 or 0, you can also pack the table into a 65536-bit array to reduce its size to only 8 KB.
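
A sketch of that packed variant (the array and helper names are illustrative only, not from the answer above):

#include <stdbool.h>
#include <stdint.h>

uint8_t mod12e1_bits[65536 / 8]; // one bit per possible uint16_t value

void init_mod12e1_bits(void)
{
    // set exactly the bits for values with x % 12 == 1: 1, 13, 25, ...
    for (uint32_t x = 1; x < 65536; x += 12)
        mod12e1_bits[x >> 3] |= (uint8_t)(1u << (x & 7));
}

bool lookup_mod12e1(uint16_t x)
{
    return (mod12e1_bits[x >> 3] >> (x & 7)) & 1;
}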


You can also check divisibility by 12 by checking divisibility by 4 and by 3 separately. Divisibility by 4 is trivial. Divisibility by 3 can be computed in multiple ways:

  • One is calculating the difference between the sum of the digits in odd positions and the sum of the digits in even positions, as in גלעד ברקן's answer, and checking whether that difference is divisible by 3.

  • Or you can sum the digits in base 2^(2k) (i.e. base 4, 16, 64, ...) and check whether that sum is divisible by 3.

    That works because in base b, divisibility by any divisor n of b - 1 can be checked by testing whether the sum of the digits is divisible by n. Here's an implementation:

      void modulo12equals1(uint16_t d[], uint32_t size) {
          for (uint32_t i = 0; i < size; i++)
          {
              uint16_t x = d[i] - 1;
              bool divisibleBy4 = x % 4 == 0;
              x = (x >> 8) + (x & 0x00ff); // max 1FE
              x = (x >> 4) + (x & 0x000f); // max 2D
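              // the octal constant below has bit k set iff k is a multiple of 3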
              bool divisibleBy3 = !!((01111111111111111111111ULL >> x) & 1);
              d[i] = divisibleBy3 && divisibleBy4;
          }
      }
    

Credit for the divisibility-by-3 trick goes to Roland Illig.

Since the auto-vectorized assembly output is too long, you can check it on the Godbolt link


phuclv
  • My first instinct was a lookup table, which I suggested in the comments and retracted thinking it might not be as fast as operations. – גלעד ברקן Feb 17 '19 at 11:04
  • Wouldn't something like for(i=1;i<65536;i+=12) mod12e1[i]=true; be better for the lookup table? Because it doesn't use the % operator. – kingW3 Feb 17 '19 at 12:30
  • @kingW3 my table is constructed at compile time, while yours is populated at runtime – phuclv Feb 17 '19 at 13:48

If it would help to limit operations to bit operations and popcount, we can observe that a valid candidate must pass two tests, since x % 12 == 1 means x - 1 is divisible by both 4 and 3. First, the last two bits of x must be 01. Second, x - 1 must be divisible by 3, which we can determine from the difference between the even-positioned popcount and the odd-positioned popcount (the number is divisible by 3 exactly when that difference is); a C sketch of the same check follows the snippet below.

const evenMask = parseInt('1010101010101010', 2);
// Leave out first bit, we know it will be zero
// after subtracting 1
const oddMask = parseInt('101010101010100', 2);

console.log('n , Test 1: (n & 3)^3, Test 2: popcount diff:\n\n');

for (let n=0; n<500; n++){
  if (n % 12 == 1)
    console.log(
      n,
      (n & 3)^3,
      popcount(n & evenMask) - popcount(n & oddMask))
}

// https://stackoverflow.com/questions/43122082/efficiently-count-the-number-of-bits-in-an-integer-in-javascript
function popcount(n) {
  var tmp = n;
  var count = 0;
  while (tmp > 0) {
    tmp = tmp & (tmp - 1);
    count++;
  }
  return count;
}
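
For reference, a scalar C sketch of the same two tests (illustrative only, not part of the answer above; it uses the GCC/Clang builtin __builtin_popcount and a helper name chosen here):

#include <stdbool.h>
#include <stdint.h>

bool is_one_mod_12_popcount(uint16_t x)
{
    if ((x & 3) != 1)                          // test 1: the last two bits must be 01
        return false;
    uint16_t m = (uint16_t)(x - 1);            // m is now a multiple of 4; check divisibility by 3
    int diff = __builtin_popcount(m & 0x5555)  // bits at even positions (weight +1 mod 3)
             - __builtin_popcount(m & 0xAAAA); // bits at odd positions (weight -1 mod 3)
    return diff % 3 == 0;                      // m is divisible by 3 iff the difference is
}
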
גלעד ברקן
  • the unfortunate thing is that [there's no `popcnt` for SIMD registers](https://stackoverflow.com/q/3693981/995714) – phuclv Feb 17 '19 at 05:29
  • @phuclv thank you for your comment. I don't know too much about it but the OP mentioned AVX2 - could these articles (and others) about popcount be relevant? https://lemire.me/en/publication/arxiv1611.07612/ and https://news.ycombinator.com/item?id=11277891 – גלעד ברקן Feb 17 '19 at 06:00
  • Yes you *can* SIMD popcount with some extra instructions (or with AVX512VPOPCNTDQ for dword elements). You can probably replace `vpsadbw` (horizontal 64-bit add) with a horizontal add into 16-bit elements (e.g. `pmaddubsw`) to get per-16-bit-chunk popcnt. If you had [AVX512BITALG for native `vpopcntw`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,AVX2,FMA,AVX_512,Other&text=vpop&expand=4353,4344) (per-16-bit-element SIMD popcnt) it might even be faster than the normal fixed-point multiplicative inverse using high-half multiply. – Peter Cordes Oct 29 '19 at 22:36

This is the best I could come up with

uint64_t F(uint64_t vec) {
    //1024 = 4 mod 12 -> max val 0x4FB
    vec = ((vec & 0xFC00FC00FC00FC00L) >> 8) + (vec & 0x03FF03FF03FF03FFL);
    //64 = 4 mod 12 -> max val 0x87
    vec = ((vec & 0x07C007C007C007C0L) >> 4) + (vec & 0x003F003F003F003FL);
    //16 = 4 mod 12 -> max val 0x2B
    vec = ((vec & 0x00F000F000F000F0L) >> 2) + (vec & 0x000F000F000F000FL);
    //16 = 4 mod 12 -> max val 0x13
    vec = ((vec & 0x0030003000300030L) >> 2) + (vec & 0x000F000F000F000FL);
    //16 = 4 mod 12 -> max val 0x0f
    vec = ((vec & 0x0030003000300030L) >> 2) + (vec & 0x000F000F000F000FL);

    //Each field is now 4 bits, and only 1101 and 0001 are 1 mod 12.
    //The top 2 bits must be equal and the other two must be 0 and 1

    return vec & ~(vec>>1) & ~((vec>>2)^(vec>>3)) & 0x0001000100010001L;
}
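
A small cross-check sketch (illustrative only, not part of the answer; it assumes the F above is in the same file) that packs four uint16_t values into the question's uint64_t layout and compares against the plain % 12 test:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (uint32_t a = 0; a < 20000; a++) {
        uint16_t c[4] = { (uint16_t)a, (uint16_t)(a + 1),
                          (uint16_t)(a + 513), (uint16_t)(a * 7) };
        uint64_t vec = 0;
        for (int k = 0; k < 4; k++)
            vec |= (uint64_t)c[k] << (16 * k);

        uint64_t r = F(vec);
        for (int k = 0; k < 4; k++)
            assert(((r >> (16 * k)) & 0xFFFF) == (uint64_t)(c[k] % 12 == 1));
    }
    puts("ok");
    return 0;
}
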
Matt Timmermans

There's a recent post on Daniel Lemire's blog on fast remainder computation and divisibility checks. For example, you can check for divisibility by 12 with ((x * 43691) & 0x7ffff) < 43691, or assuming 32-bit operations with x * 357913942 < 357913942. This should be easy to parallelize but it requires 32-bit multiplications unlike the code in phuclv's answer.
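
Applied to this question, x % 12 == 1 exactly when x - 1 is divisible by 12, so the divisibility test is fed x - 1. A scalar sketch (the helper name is illustrative):

#include <stdbool.h>
#include <stdint.h>

bool is_one_mod_12_lemire(uint16_t x)
{
    // 43691 = ceil(2^19 / 12); the low 19 bits of n * 43691 are below 43691
    // exactly when n is divisible by 12.
    uint32_t n = (uint16_t)(x - 1);
    return ((n * 43691u) & 0x7ffff) < 43691u;
}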

nwellnhof

Note that x mod 12 == 1 implies that x mod 4 == 1, and the latter is extremely cheap.

So:

is_mod_12 = ((input & 3) == 1) && ((input % 12) == 1);

will, in the case where input mod 4 is frequently not 1, save you a lot of modulo operations. However if input mod 4 usually is 1, this will make performance ever so slightly worse.

Ben Voigt
  • There are missing parentheses around `&`. And as modulo is optimized to multiplication, this added `&` operation might be a de-optimization (even if the input is evenly distributed - I'm not sure, it needs to be benchmarked). – geza Feb 16 '19 at 18:40
  • 1
    @geza: The advantage of bitwise AND over multiplication is smaller than over division, but it is still significant. – Ben Voigt Feb 16 '19 at 18:44
  • 4
    But there is an additional branch, which can be very expensive (both gcc and clang compile this to a branch). The performance of this recommendation highly depends on the input. If the branch cannot be predicted well, this code runs much slower than the original one. – geza Feb 16 '19 at 18:49
  • @geza: Sure, I acknowledged the potential deoptimization in the very first version of this answer. But the branch has nothing to do with whether the modulo has been transformed to use a multiply instruction. – Ben Voigt Feb 16 '19 at 18:50
  • The OP is looking for SIMD with AVX2 (but failed to tag it that way). That means doing everything *branchlessly*, so we'd always need the `% 12` result :/ You could try checking if `(input & 3) == 1` was false for all 16x 2-byte elements as an early-out to skip the `%12`, but depending on the data that's unlikely to skip many full vectors. The wider your vectors and the smaller your elements, the less often an early-out works for all elements in the vector. – Peter Cordes Oct 29 '19 at 22:41