Bit trick to detect if any of some integers has a specific value

Question

Is there any clever bit trick to detect if any of a small number of integers (say 3 or 4) has a specific value?

The straightforward

bool test(int a, int b, int c, int d)
{
    // The compiler will pretty likely optimize it to (a == d | b == d | c == d)
    return (a == d || b == d || c == d);
}

in GCC compiles to

test(int, int, int, int):
        cmp     ecx, esi
        sete    al
        cmp     ecx, edx
        sete    dl
        or      eax, edx
        cmp     edi, ecx
        sete    dl
        or      eax, edx
        ret

Those sete instructions have higher latency than I want to tolerate, so I would rather use something bitwise (&, |, ^, ~) stuff and a single comparison.

@EvanTeran Not yet. I would keep it as platform independent as possible. — plasmacel, Jul 24 '17 at 21:15
@Austin_Anderson Most compilers will optimize such expressions to that form, and they emit the same instructions. You can try it on gcc.godbolt.org — plasmacel, Jul 24 '17 at 21:18
Is there anything that make the problem easier? Like limited interval of `int` values? — geza, Jul 24 '17 at 21:28
@geza These integers will be actually pointers, so I don't think that we can make any assumptions. — plasmacel, Jul 24 '17 at 21:29
I don't know if it's particularly fast, but here's an example using gcc SIMD intrinsics. May we worth bench marking: https://godbolt.org/g/i2or76 — Evan Teran, Jul 24 '17 at 21:33
@EvanTeran what if r[0] != 0 and r[1] != 0, but r[0] & r[1] is 0? — Iłya Bursov, Jul 24 '17 at 21:34
@EvanTeran Example test case that meets Ilya's condition: `test(2, 1, 1, 0)` — Justin, Jul 24 '17 at 21:40
on which CPU have you problems with `sete`, btw.? Recent Intel CPUs has a latency of one for `sete`. — geza, Jul 24 '17 at 21:50
Do the arguments have to be passed as scalars? That pretty much kills SIMD right from the start, there is no reasonable way to gather them together in a vector register. — harold, Jul 24 '17 at 22:46
@Justin Here's a fixed version, I was trying to be too clever, should have just used a direct compare! https://godbolt.org/g/yyPrv3 — Evan Teran, Jul 24 '17 at 23:12
@harold, agreed (which is why i posted as a comment, not an answer), but maybe if it can be demonstrated that SIMD would be superior, it would make it worth while to do more of the larger computation in SSE registers making the algorithm as whole faster. — Evan Teran, Jul 24 '17 at 23:15
@EvanTeran you can do a bit better in SIMD than that btw, for example using `{d, d, d, d}` makes it a bit shorter, and so does using `_mm_movemask_ps` to get the result instead of extract/scalar-OR but I see no way to make the abc part not horrible — harold, Jul 24 '17 at 23:23
@harold, Good call! I generally agree about the loading of the data, it really kills how much it ends up being worth it. Oh well. — Evan Teran, Jul 24 '17 at 23:25
The problem with this code is not the `SETE` instructions, *per se*, but the partial register stalls that very likely come from writing byte-sized registers and then trying to read from the full dword-sized registers. I have no idea why GCC is making this inane mistake. Clang doesn't, nor does MSVC; they `OR` only the byte-sized registers. Looks like a bug for GCC, since it emits the same code even when targeting older architectures that *definitely* had partial register stalls here. Should probably be filed with the GCC folks. — Cody Gray - on strike, Jul 25 '17 at 13:27
The best answer probably depends on the distribution of `a`, `b`, `c`, and `d`. For example, if they are all uniformly randomly distributed, the shortcut check `((a^b) & (b^d) & (c^d)) == 0` will be very fast and will never produce a false negative (i.e., say that there was no match where there was). You can get rare false positives, where this equals `0` but you just back it up with a second check (e.g., just checking the `xor` values one by one). Of course, your numbers are probably _not_ uniformly randomly distributed! — BeeOnRope, Jul 25 '17 at 19:42
@CodyGray Thanks for mentioning. I will report it to GCC if I get my Bugzilla account. — plasmacel, Jul 26 '17 at 07:25
@CodyGray Do you have a GCC Bugzilla account to report the issue? Registration is closed now, you have to request an account in email, but all I got in response is a mailer daemon generated message that the message cannot be delivered. — plasmacel, Jul 29 '17 at 12:38
I have a Bugzilla. I'd be happy to report it. Not sure what is happening for you, though. Registration was "closed" when I created my account, too. I still got the email within a few hours. Anyway, it's on my to-do list. :-) — Cody Gray - on strike, Jul 30 '17 at 09:45
Okay, done. [Bug officially reported](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81614). — Cody Gray - on strike, Jul 30 '17 at 11:30
@plasmacel - is the code being used in a loop? When performance for such a low level operation matters, I would assume that it is, and if so there may be more optimization opportunities. What are the relative probabilities of `(a == d)`, `(b == d)` and `(c == d)`? — BeeOnRope, Jul 31 '17 at 16:41
@BeeOnRope Yeah, it is used in a loop. Well, the distribution really depends on the input, and its pretty much indeterminate. — plasmacel, Jul 31 '17 at 17:02
Can you show a reasonable facsimile of the outer loop? Most of the answers here, while good, may be way off track when you consider this method will be inlined into the loop which may dramatically change the generated code. You should try to take a stab at the distribution question. Even if it is "sometimes it's like this" and "sometimes it's like that" it would be fine and the best approach may be to dynamically select between two or more implementations at runtime. For example, branching is very good if the answer is usually the same. — BeeOnRope, Jul 31 '17 at 17:05

Iłya Bursov · Accepted Answer · 2017-07-25T20:02:46.260

5

The only solution I've found yet is:

int s1 = ((a-d) >> 31) | ((d-a) >> 31);
int s2 = ((b-d) >> 31) | ((d-b) >> 31);
int s3 = ((c-d) >> 31) | ((d-c) >> 31);

int s = s1 & s2 & s3;
return (s & 1) == 0;

alternative variant:

int s1 = (a-d) | (d-a);
int s2 = (b-d) | (d-b);
int s3 = (c-d) | (d-c);

int s = (s1 & s2 & s3);
return (s & 0x80000000) == 0;

both are translated to:

mov     eax, ecx
sub     eax, edi
sub     edi, ecx
or      edi, eax
mov     eax, ecx
sub     eax, esi
sub     esi, ecx
or      esi, eax
and     esi, edi
mov     eax, edx
sub     eax, ecx
sub     ecx, edx
or      ecx, eax
test    esi, ecx
setns   al
ret

which has less sete instructions, but obviously more mov/sub.

Update: as BeeOnRope@ suggested - it makes sense to cast input variables to unsigned

edited Jul 25 '17 at 20:02

answered Jul 24 '17 at 22:00

Iłya Bursov

23,342
4
33
57

alternate, almost same asm: `int s1 = (a-d) | (d-a); .... ; int s = s1 & s2 & s3; return (s & 0x80'00'00'00'00) != 0x80'00'00'00'00;` – bolov Jul 24 '17 at 22:20
@bolov nice catch, actually it looks like compiler already applied this optimization – Iłya Bursov Jul 24 '17 at 22:21
2

Did you actually test and see that this was faster? I suspect it will not be. `SETcc` instructions are not especially slow on modern Intel CPUs. (Although, that code in the question has potential partial-register stalls, depending on which processor you're running it on. I'm very surprised to see GCC generating code like that. It should know better!) – Cody Gray - on strike Jul 25 '17 at 13:24
2

Signed over/underflow is undefined behavior in C++, and this is likely to trigger it if the numbers are large in magnitude. You could cast to unsigned... – BeeOnRope Jul 25 '17 at 19:45
@BeeOnRope how unsigned could help here? as soon as here we have 2 subtractions - at least one of them will have sign bit for different numbers even with overflow (for x86 of course) – Iłya Bursov Jul 25 '17 at 19:55
1

Yes, but underflow or overflow is _undefined behavior_ for signed values, but has the defined behavior you want for unsigned. So the code above could crash, return a wrong value, whatever on some implementation (and indeed, modern compilers do already generate code that does the wrong thing on overflow, although perhaps not yet in this case). Casting to unsigned means you play within the rules and produce legal code for all values. – BeeOnRope Jul 25 '17 at 20:00
@BeeOnRope ok, actually for signed it is not ub, but implementation defined, anyway let it be unsigned – Iłya Bursov Jul 25 '17 at 20:04
@IlyaBursov - nope, it is well known [that it is UB](https://stackoverflow.com/q/18195715/149138), and compilers rely on it (e.g., assuming it can _never_ happen to simply signed loop bounds). – BeeOnRope Jul 25 '17 at 20:05
@BeeOnRope it is https://stackoverflow.com/questions/18994563/is-unsigned-underflow-and-signed-overflow-defined-behaviour `Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.` – Iłya Bursov Jul 25 '17 at 20:06
@IlyaBursov wrong. There's a reason that answer has 0 net upvotes, and the one I linked has 100+. Here's [another one](https://stackoverflow.com/questions/16188263/is-signed-integer-overflow-still-undefined-behavior-in-c). The one you linked to incorrectly quotes the part about _integer conversion_, and **not** about integer overflow. Conversion is implementation-defined, which is exactly why my casting suggestion works. Signed overflow is **undefined** and this is a famous and well known fact since it lead to some very interesting bugs. A good [read](https://blog.regehr.org/archives/1401). – BeeOnRope Jul 25 '17 at 20:12
@IlyaBursov - actually the issue with the one you linked so is mostly a problem with the way the question was posed. The title talks about "signed overflow" but the example in the code is actually about signed to unsigned _conversion_. So the answer there is correct for the given code, but the title is very misleading. To summarize: signed under/overflow (as is possible in your subtraction) is UB, but signed to unsigned _conversion_ (either direction) is implementation-defined and I'm not aware of any compiler today that does it in a way other than the "obvious" bitwise no-op conversion. – BeeOnRope Jul 25 '17 at 20:23

score 3 · Answer 2 · edited Jul 27 '17 at 20:27

3

It is not a full bit trick. Any zero yields a zero product, which gives a zero result. Negate 0 yields a 1. Does not deal with overflow.

bool test(int a, int b, int c, int d)
{
    return !((a^d)*(b^d)*(c^d));
}

gcc 7.1 -O3 output. (d is in ecx, the other inputs start in other integer regs).

    xor     edi, ecx
    xor     esi, ecx
    xor     edx, ecx
    imul    edi, esi
    imul    edx, edi
    test    edx, edx
    sete    al
    ret

It might be faster than the original on Core2 or Nehalem where partial-register stalls are a problem. imul r32,r32 has 3c latency on Core2/Nehalem (and later Intel CPUs), and 1 per clock throughput, so this sequence has 7 cycle latency from the inputs to the 2nd imul result, and another 2 cycles of latency for test/sete. Throughput should be fairly good if this sequence runs on multiple independent inputs.

Using a 64-bit multiply would avoid the overflow problem on the first multiply, but the second could still overflow if the total is >= 2**64. It would still be the same performance on Intel Nehalem and Sandybridge-family, and AMD Ryzen. But it would be slower on older CPUs.

In x86 asm, doing the second multiply with a full-multiply one-operand mul instruction (64x64b => 128b) would avoid overflow, and the result could be checked for being all-zero or not with or rax,rdx. We can write that in GNU C for 64-bit targets (where __int128 is available)

bool test_mulwide(unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned __int128 mul1 = (a^d)*(unsigned long long)(b^d);
    return !(mul1*(c^d));
}

and gcc/clang really do emit the asm we hoped for (each with some useless mov instructions):

   # gcc -O3 for x86-64 SysV ABI
    mov     eax, esi
    xor     edi, ecx
    xor     eax, ecx
    xor     ecx, edx   # zero-extends
    imul    rax, rdi
    mul     rcx        # 64 bit inputs (rax implicit), 128b output in rdx:rax
    mov     rsi, rax   # this is useless
    or      rsi, rdx
    sete    al
    ret

This should be almost as fast as the simple version that can overflow, on modern x86-64. (mul r64 is still only 3c latency, but 2 uops instead of 1 for imul r64,r64 that doesn't produce the high-half), on Intel Sandybridge-family.)

It's still probably worse than clang's setcc/or output from the original version, which uses 8-bit or instructions to avoid reading 32-bit registers after writing the low byte (i.e. no partial-register stalls).

See both sources with both compilers on the Godbolt compiler explorer. (Also included: @BeeOnRope's ^ / & version that risks false positives, with and without a fallback to a full check.)

edited Jul 27 '17 at 20:27

Peter Cordes

328,167
45
605
847

answered Jul 24 '17 at 21:57

EvilTeach

28,120
21
85
141

It's an interesting problem. I wonder if you could do something with assuming 32 bit ints, and using 64 bit ints to do a little parallel instead of full SSE. – EvilTeach Jul 24 '17 at 22:00
What with undefined behavior, here's a test case that fails: `1 << 30, 1 << 30, 1 << 30, 0`: http://coliru.stacked-crooked.com/a/68de37a69c0dfba2 – Justin Jul 24 '17 at 22:03
3

It seems *extremely* unlikely that 2 `imul` instructions will be faster than 2 `sete` instructions. – Cody Gray - on strike Jul 25 '17 at 13:23
@cody. Yep. I agree. – EvilTeach Jul 25 '17 at 14:02
1

How does NAN/INF come into it? These are integers, not floating point? – BeeOnRope Jul 25 '17 at 19:46
@peter. I would think that multiplying two 32 bit ints would not overflow a 64 bit int. Toss another 32 bit int into the mix, and it can overflow. What am I missing? – EvilTeach Jul 27 '17 at 13:05
@EvilTeach: I realized that after making the edit, but didn't go back and fix it. xD. There's a lot more room, though. – Peter Cordes Jul 27 '17 at 13:16
@PeterCordes - you need an extra set of parens to make my example work (because `&` binds weakly). The fixed version generates reasonable [code](https://godbolt.org/g/DHPJyT), especially latency wise (all the xor can happen in parallel, etc). I actually didn't mean you'd accept false positives, but branch to a more complete check if the result was zero, as in the `test_falsepositives2` version shown there. It messed up the code generation though because now it has to do more moves to keep around results it might need after the branch. – BeeOnRope Jul 27 '17 at 20:02
In the real world for a case with a very skewed distribution (i.e., most comparisons usually always false or always true), a series of `cmp ; jne` or `cmp je` comparisons is probably best: they fuse and all the checks happen in parallel, so the fast-path is very good. Try as I might, I could get `gcc` to [produce it](https://godbolt.org/g/UpSQ9H) with _any_ level of optimization, however! Perhaps PGO could do it, I don't know. – BeeOnRope Jul 27 '17 at 20:10
1

@BeeOnRope: derp, I thought I saw the asm was the same after removing the outer `()`, but obviously I missed the change. I updated @Evil's answer with that and a widening multiply version that avoids overflow. – Peter Cordes Jul 27 '17 at 20:29

Bit trick to detect if any of some integers has a specific value

2 Answers2