6

As far as I know, integers in C++ can be treated like booleans, and we can have a code like this:

int a = 6, b = 10;
if (a && b) do something ---> true as both a and b are non-zero

Now, assume that we have:

__m256i a, b;

I need to apply logical_and (&&) for all 4 long variables in __m256i, and return true if one pair is non-zero. I mean something like:

(a[0] && b[0]) || (a[1] && b[1]) || ...

Do we have a fast code in AVX or AVX2 for this purpose?

I could not find any direct instruction for this purpose, and definitely, using the bitwise and (&) also is not the same.

halfer
  • 19,824
  • 17
  • 99
  • 186
  • 1
    Are you actually branching on the result? Is it much more predictable to have one branch, instead of two separate `vptest a,a`/`jz` and `b,b` branches? – Peter Cordes Feb 05 '22 at 21:46
  • 2
    Perhaps something with `_mm_min_epi8` or max? Not easily; `min( {0,1}, {1,0} )` is zero but neither input is zero as a while. Possibly saturating add or sub, but probably not that. I thought for a minute `a+b == b` might be useful, but it only tests if `a==0`; it tells you nothing about `b`. Obvious from how GCC optimizes a scalar version: https://godbolt.org/z/Gq759T5P7. – Peter Cordes Feb 05 '22 at 22:08
  • 1
    @PeterCordes Hi. First of all, many thanks for your time. Actually, I am using it in a loop and if it is zero then it breaks. I am new at SIMD stuff, and I am not sure what approach would be the best. – Mojtaba Valizadeh Feb 05 '22 at 22:33
  • 1
    @Mojtaba Valizadeh What is the allowed range for `a` and `b`? My mind is not clear on this yet, but basically I am thinking that one might utilize some sort of arithmetic approach if both operands are limited to small integers. – njuffa Feb 05 '22 at 22:36
  • 2
    @MojtabaValizadeh Maybe you can show the whole loop, then it's more clear what you are trying to do. it should like you are trying to do a fast-exit out of a SIMD loop. – Martin Berger Feb 05 '22 at 22:47
  • 1
    @MojtabaValizadeh In particular I am thinking along the lines of `a * b != 0` <==> `a && b != 0`. – njuffa Feb 05 '22 at 22:48
  • @njuffa: Oh, interesting. `pmaddubsw` and [`pmaddwd`](https://www.felixcloutier.com/x86/pmaddwd) saturate (and are single-uop), so you might not need range limits. (You definitely want an element-size smaller than 32-bit, since 32-bit integer multiplication is more expensive on most CPUs because the FP mantissa multipliers are only 24 bits wide per 32-bit of SIMD vector.) – Peter Cordes Feb 05 '22 at 22:51
  • @njuffa: multiply doesn't work, same problem as any idea based on vertical per-element operations. `{0,1} * {1,0}` is all-zero, but neither input is all-zero. I think you either need to cmp/movemask each compare result separately (or booleanize with ptest), or do some kind of horizontal max, or otherwise avoid anything that cares whether two corresponding elements are both non-zero. – Peter Cordes Feb 06 '22 at 01:57
  • 3
    For clarification: Do you want something equivalent to `(a[0] && b[0]) || (a[1] && b[1]) ...`? – chtz Feb 06 '22 at 02:05
  • @njuffa Hi. Many thanks for your time. Actually, my numbers can be in any range and there is no limitation, unfortunately. However, I had thought about multiplication but I considered it an expensive one. Thanks again for your comments. – Mojtaba Valizadeh Feb 06 '22 at 13:42
  • @PeterCordes Hi Peter. I think multiplication might be good for most examples. In your example {0, 1} * {1, 0} is all-zero and that is why we reject it and it is correct. The multiplication is non-zero if both are non-zero (if we don't think about the limitation of bits). Thanks again for your time. – Mojtaba Valizadeh Feb 06 '22 at 13:45
  • 1
    So you want to keep looping only if `a` and `b` are non-zero in the same qword, not if either have a non-zero bit anywhere? That's not at all clear from your question, that's why chtz asked. So for example, you'd consider `[0,1,0,0] && [1,0,0,0]` false, even though both vectors are non-zero? If so, then chtz's answer is even better than `vpmaddwd` + `vpcmpeqq` / movemask. – Peter Cordes Feb 06 '22 at 13:48
  • @PeterCordes It was my bad Peter as I am new in both SIMD and English. I really appreciate your time. I am learning from you. – Mojtaba Valizadeh Feb 06 '22 at 14:14
  • 3
    To be fair, your SIMD code did actually do what you wanted. But I wasn't sure if that was a bug or intended, and it took time & effort to follow its logic, since it was doing an ADD on the cmp results + another compare, instead of just `_mm256_or_si256(cmpz1, cmpz2)` / `_mm256_movemask_epi8` and then check for `!= 0xFFFFFFFF` to see if there was one spot where neither element was zero. That would be the "normal" way to do what you wanted, just using standard techniques, with bitwise booleans on compare results. – Peter Cordes Feb 06 '22 at 14:25
  • @PeterCordes I am really happy to hear that. Thanks for your kind consideration. – Mojtaba Valizadeh Feb 06 '22 at 14:35

1 Answers1

8

You can cleverly combine a vpcmpeqq with a vptest:

__m256i mask = _mm256_cmpeq_epi64(a, _mm256_set1_epi64x(0));
bool result = ! _mm256_testc_si256(mask, b);

The result is true if and only if (~mask & b) != 0 or

((a[i]==0 ? 0 : -1) & b[i]) != 0 // for some i
// equivalent to
((a[i]==0 ? 0 : b[i])) != 0      // for some i
// equivalent to
a[i]!=0 && b[i]!=0               // for some i

which is equivalent to what you want.

Godbolt-link (play around with a and b): https://godbolt.org/z/aTjx7vMKd

If result is a loop condition, the compiler should of course directly do a jb/jnb instruction instead of setnb.

chtz
  • 17,329
  • 4
  • 26
  • 56
  • Huh, after looking at the question's code more carefully, yeah it is only detecting cases where two non-zero qwords are at corresponding positions, treating it as false for cases like `{0,1} , {1,0}`. In which case this would work without having to horizontally-OR the mask like I thought the question was asking for. (Good idea with `vptest`; I'd discounted it because *on its own* it can't do this: [Can PTEST be used to test if two registers are both zero or some other condition?](https://stackoverflow.com/q/43712243)) – Peter Cordes Feb 06 '22 at 10:09
  • @chtz Thanks for your answer. I think this is exactly what I want. Many thanks for your time. – Mojtaba Valizadeh Feb 06 '22 at 13:53
  • Even with AVX-512, we can't do any better, other than avoid the zeroed vector constant: still 3 uops for 2x `VPTESTMQ k1, same,same` into masks, then `ktest` to clear ZF if there are any corresponding non-zero elements. (Zero-masking for the second compare would produce a single ANDed bitmask, but with less ILP, and not helpeful because one of kortest / ktest is still needed anyway.) Since `vptest` is 2 uops on most CPUs except Zen2 (1), this is also 3 uops. – Peter Cordes Feb 06 '22 at 15:10