
I'm looking for an SSE Bitwise OR between components of same vector. (Editor's note: this is potentially an X-Y problem, see below for the real comparison logic.)

I am porting some SIMD logic from SPU intrinsics. It has an instruction

spu_orx(a)

which, according to the docs, does the following:

spu_orx: OR word across. d = spu_orx(a). The four word elements of vector a are logically ORed. The result is returned in word element 0 of vector d. All other elements (1, 2, 3) of d are assigned a value of zero.

How can I do that with SSE2–SSE4 using the fewest instructions? _mm_or_ps is all I have found so far.

UPDATE:

Here is the scenario from SPU based code:

qword res =  spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))

So it first ORs two 'greater than' comparisons vertically, then horizontally ORs that result across its elements. Later, pairs of those results are ANDed to get the final comparison value.

This is effectively doing (A||B||C||D||E||F||G||H) && (I||J||K||L||M||N||O||P) && ... where A..D are the 4x 32-bit elements of the fcgt(x,y) and so on.

Obviously vertical _mm_or_ps of _mm_cmp_ps results is a good way to reduce down to 1 vector, but then what? Shuffle + OR, or something else?
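For reference, the vertical part in SSE would look something like this (a sketch with my own variable names, assuming x, y, z, w are __m128):

__m128 gt0  = _mm_cmpgt_ps(x, y);   // per-element x > y: all-ones or all-zero
__m128 gt1  = _mm_cmpgt_ps(z, w);
__m128 ored = _mm_or_ps(gt0, gt1);  // vertical OR; one vector of 0 / -1 masks left to reduce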

UPDATE 1

Regarding "but then what?" I perform

     qword res =  spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))

On SPU it goes like this:

 qword aRes  = si_and(res, res1);
 qword aRes1 = si_and(aRes, res2);
 qword aRes2 = si_and(aRes1, res3);
 return si_to_uint(aRes2);

several times on different inputs, then AND all of those into a single result, which is finally cast to an integer 0 or 1 (a false/true test).

Michael IV
  • Are you interested in exactly that operation, or do you need it, e.g., to determine if any bit is set (e.g., after a compare instruction)? – chtz Jul 18 '19 at 12:19
  • SSE/AVX doesn't have many horizontal operations, and most of the ones that exist are inefficient. What do you *actually* need? If it's branching on a vector compare result, use `_mm_movemask_ps(v) != 0` to check for any element having its high bit set. – Peter Cordes Jul 18 '19 at 13:02
  • @chtz second, I will add an example – Michael IV Jul 18 '19 at 13:46
  • So your vector elements are compare results. And what you really want is SIMD `(vx[0] > vy[0] || vx[1] > vy[1] || vz[0] > vw[0] || vz[1] > vw[1]) AND (... || ...)` regardless of how that's evaluated. Yeah, we do need a horizontal OR before we can do a vertical AND, so either SIMD shuffle+OR, psadbw, or `movemask` and integer stuff. – Peter Cordes Jul 18 '19 at 14:14
  • Ok, if you eventually want a booleanized 0/1 integer, PTEST -> scalar AND is probably good. – Peter Cordes Jul 19 '19 at 15:04
  • I'm not sure if I understood your question well, but maybe the idea [here](https://godbolt.org/z/vNJAsA), is also of interest for update 1 of your question? – wim Jul 19 '19 at 17:45

1 Answer


SSE4.1 PTEST: bool any_nonzero = !_mm_testz_si128(v,v);

That would be a good way to horizontal OR + booleanize a vector into a 0/1 integer. It will compile to multiple instructions, and ptest same,same is 2 uops on its own. But once you have the result as a scalar integer, scalar AND is even cheaper than any vector instruction, and you can branch on the result directly because it sets integer flags.

#include <immintrin.h>
bool any_nonzero_bit(__m128i v) {
    return !_mm_testz_si128(v,v);
}

On Godbolt with gcc9.1 -O3 -march=nehalem:

any_nonzero_bit(long long __vector(2)):
    ptest   xmm0, xmm0                        # 2 uops
    setne   al                                # 1 uop with false dep on old value of RAX
    ret

This is only 3 uops on Intel for a horizontal OR into a single bit in an integer register. AMD Ryzen ptest is only 1 uop so it's even better.

The only risk here is if gcc or clang creates false dependencies by not xor-zeroing eax before doing a setcc into AL. Usually gcc is pretty fanatical about spending extra uops to break false dependencies so I don't know why it doesn't here. (I did check with -march=skylake and -mtune=generic in case it was relying on Nehalem partial-register renaming for -march=nehalem. Even -march=znver1 didn't get it to xor-zero EAX before the ptest.)
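Combining several vectors this way is then cheap scalar work. A minimal sketch, assuming res, res1, res2, res3 are the vertically-ORed compare results from the question (the helper names are mine):

#include <immintrin.h>

static inline bool any_set(__m128 v) {
    __m128i iv = _mm_castps_si128(v);     // reinterpret, no instruction
    return !_mm_testz_si128(iv, iv);      // horizontal OR, booleanized to 0/1
}

bool final_test(__m128 res, __m128 res1, __m128 res2, __m128 res3) {
    // non-short-circuit & on the 0/1 results avoids extra branching
    return any_set(res) & any_set(res1) & any_set(res2) & any_set(res3);
}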

It would be nice if we could avoid the _mm_or_ps and have PTEST do all the work. But even if we consider inverting the comparisons, the vertical-AND / horizontal-OR behaviour doesn't let us check something about all 8 elements of 2 vectors, or about any of those 8 elements.

e.g. Can PTEST be used to test if two registers are both zero or some other condition?

// NOT USEFUL
// 1 if all the vertical pairs AND to zero,
// but 0 if even one vertical AND result is non-zero
_mm_testz_si128( _mm_castps_si128(_mm_cmpngt_ps(x,y)),
                 _mm_castps_si128(_mm_cmpngt_ps(z,w)));

I mention this only to rule it out and save you the trouble of considering this optimization idea. (@chtz suggested it in comments. Inverting the comparison is a good idea that can be useful for other ways of doing things.)


Without SSE4.1 / delaying the horizontal OR

We might be able to delay horizontal ORing / booleanizing until after combining some results from multiple vectors. This makes combining more expensive (imul or something), but saves 2 uops in the vector -> integer stage vs. PTEST.

x86 has cheap vector mask->integer bitmap with _mm_movemask_ps. Especially if you ultimately want to branch on the result, this might be a good idea. (But x86 doesn't have a || instruction that booleanizes its inputs either so you can't just & the movemask results).

One thing you can do is integer-multiply movemask results: x * y is non-zero iff both inputs are non-zero, unlike x & y which can be false for 0b0101 & 0b1010, for example. (Our inputs are 4-bit movemask results and unsigned is 32-bit, so we have some room before we overflow.) AMD Bulldozer-family has an integer multiply that isn't fully pipelined, so this could be a bottleneck on old AMD CPUs. Using just 32-bit integers is also good for some low-power CPUs with slow 64-bit multiply.

This might be good if throughput is more of a bottleneck than latency, although movmskps can only run on one port.
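A minimal sketch of the movemask + multiply trick for two compare-result vectors (the function name is mine):

#include <immintrin.h>

bool both_groups_nonzero(__m128 res, __m128 res1) {
    unsigned m0 = _mm_movemask_ps(res);   // 4-bit bitmap of sign bits, 0..15
    unsigned m1 = _mm_movemask_ps(res1);
    return m0 * m1 != 0;                  // product is non-zero iff both masks are non-zero
}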

I'm not sure if there are any cheaper integer operations that let us recover the logical-AND result later. Adding doesn't work; the result is non-zero even if only one of the inputs was non-zero. Concatenating the bits together (shift+or) is also of course like an OR if we eventually just test for any non-zero bit. We can't just bitwise AND because 2 & 1 == 0, unlike 2 && 1.


Keeping it in the vector domain

Horizontal OR of 4 elements takes multiple steps.

The obvious way is _mm_movehl_ps + OR, then another shuffle+OR. (See Fastest way to do horizontal float vector sum on x86 but replace _mm_add_ps with _mm_or_ps)
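A sketch of that reduction (the function name is mine; the full OR of all 4 elements ends up in element 0):

#include <immintrin.h>

__m128 horizontal_or(__m128 v) {
    __m128 hi  = _mm_movehl_ps(v, v);                             // elements 2,3 down into 0,1
    __m128 or2 = _mm_or_ps(v, hi);                                // 2 partial results in elements 0,1
    __m128 sh  = _mm_shuffle_ps(or2, or2, _MM_SHUFFLE(1,1,1,1));  // broadcast element 1
    return _mm_or_ps(or2, sh);                                    // element 0 = OR of all 4
}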

But since we don't actually need an exact bitwise-OR when our inputs are compare results, we just care if any element is non-zero. We can and should think of the vectors as integer, and look at integer instructions like 64-bit element ==. One 64-bit element covers/aliases two 32-bit elements.

__m128i cmp = _mm_castps_si128(cmpps_result);               // reinterpret: zero instructions
                 // SSE4.1 pcmpeqq 64-bit integer elements
__m128i cmp64 = _mm_cmpeq_epi64(cmp, _mm_setzero_si128());  // -1 if both elements were zero, otherwise 0
__m128i swap =  _mm_shuffle_epi32(cmp64, _MM_SHUFFLE(1,0, 3,2));  // copy and swap, no movdqa instruction needed even without AVX
__m128i bothzero = _mm_and_si128(cmp64, swap);              // both halves have the full result

After this logical inversion, ORing together multiple bothzero results will give you the AND of multiple conditions you're looking for.
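A sketch of that combining step for two compare groups, where bothzero_a and bothzero_b are results of the snippet above run on two different input sets (the names are mine):

#include <immintrin.h>

bool all_groups_true(__m128i bothzero_a, __m128i bothzero_b) {
    __m128i anyfail = _mm_or_si128(bothzero_a, bothzero_b);  // all-ones somewhere iff some group was all-false
    return _mm_testz_si128(anyfail, anyfail);                // SSE4.1; _mm_movemask_epi8(anyfail) == 0 also works
}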

Alternatively, SSE4.1 _mm_minpos_epu16(cmp64) (phminposuw) will tell us in 1 uop (but 5 cycle latency) if either qword is zero. It will place either 0 or 0xFFFF in the lowest word (16 bits) of the result in this case.

If we inverted the original compares, we could use phminposuw on that (without pcmpeqq) to check if any are zero. So basically a horizontal AND across the whole vector. (Assuming that it's elements of 0 / -1). I think that's a useful result for inverted inputs. (And saves us from using _mm_xor_si128 to flip the bits).
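A sketch of that horizontal AND, assuming inverted_cmp holds 0 / -1 elements from inverted compares (the function name is mine):

#include <immintrin.h>

bool all_elements_set(__m128i inverted_cmp) {
    __m128i min = _mm_minpos_epu16(inverted_cmp);        // min of the 8 uint16 words -> word 0
    return (_mm_cvtsi128_si32(min) & 0xFFFF) == 0xFFFF;  // only 0xFFFF if every word was 0xFFFF
}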

An alternative to pcmpeqq (_mm_cmpeq_epi64) would be SSE2 psadbw against a zeroed vector to get 0 or non-zero results in the bottom of each 64-bit element. It won't be a mask, though, it's 0xFF * 8. Still, it's always that or 0 so you can still AND it. And it doesn't invert.
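A sketch of the psadbw version (the function name is mine; note the results are sums, not masks):

#include <immintrin.h>

__m128i qword_or_as_sums(__m128 cmpps_result) {
    __m128i cmp = _mm_castps_si128(cmpps_result);
    // each 64-bit lane becomes the sum of its 8 bytes: 0 iff both
    // 32-bit compare results in that lane were false, else a multiple of 0xFF
    return _mm_sad_epu8(cmp, _mm_setzero_si128());
}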

Peter Cordes
  • For the exact use case of the OP you could use the "anding" of `ptest` to save the `orps`, by negating the comparisons, i.e., `_mm_testz_si128(_mm_castps_si128(_mm_cmpngt_ps(x,y)), _mm_castps_si128(_mm_cmpngt_ps(z,w)));` – chtz Jul 18 '19 at 15:57
  • To force zeroing `eax` before `set** al` you can return `int` instead of `bool` (not sure how much influence this has once it is inlined): https://godbolt.org/z/axxOo5 – chtz Jul 18 '19 at 16:02
  • @chtz: IDK, that might help even after inlining. Worth a try if a false dep causes a bottleneck or reduces ILP. With good choices of destination registers, spending extra uops on xor-zeroing wouldn't be worth it. On AMD (no partial-reg renaming), even using `AL` and `AH`, and DL,DH, as separate destinations would be good if there are many vectors, allowing SWAR AND before eventually doing one last horizontal AND with `test al,ah`. But on Intel that would be bad with partial-register merging uops. – Peter Cordes Jul 18 '19 at 16:07
  • @chtz: Good idea with inverting the inputs so you need a vertical AND. But doesn't `testz` still give you a horizontal OR when you need a horizontal AND? Maybe I'm mixing something up; gtg for lunch right now. Related: [Can PTEST be used to test if two registers are both zero or some other condition?](//stackoverflow.com/q/43712243) – Peter Cordes Jul 18 '19 at 16:10
  • @chtz: You had me convinced for a minute, but I was finding problems commenting the code while making an edit to the answer. Yeah, vertical-AND / horizontal-NOR doesn't give us all-true or all-false for 2 separate inputs. – Peter Cordes Jul 18 '19 at 16:37