
I would be grateful if somebody could help me write a function that receives an AVX vector and checks whether it contains any element greater than zero.

I have written the following code, but it is not optimal because it stores the elements to memory and then inspects them one by one. The vector should be checked as a whole.

int check(__m256 vector)
{
  float * temp;
  posix_memalign ((void **) &temp, 32, 8 * sizeof(float));    
  _mm256_store_ps( temp, vector );

  int flag=0;
  for(int k=0; k<8; k++)
  {
    flag= ( (temp[k]>0) ? 1 : 0 );
    if (flag==1) return 1;
  }

  free( temp);
  return 0;
}
MROF
    There's this document called the Intel Software Developer manual which you should grab. When you look at it, you'll see in Volume B, Chapter 3 a list of all instructions as well the intrinsics you can use for each. Here you want `__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ)`, followed by `int cmp = _mm256_movemask_ps(vcmp)` to pick out and pack together the comparison results. If `cmp = 0xFF`, your condition is satisfied. – Iwillnotexist Idonotexist Oct 20 '14 at 20:52
  • @IwillnotexistIdonotexist - that would qualify as a very relevant answer, I think. – ryyker Oct 20 '14 at 20:53
  • @ryyker I suppose so but I could have sworn a variant of this had already been asked and was looking for it. I recall in a distant past an SSE variant of this, but perhaps it was checking equality to 0. – Iwillnotexist Idonotexist Oct 20 '14 at 21:00
  • @MROF I just realized that w.r.t. my comment above, the correct comparison to find if _any_ element is >0 is `cmp != 0`, not `cmp == 0xFF` which finds if _all_ elements >0. – Iwillnotexist Idonotexist Oct 20 '14 at 21:09
  • It is working ^_^ Many thanks for your cooperation. – MROF Oct 21 '14 at 20:33
  • The `return 1` path leaks the `temp` buffer. – Peter Cordes Jun 08 '15 at 07:52

1 Answer

If you're going to branch on the result, it's usually fewer uops to use the "traditional" compare / movemask / integer-test, like you would with SSE1.

__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
int cmp = _mm256_movemask_ps(vcmp);
if (cmp)
    return 1;

This typically compiles to something like

vcmplt_oqps  ymm2, ymm0, ymm1
vmovmskps    eax, ymm2

test         eax,eax
jnz      .true_branch

Those are all single-uop instructions, and test/jnz macro-fuse on Intel and AMD CPUs that support AVX, so this is only 3 total uops (on Intel).

See Agner Fog's instruction tables + microarch guide, and other guides linked from https://stackoverflow.com/tags/x86/info.
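For reference, the compare / movemask / integer-test sequence above can be wrapped into a complete function along the following lines. This is only a sketch: the names any_positive8 and all_positive8 are made up for illustration, the input is taken as a pointer to 8 floats rather than a __m256, and the target("avx") attribute is a GCC/Clang extension used here so the file builds without -mavx (the CPU must still support AVX at run time). The second function shows the "all elements" variant from the comments, which compares the mask against 0xFF instead of testing it for non-zero.

```c
#include <immintrin.h>

/* Hypothetical helpers, not part of any library.
   p must point to 8 floats; no alignment is required. */
__attribute__((target("avx")))
static int any_positive8(const float *p)
{
    __m256 x    = _mm256_loadu_ps(p);
    /* Ordered, non-signaling "0 < x"; NaN lanes compare false. */
    __m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
    /* One mask bit per lane: non-zero means at least one lane is > 0. */
    return _mm256_movemask_ps(vcmp) != 0;
}

__attribute__((target("avx")))
static int all_positive8(const float *p)
{
    __m256 x    = _mm256_loadu_ps(p);
    __m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
    /* All 8 mask bits set means every lane is > 0. */
    return _mm256_movemask_ps(vcmp) == 0xFF;
}
```

When the vector is already in a register, the load would be dropped and the __m256 passed directly, provided the whole translation unit is built with AVX enabled.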


You can also use PTEST, but it's less efficient for this case. See _mm_testc_ps and _mm_testc_pd vs _mm_testc_si128

Without AVX, ptest is handy for checking if a register is all-zero without needing extra instructions to copy it (since it sets integer flags directly). But since it's 2 uops, and can't macro-fuse with a jcc branch instruction, it's actually worse than the above:

// don't use, sub-optimal
__m256 vcmp = _mm256_cmp_ps(_mm256_setzero_ps(), x, _CMP_LT_OQ);
if (!_mm256_testz_si256(vcmp, vcmp)) {
    return 1;
}

The testz intrinsic is PTEST. It sets the ZF and CF flags directly based on the results of AND and AND NOT of its args. The testz intrinsic returns ZF, which is 1 only when vcmp has no set bits at all, so !testz(vcmp, vcmp) is true when vcmp has any non-zero bits (which it will only when vcmpps puts some there).

VPTEST with ymm regs is available with just AVX. AVX2 isn't required even though it looks like a vector-integer instruction.

This will compile to something like

vcmplt_oqps  ymm2, ymm0, ymm1
vptest       ymm2, ymm2
jnz      .true_branch

Probably smaller code-size than the above, but this is actually 4 uops instead of 3. If you were using setnz or cmovnz, macro-fusion wouldn't be a factor, so ptest would be break-even. As I mentioned above, the main use-case for ptest is when you can use it without a compare instruction, and without AVX.

The alternative for checking a vector for all-zero (pcmpeqb xmm0,xmm1 / pmovmskb eax, xmm0 / cmp eax, 0xFFFF, with xmm1 zeroed) has to destroy one of the input vectors without AVX, so it will require an extra movdqa instruction to copy if you still need both after the test.
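To make the all-zero use-case concrete, here is a rough sketch of both approaches with 128-bit vectors. The function names are invented for this example, and the target attributes are a GCC/Clang extension so that it builds without -msse4.1. Note that in the compare-based version the mask has to equal 0xFFFF, because all 16 bytes must have compared equal to zero.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: SSE4.1 ptest sets ZF directly, no compare needed. */
__attribute__((target("sse4.1")))
static int is_all_zero_ptest(const uint8_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    return _mm_testz_si128(v, v);          /* 1 if (v & v) == 0 */
}

/* Hypothetical helper: SSE2-only alternative using pcmpeqb / pmovmskb. */
__attribute__((target("sse2")))
static int is_all_zero_movemask(const uint8_t *p)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* One mask bit per byte; 0xFFFF means every byte was zero. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

With intrinsics the compiler handles the register copy itself, so the extra movdqa only shows up in the generated asm when the input is still live afterwards.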


ptest floating point bit-hacks

I think for this specific test, it might be possible to skip the compare instruction and use vptest directly to see if there are any float elements with their sign bit unset, but some non-zero bits elsewhere.

Actually no, that idea can't work, because it doesn't respect element boundaries. It couldn't tell the difference between a vector with a positive element vs. a vector with a +0.0 element (sign bit clear) and another element that was negative (other bits set).

vptest sets ZF=1 when (src1 & src2) is all-zero, and CF=1 when (~src1 & src2) is all-zero. I was thinking that src1=set1(0x7FFFFFFF) could tell us something useful about sign bits and non-sign bits, which we could test with a condition that checks CF and ZF. For example ja: CF=0 and ZF=0. There actually isn't an x86 condition that's only true with CF=1 and ZF=0, though, so that's another problem.
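The element-boundary problem can be checked with a small experiment. The sketch below (the helper name ptest_flags8 is made up, and target("avx") is a GCC/Clang extension) runs vptest with src1 = set1(0x7FFFFFFF) and packs the two resulting flags into an int. A vector with no positive element, such as {+0.0, -1.0, ...}, and a vector that does have one, such as {1.0, -1.0, ...}, should come out with the same flags, because the test only sees "some sign bit is set somewhere" and "some non-sign bit is set somewhere".

```c
#include <immintrin.h>

/* Hypothetical helper: returns (ZF << 1) | CF as produced by
   vptest with src1 = 0x7FFFFFFF in every lane and src2 = the data. */
__attribute__((target("avx")))
static int ptest_flags8(const float *p)
{
    __m256i v    = _mm256_castps_si256(_mm256_loadu_ps(p));
    __m256i mask = _mm256_set1_epi32(0x7FFFFFFF);
    int zf = _mm256_testz_si256(mask, v);  /* 1 if no non-sign bit is set  */
    int cf = _mm256_testc_si256(mask, v);  /* 1 if no sign bit is set      */
    return (zf << 1) | cf;
}
```

Both example vectors give ZF=0 and CF=0, so no flag condition can separate them.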

Also NaN > 0 is false, but NaN has some set bits. (exponent all-ones, mantissa non-zero, sign-bit = don't care so there can be +NaN and -NaN). If that was the only problem, this would still be useful in cases where NaN-handling isn't required.

Peter Cordes