16

I am new to GCC's C vector extensions. According to the manual, comparing one vector to another, as in (test = vec1 > vec2;), yields a vector "test" that contains 0 in each element where the comparison is false and -1 in each element where it is true.

But how can I very quickly check whether ANY of the element comparisons was true? And, further, how can I tell which is the first element for which the comparison was true?

For example, with:

vec1 = {1,1,3,1};
vec2 = {1,2,2,2};
test = vec1 > vec2;

I want to determine if "test" contains any truth (non-zero elements). In this case I want "test" to reduce to true, because there exists an element for which vec1 is greater than vec2 and hence an element in test containing -1.

Additionally, or alternatively, I want to quickly discover WHICH element passes the test. In this case, that would simply be index 2. Said another way, I want to find the first non-zero element.

int hasAnyTruth = ...; // should be non-zero. "bool" works too since C99
int whichTrue = ...; // should contain 2, because test[2] == -1

I imagine we could use a SIMD reduction-addition instruction (?) to sum everything in the vector into one number and compare that sum to 0, but I don't know how (or whether there is a faster way). I am guessing some form of argmax is necessary for the second question, but again, I don't know how to instruct GCC to use it on the vectors.

Peter Cordes
user1649948
  • `_mm_movemask_epi8()` – Mysticial Jul 23 '15 at 20:49
  • Wow, I like this. 1) Is it portable? 2) Any advantage over memcmp? 3) Does it work with 256-bit registers (AVX), or vectors with different numbers of elements? – user1649948 Jul 23 '15 at 20:56
  • It's more portable than GCC vector extensions. It's standardized by Intel, so it will work in every major compiler: GCC, Clang, MSVC, ICC, etc... https://software.intel.com/sites/landingpage/IntrinsicsGuide/ – Mysticial Jul 23 '15 at 20:59
  • There's an instruction for that on x86: `ptest`. – EOF Jul 23 '15 at 21:22
  • That's cool. I guess that's similar to GCC's __builtin_ia32_pmovmskb (v8qi)? Or can _mm_movemask_epi8 be used on GCC's vector extension types too like above? I might be compiling for arbitrary processors (gcc's __builtin_shuffle and other vector types can compile to Altivec, NEON, SSE, etc). Is it faster than memcmp? – user1649948 Jul 23 '15 at 21:29
  • I suspect the fastest way to implement `memcmp()` on an x86 with (at least) sse4_1 will use `ptest`. If you want to use it in gcc, it is available on x86 microarchitectures that support it as `__builtin_ia32_ptestc128/ptestnzc128/ptestz128/256`. – EOF Jul 23 '15 at 21:36
  • Nice. I'm kind of intrigued by the movemask thing, because it seems to both tell you of truth and WHERE the truth is in one number. Is there a version of this for AVX's 256-bit registers, or is it limited to 128-bit? (Oh, AVX2: int _mm256_movemask_epi8 (__m256i a)) – user1649948 Jul 23 '15 at 21:48
  • I should also mention I discovered reduction is a terrible idea for this sort of thing because of this: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/01/OpenCL-Optimization-Figure1.jpg – user1649948 Jul 24 '15 at 01:45
  • Speaking of ptest, what's the difference between ptest- c, nzc, and z? I just care whether everything is 0 or not, so which is fastest/most applicable? – user1649948 Jul 24 '15 at 01:51
  • Is this for x86 only? Because then it should be tagged with that, and yes, you probably should just use these: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ – Gábor Buella Oct 24 '15 at 13:17
  • But if not only for x86, then look for things among GCC builtins, or just write a simple loop testing each member of the vector -- hoping for GCC to optimize it -- look at the resulting assembly – Gábor Buella Oct 24 '15 at 13:19

4 Answers

3

Clang's vector extensions do a good job with an any function.

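#include <cstdint>

// Sizes inferred from the assembly below: four 64-bit lanes in a 32-byte vector.
#define SIMD_SIZE 32
#define VLI_SIZE  4
#define VDF_SIZE  4

#if defined(__clang__)
typedef int64_t vli __attribute__ ((ext_vector_type(VLI_SIZE)));
typedef double  vdf __attribute__ ((ext_vector_type(VDF_SIZE)));
#else
typedef int32_t vsi __attribute__ ((vector_size (SIMD_SIZE)));
typedef int64_t vli __attribute__ ((vector_size (SIMD_SIZE)));
#endif

// Returns true if any lane of x is non-zero.
static bool any(vli const & x) {
  for(int i=0; i<VLI_SIZE; i++) if(x[i]) return true;
  return false;
}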

Assembly

any(long __vector(4) const&): # @any(long __vector(4) const&)
  vmovdqa ymm0, ymmword ptr [rdi]
  vptest ymm0, ymm0
  setne al
  vzeroupper
  ret

Although pmovmskb might still be a better choice, ptest is still a huge improvement over what GCC does:

any(long __vector(4) const&):
  cmp QWORD PTR [rdi], 0
  jne .L5
  cmp QWORD PTR [rdi+8], 0
  jne .L5
  cmp QWORD PTR [rdi+16], 0
  jne .L5
  cmp QWORD PTR [rdi+24], 0
  setne al
  ret
.L5:
  mov eax, 1
  ret

GCC should fix this. Clang is not optimal for AVX512, though.

I would argue that any is a critical vector function, so compilers should either provide a builtin for it, as they do for shuffle (e.g. __builtin_shuffle for GCC and __builtin_shufflevector for Clang), or be smart enough to generate optimal code, as Clang does at least for SSE and AVX (but not for AVX512).
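
For the plain C case from the question, a minimal sketch of the same idea might look like the following. It assumes four 32-bit lanes in a 16-byte vector; the names `v4si`, `any_true` and `first_true` are made up for illustration. Whether the compiler turns the loops into a single `ptest` or `pmovmskb` depends on the compiler and flags, so it is worth checking the generated assembly.

```c
#include <stdint.h>

/* Four 32-bit lanes in a 16-byte vector (GCC/Clang vector extension).
   The type name v4si is an assumption made for this sketch. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Returns non-zero if any lane of a comparison mask is set.
   Comparison masks hold 0 (false) or -1 (true) in each lane,
   so OR-ing all lanes together is enough. */
static int any_true(v4si mask)
{
    int32_t acc = 0;
    for (int i = 0; i < 4; i++)
        acc |= mask[i];
    return acc != 0;
}

/* Returns the index of the first set lane, or -1 if no lane is set. */
static int first_true(v4si mask)
{
    for (int i = 0; i < 4; i++)
        if (mask[i])
            return i;
    return -1;
}
```

With vec1 = {1,1,3,1} and vec2 = {1,2,2,2} from the question, `any_true(vec1 > vec2)` is non-zero and `first_true(vec1 > vec2)` is 2.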

Z boson
  • +1 I ran outta up-votes when you first replied but nice answer! Sorry about the one I community-posted before copying Mysticial's answer from the comments. That was back at a time when I got all into trying to moderate and preserve "site posterity". I'd delete it except I can't until the answer is switched. –  Jan 22 '18 at 19:18
1

From Mysticial:

_mm_movemask_epi8()

It's more portable than GCC vector extensions. It's standardized by Intel, so it will work in every major compiler: GCC, Clang, MSVC, ICC, etc...

http://software.intel.com/sites/landingpage/IntrinsicsGuide
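
A rough sketch of how this could be combined with a GCC vector-extension comparison, assuming four 32-bit lanes and an x86 target with SSE2. The names `v4si`, `byte_mask`, `movemask_any` and `movemask_first` are invented for this example, and the non-SSE2 branch is only a plain-loop fallback so the code still compiles elsewhere. `_mm_movemask_epi8` yields one bit per byte, so each 32-bit lane contributes four bits, and dividing the bit position by 4 gives the lane index.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_movemask_epi8, __m128i */
#endif

/* Four 32-bit lanes in a 16-byte vector; the type name is an assumption. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Collects the top bit of every byte of the mask into a 16-bit integer.
   For comparison masks (lanes are 0 or -1) each lane yields 0x0 or 0xF. */
static int byte_mask(v4si m)
{
#if defined(__SSE2__)
    /* Same-size vector casts are bit reinterpretations in GCC and Clang. */
    return _mm_movemask_epi8((__m128i)m);
#else
    /* Portable fallback producing the same bit layout for 0 / -1 lanes. */
    int bits = 0;
    for (int i = 0; i < 4; i++)
        if (m[i] < 0)
            bits |= 0xF << (4 * i);
    return bits;
#endif
}

/* Non-zero if any lane is set. */
static int movemask_any(v4si m)
{
    return byte_mask(m) != 0;
}

/* Index of the first set lane, or -1 if none is set.
   __builtin_ctz is a GCC/Clang builtin and is undefined for 0,
   hence the explicit check. */
static int movemask_first(v4si m)
{
    int bits = byte_mask(m);
    return bits ? __builtin_ctz((unsigned)bits) / 4 : -1;
}
```

For the vectors in the question, `byte_mask` gives 0x0F00, so `movemask_first` returns 2. For 64-bit lanes the divisor would be 8 instead of 4.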

  • This is hardly more portable than GCC's vector extensions. E.g. It's useless on ARM. – Z boson Jan 17 '18 at 14:29
  • @Zboson I kinda have to refer to Mysticial for a response as I ended up just community posting his comment from the question as an answer for posterity as the question originally had no answers for a good while and seemed resolved through comments -- definitely not hardware-portable. –  Jan 17 '18 at 15:03
0

Here's what I ended up using in one case:

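#include <string.h>  /* memcmp */

/* True when every element of v1 equals the corresponding element of v2.
   Uses GNU statement expressions; v_d takes the type of the comparison
   result, which is an integer vector even for floating-point inputs. */
#define V_EQ(v1, v2) \
  ({ \
    __typeof__ ((v1) != (v2)) v_d = (v1) != (v2); \
    __typeof__ (v_d) v_0 = { 0 }; \
    memcmp (&v_d, &v_0, sizeof v_d) == 0; \
  })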

assert (V_EQ (v4ldblo, v4ldbli - 1));
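
The same memcmp idea can probably be turned around to answer the "is ANY element true" part of the question. A small sketch, assuming four 32-bit lanes; `v4si` and `memcmp_any` are names made up here:

```c
#include <stdint.h>
#include <string.h>   /* memcmp */

/* Four 32-bit lanes in a 16-byte vector; the type name is an assumption. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Non-zero if any byte of the comparison mask differs from zero,
   i.e. if at least one lane compared true. */
static int memcmp_any(v4si mask)
{
    const v4si zero = {0, 0, 0, 0};
    return memcmp(&mask, &zero, sizeof mask) != 0;
}
```

This only says whether some lane is set, not which one; compilers may or may not reduce the fixed-size memcmp to a single vector test.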
tschwinge
-1

We can use intrinsic functions for this; using them can make the code execute faster. Please refer to the link below.

Harish