I need to check that all elements of a vector are non-zero. So far I have found the following solution. Is there a better way to do this? I am using gcc 4.8.2 on Linux/x86_64, with instructions up to SSE4.2.

typedef char ChrVect __attribute__((vector_size(16), aligned(16)));

inline bool testNonzero(ChrVect vect)
{
    const ChrVect vzero = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
    return (0 == (__int128_t)(vzero == vect));
}

Update: the code above compiles to the following assembly (when compiled as a non-inline function):

movdqa  %xmm0, -24(%rsp)
pxor    %xmm0, %xmm0
pcmpeqb -24(%rsp), %xmm0
movdqa  %xmm0, -24(%rsp)
movq    -24(%rsp), %rax
orq     -16(%rsp), %rax
sete    %al
ret
Daniel Frużyński
  • What architecture are you interested in? x86? POWER? ARM? ...? – Alexander Pozdneev Dec 08 '15 at 12:28
  • x86_64, instructions up to SSE4.2 – Daniel Frużyński Dec 08 '15 at 12:48
  • Why not `return ( (__int128_t)(vzero == vect) == 0 )` but `return (0 == (__int128_t)(vzero == vect))`? Maybe it is "modern"? – i486 Dec 08 '15 at 12:52
  • Have you checked to see what code this generates ? – Paul R Dec 08 '15 at 12:53
  • @i486: this is just a coding style; when you put the constant first, the compiler will complain if you mistakenly use = instead of ==. – Daniel Frużyński Dec 08 '15 at 13:00
  • @Paul R: I added generated code to question. – Daniel Frużyński Dec 08 '15 at 13:01
  • @DanielFrużyński: did you compile this with `-O3` ? – Paul R Dec 08 '15 at 13:02
  • @DanielFrużyński I know the reason, but it can be useful for long time Pascal/Delphi programmers. It is not normal to compare constant to variable - the logic is reversed. Also, most compilers have warning message to protect you from such wrong assignment - so, there is 0 risk. – i486 Dec 08 '15 at 13:03
  • Your code doesn't match your question. Your question wants to know whether all the elements are non-zero. But your code checks whether *any* element is non-zero. – Raymond Chen Dec 08 '15 at 15:17
  • @i486 yes and no. It took me a while to get accustomed to this style, and now it makes no difference to me when I read code. And warnings are not perfect - this assumes that warnings are enabled (I have already seen commands like gcc -Wall -Werror -w :) ) and that people are paying attention to them (hundreds of warnings do not encourage this). – Daniel Frużyński Dec 08 '15 at 16:47
  • @RaymondChen: there are two comparisons there. The first compares the input vector with the zero vector and produces a new vector holding the per-element comparison results. When all input elements are non-zero, the resulting vector is a zero vector. After casting it to __int128_t, the value is also zero, and true is returned. – Daniel Frużyński Dec 08 '15 at 16:49
  • @DanielFrużyński Ah, you're right. I missed that detail. Thanks. – Raymond Chen Dec 08 '15 at 20:30
  • https://godbolt.org/g/Ytz8gg – Z boson Jan 18 '18 at 13:59
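
The two-step logic described in the comments above can be spelled out lane by lane. This is only a minimal sketch, assuming GCC or Clang vector extensions and C++11 (for `auto`); `testNonzeroExplained` is a hypothetical name, and the per-lane loop stands in for the `__int128_t` cast purely for illustration, not for speed.

```cpp
#include <cstring>

typedef char ChrVect __attribute__((vector_size(16), aligned(16)));

// Hypothetical name: same result as testNonzero, with both steps made explicit.
inline bool testNonzeroExplained(ChrVect vect)
{
    const ChrVect vzero = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};

    // Step 1: element-wise compare. Each lane of eq is -1 (all bits set)
    // where vect holds a zero byte, and 0 where it does not.
    auto eq = (vzero == vect);

    // Step 2: the original casts eq to __int128_t and compares it with 0.
    // Here the lanes are inspected one by one instead.
    signed char lanes[16];
    std::memcpy(lanes, &eq, sizeof lanes);
    for (int i = 0; i < 16; ++i) {
        if (lanes[i] != 0) {
            return false;   // lane i of vect was zero
        }
    }
    return true;            // no lane compared equal to zero
}
```

A vector with a single zero byte makes one lane of `eq` non-zero, so the function returns false; only a vector with no zero bytes returns true.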

1 Answer

With straight SSE intrinsics you might do it like this:

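#include <stdint.h>
#include <emmintrin.h>  // SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8
#ifdef __SSE4_1__
#include <smmintrin.h>  // SSE4.1: _mm_testz_si128
#endif

inline bool testNonzero(__m128i v)
{
    // each byte of vcmp is 0xFF where v has a zero byte, 0x00 elsewhere
    __m128i vcmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());
#ifdef __SSE4_1__  // for SSE 4.1 and later use PTEST
    return _mm_testz_si128(vcmp, vcmp);
#else              // for older SSE use PMOVMSKB
    uint32_t mask = _mm_movemask_epi8(vcmp);
    return (mask == 0);
#endif
}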

I suggest looking at what your compiler currently generates for your existing code and then compare it with this version using intrinsics and see if there is any significant difference.

With SSE3 (clang -O3 -msse3) I get the following for the above function:

pxor    %xmm1, %xmm1
pcmpeqb %xmm1, %xmm0
pmovmskb    %xmm0, %ecx
testl   %ecx, %ecx

The SSE4 version (clang -O3 -msse4.1) produces:

pxor    %xmm1, %xmm1
pcmpeqb %xmm1, %xmm0
ptest   %xmm0, %xmm0

Note that the zeroing of xmm1 will typically be hoisted out of any loop containing this function, so the above sequences should be reduced by one instruction when used inside a loop.
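
As a rough illustration of the loop case, here is a minimal sketch that applies the SSE2 (`pmovmskb`) variant to a whole buffer. The helper name `allBytesNonzero` and the requirement that the length be a multiple of 16 are assumptions made for this example, not part of the question. The zero vector is created once before the loop, which is what a compiler would normally arrange anyway.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical helper: returns true if every byte of buf[0..n) is non-zero.
// Assumes n is a multiple of 16; buf does not need to be aligned.
inline bool allBytesNonzero(const unsigned char* buf, std::size_t n)
{
    const __m128i vzero = _mm_setzero_si128();  // loop-invariant zero vector
    for (std::size_t i = 0; i < n; i += 16) {
        // unaligned 16-byte load
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
        // 0xFF in each byte position where v holds a zero byte
        __m128i vcmp = _mm_cmpeq_epi8(v, vzero);
        // one mask bit per byte; any set bit means a zero byte was found
        if (_mm_movemask_epi8(vcmp) != 0) {
            return false;
        }
    }
    return true;
}
```

Inside the loop only the compare, the mask extraction, and the branch remain per 16-byte block.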

Paul R
  • Thanks. I benchmarked it against my original code and it is faster. When I used it as a non-inline function, the speed increase was quite small - about 2%. When the function was inlined, your version was about 26% faster. – Daniel Frużyński Dec 08 '15 at 14:16
  • @DanielFrużyński: you might want to try the updated version above with harold's suggested change - I think this may save a cycle of latency on some CPUs. – Paul R Dec 08 '15 at 14:42
  • Would `ptest` be of any use here? – Mysticial Dec 08 '15 at 15:18
  • @PaulR I already did this, results are almost the same for both versions. – Daniel Frużyński Dec 08 '15 at 16:53
  • @Mysticial: `ptest` doesn't always seem to be a win, even though it saves an instruction - I've added it as an alternative for SSE4 and later anyway, just for completeness. – Paul R Dec 08 '15 at 17:29
  • @Mysticial: `ptest` would be great if it was a single uop instruction, but it's not. `ptest` / `setcc` is 3 uops total. `pmovmskb` / `test` / `setcc` is also 3 uops. `pmovmskb` / `test/jcc` is 2 uops, because the test/jcc macro-fuses (Intel and AMD). I haven't experimented much with `ptest` in real microbenchmarks, but at least in uop-counting static analysis, it never wins when using it on a `pcmp*` result. It's most likely to be useful with two different inputs, not testing something against itself. Anyway, I'd suggest *not* using it for this, even if available. – Peter Cordes May 27 '16 at 17:45