When using the GCC vector extensions for C, how can I check that all the values in a vector are zero?

For instance:

#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui*
foo(v8ui *mem) {
    v8ui v;
    for ( v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
          v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
          mem++)
        v &= *(mem);

    return mem;
}

SSE4.1 has the PTEST instruction (VPTEST in its AVX form), which can run a test like the one used as the for condition, but the code generated by GCC just unpacks the vector and checks the elements one by one:

.L2:
        vandps  (%rax), %ymm1, %ymm1
        vmovdqa %xmm1, %xmm0
        addq    $32, %rax
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vextractf128    $0x1, %ymm1, %xmm0
        vmovd   %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $1, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $2, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vpextrd $3, %xmm0, %edx
        testl   %edx, %edx
        jne     .L2
        vzeroupper
        ret

Is there any way to get GCC to generate an efficient test for that without reverting to using intrinsics?

Update: For reference, code using an unportable GCC builtin for (V)PTEST:

typedef uint32_t v8ui __attribute__ ((vector_size (32)));
typedef long long int v4si __attribute__ ((vector_size (32)));

const v8ui ones = { 1, 1, 1, 1, 1, 1, 1, 1 };

v8ui*
foo(v8ui *mem) {
    v8ui v;
    for ( v = ones;
          !__builtin_ia32_ptestz256((v4si)v,
                                    (v4si)ones);
          mem++)
        v &= *(mem);

    return mem;
}
salva
  • there's no way to get gcc to use pretty much any instruction, and if you do find a way, it probably won't work on other optimization levels or other versions of gcc. worse yet, tricking the compiler to emit a specific instruction essentially pigeonholes your program to only work (performance-wise) on a single compiler. is that really any more portable than intrinsics or asm? – Steve Cox Apr 06 '15 at 14:04
  • also of note, a ptest would never be equivalent to v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7] because short circuit evaluation requires a jump after every individual boolean expression – Steve Cox Apr 06 '15 at 14:06
  • @SteveCox, maybe my wording was not clear, my aim is to get GCC to generate efficient machine code. PTEST is just one way. – salva Apr 06 '15 at 14:09
  • `v[0] | v[1] | v[2] | v[3] | v[4] | v[5] | v[6] | v[7]` will be faster because it's branch-free, but still not as fast as the actual intrinsic – Steve Cox Apr 06 '15 at 14:12
  • @SteveCox, in this case they are obviously equivalent as none of the conditions has side effects. Anyway, that's missing the point of the question. I just want to know if that kind of test could be expressed in a way that would get GCC to generate efficient code! – salva Apr 06 '15 at 14:12
  • no they're not equivalent still, if `v[0]!=0` none of the other tests are allowed to happen – Steve Cox Apr 06 '15 at 14:13
  • @SteveCox: Again, those are side-effect free tests. Generating code that short-circuits them or not is up to the compiler. It may even reorder them! – salva Apr 06 '15 at 14:20
  • patently false. the compiler has no leeway to reorder those tests. `v[0]==0` could imply that `v+1` is an invalid memory address say for oh i don't know C STRINGS. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf read 6.5.14 (in particular statement 4) – Steve Cox Apr 06 '15 at 14:31
  • @SteveCox: the compiler knows `v` is in a register. No bad memory address errors are possible! – salva Apr 06 '15 at 14:35
  • @SteveCox, if `v+1` is invalid, isn't dereferencing it undefined behavior? In which case the compiler isn't required to do anything. – Samuel Edwin Ward Apr 06 '15 at 15:04
  • @SamuelEdwinWard: here `v` is not an array or a pointer. See [GCC Vector Extensions](https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). – salva Apr 06 '15 at 15:09
  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56829 – Marc Glisse Apr 07 '15 at 19:33
  • `__builtin_ia32_ptestz256` is not portable across compilers but `_mm256_testz_si256` is for x86 code. – Z boson Apr 15 '15 at 11:40

3 Answers

gcc 4.9.2 -O3 -mavx2 (in 64bit mode) didn't realize it could use ptest for this, with either || or |.

The | version extracts the vector elements with vmovd and vpextrd, and combines them with seven `or` instructions between 32-bit registers. So it's pretty bad, and doesn't take advantage of any simplifications that will still produce the same logical truth value.

The || version is just as bad, and does the same extract-an-element-at-a-time, but does a test / jne for every one.

So at this point, you can't count on GCC recognizing tests like this and doing anything remotely efficient. (pcmpeq / movmsk / test is another sequence that wouldn't be bad, but gcc doesn't generate that either.)
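
For reference, the two sequences mentioned above could be written by hand with intrinsics roughly as follows. This is only a sketch: the function names all_zero_ptest and all_zero_movemask are made up for illustration, and the target attributes are an assumption so that the snippet builds without -mavx2 on the command line (running it still needs a CPU with AVX2).

```c
#include <immintrin.h>
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

/* vptest route (hypothetical helper name): _mm256_testz_si256(a, b)
   returns 1 when (a & b) == 0, so testing v against itself asks
   whether every bit of v is zero. Needs AVX. */
__attribute__ ((target ("avx")))
static inline int
all_zero_ptest(const v8ui *p)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    return _mm256_testz_si256(v, v);
}

/* vpcmpeqd / vpmovmskb / compare route (hypothetical helper name):
   compare each 32-bit lane with zero, collect one bit per byte, and
   check that all 32 mask bits are set. Needs AVX2. */
__attribute__ ((target ("avx2")))
static inline int
all_zero_movemask(const v8ui *p)
{
    __m256i v  = _mm256_loadu_si256((const __m256i *)p);
    __m256i eq = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
    return _mm256_movemask_epi8(eq) == -1;
}
```

Both helpers take a pointer rather than a vector by value, which sidesteps ABI warnings when the calling code is not itself compiled with AVX enabled.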

Peter Cordes

Wouldn't vptest help? If you are looking at performance, sometimes you'll be surprised by what the native type can provide. Here is some code that uses vanilla memcmp() and also the vptest instruction (employed via the corresponding intrinsic). I did not time the functions.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <immintrin.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui*
foo1(v8ui *mem)
{   
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };

    if (memcmp(mem, &v, sizeof (v8ui)) == 0) {
            printf("Ones\n");
    } else {
            printf("NOT Ones\n");
    }

    return mem;
}

v8ui*
foo2(v8ui *mem)
{   
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
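    __m256i a, b, diff;

    a = _mm256_loadu_si256((__m256i *)(&v));
    /* Load the vector that mem points to, not the bytes of the pointer itself. */
    b = _mm256_loadu_si256((__m256i *)mem);

    /* vptest reports whether (x & y) == 0, so XOR first: diff is all-zero
       exactly when a and b are bit-for-bit equal. */
    diff = _mm256_xor_si256(a, b);
    if (_mm256_testz_si256(diff, diff)) {
            printf("Ones\n");
    } else {
            printf("NOT Ones\n");
    }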

    return mem;
}

int
main()
{
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
    foo1(&v);
    foo2(&v);
}

Compile flags:

gcc -mavx2 foo.c

Doh! Only now did I see that you wanted to get GCC to generate the vptest instruction without using the intrinsics. I'll leave the code around anyway.
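
For something that stays within the vector extensions, one option is to view the same 256 bits as four 64-bit lanes and OR them together without branches. A minimal sketch, assuming a helper type v4du and a function all_zero that are not in the question; it reads four elements instead of eight, but there is no guarantee that GCC will turn it into vptest.

```c
#include <stdint.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));
/* Hypothetical helper type: the same 32 bytes as four 64-bit lanes. */
typedef uint64_t v4du __attribute__ ((vector_size (32)));

/* Returns 1 when every element of *p is zero, 0 otherwise.
   all_zero is an illustrative name, not something from the question. */
static inline int
all_zero(const v8ui *p)
{
    v4du q = (v4du)*p;   /* reinterpret the bits, no value conversion */
    return (q[0] | q[1] | q[2] | q[3]) == 0;
}
```

In the loop from the question this would be used as the condition `!all_zero(&v)`.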

pavan

If the compiler isn't optimal enough to produce an optimisation automatically, you have three options:

  • Get a new compiler.
  • Produce the optimisation manually (eg. using intrinsics such as in your test and the other answer).
  • Modify the compiler to produce the optimisation automatically.

You've pretty much excluded the first option automatically by using gcc extensions, though llvm/clang might support these extensions for you.

You've excluded the second option quite blatantly.

The third option seems like your best option to me. gcc is open source, so you can make (and commit) your own changes to it. If you can modify gcc to produce this optimisation automatically (ideally from 100% standard C), then you'll not only achieve your goal of producing this optimisation without introducing crud into your program, but you'll also save countless manual optimisations (especially the non-standard ones that lock you into using a particular compiler) in the future.

autistic