
I'm having trouble getting g++ 5.4 to vectorize a comparison. Basically, I want to compare 4 unsigned ints using vector instructions. My first approach was straightforward:

bool compare(unsigned int const pX[4]) {
    bool c1 = (pX[0] < 1);
    bool c2 = (pX[1] < 2);
    bool c3 = (pX[2] < 3);
    bool c4 = (pX[3] < 4);
    return c1 && c2 && c3 && c4;
}

Compiling with g++ -std=c++11 -Wall -O3 -funroll-loops -march=native -mtune=native -ftree-vectorize -msse -msse2 -ffast-math -fopt-info-vec-missed told me that it could not vectorize the comparison, apparently due to misaligned data:

main.cpp:5:17: note: not vectorized: failed to find SLP opportunities in basic block.
main.cpp:5:17: note: misalign = 0 bytes of ref MEM[(const unsigned int *)&x]
main.cpp:5:17: note: misalign = 4 bytes of ref MEM[(const unsigned int *)&x + 4B]
main.cpp:5:17: note: misalign = 8 bytes of ref MEM[(const unsigned int *)&x + 8B]
main.cpp:5:17: note: misalign = 12 bytes of ref MEM[(const unsigned int *)&x + 12B]

Thus my second attempt was to tell g++ to align the data and use a temporary array:

bool compare(unsigned int const pX[4] ) {
    unsigned int temp[4] __attribute__ ((aligned(16)));
    temp[0] = pX[0];
    temp[1] = pX[1];
    temp[2] = pX[2];
    temp[3] = pX[3];

    bool c1 = (temp[0] < 1);
    bool c2 = (temp[1] < 2);
    bool c3 = (temp[2] < 3);
    bool c4 = (temp[3] < 4); 
    return c1 && c2 && c3 && c4;
}

However, the output is the same. My CPU supports AVX2, and the Intel Intrinsics Guide lists e.g. _mm256_cmpgt_epi8/16/32/64 for comparisons. Any idea how to tell g++ to use these?

user1228633
  • Not sure if there is a portable way to do it, but if you're just looking to see whether all the `bool`s are set or not, there are [intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/) that will tell you if they are all false via bit counts etc. [Intel even has an example](https://software.intel.com/en-us/blogs/2013/05/17/processing-arrays-of-bits-with-intel-advanced-vector-extensions-2-intel-avx2) – Mgetz Dec 06 '16 at 19:26
  • There is no 32 bit unsigned compare in SSE/AVX - try it with signed. – Paul R Dec 06 '16 at 19:43
  • AVX2 requires 32 byte alignment – Sean Bright Dec 06 '16 at 19:43
  • Using ``bool compare(signed int const pX[4] ) { signed int temp[4] __attribute__ ((aligned(32))); temp[0] = pX[0]; temp[1] = pX[1]; temp[2] = pX[2]; temp[3] = pX[3]; bool c1 = (temp[0] < 1); bool c2 = (temp[1] < 2); bool c3 = (temp[2] < 3); bool c4 = (temp[3] < 4); return c1 && c2 && c3 && c4; }`` Same problem. – user1228633 Dec 06 '16 at 20:39
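For reference, here is a minimal sketch of how the comparison from the question could be written by hand with SSE2 intrinsics instead of relying on the auto-vectorizer. It works around the missing unsigned 32-bit compare by flipping the sign bit of both operands, so that a signed compare gives the unsigned ordering. The function name compare_sse2 is made up for illustration, and the unaligned load is an assumption about how pX is stored; this has not been checked against g++ 5.4 specifically.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <limits>

// Hypothetical helper: returns true if pX[0] < 1, pX[1] < 2, pX[2] < 3 and pX[3] < 4.
bool compare_sse2(unsigned int const pX[4]) {
    // Per-lane limits, lane 0 first.
    const __m128i limits = _mm_setr_epi32(1, 2, 3, 4);
    // 0x80000000 in every lane; XOR with it flips the sign bit.
    const __m128i bias = _mm_set1_epi32(std::numeric_limits<int>::min());

    // Unaligned load, so pX needs no particular alignment (assumption).
    const __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pX));

    // Flipping the sign bit on both sides maps unsigned order onto signed order.
    const __m128i xs = _mm_xor_si128(x, bias);
    const __m128i ls = _mm_xor_si128(limits, bias);

    // Each lane becomes all ones where xs < ls, all zeros otherwise.
    const __m128i lt = _mm_cmplt_epi32(xs, ls);

    // One mask bit per byte: 0xFFFF means all four lanes compared true.
    return _mm_movemask_epi8(lt) == 0xFFFF;
}
```

If the inputs are known to fit in a signed int, the two XORs can be dropped and the signed compare used directly.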

1 Answer

Okay, apparently the compiler does not like "unrolled loops". This works for me:

bool compare(signed int const pX[8]) {
    signed int const w[] __attribute__((aligned(32))) = {1,2,3,4,5,6,7,8};
    signed int out[8] __attribute__((aligned(32)));

    for (unsigned int i = 0; i < 8; ++i) {
        out[i] = (pX[i] <= w[i]);
    }

    bool temp = true;
    for (unsigned int i = 0; i < 8; ++i) {
        temp = temp && out[i];
        if (!temp) {
            return false;
        }
    }
    return true;
}

Please note that out is also a signed int. Now I just need a fast way to combine the results saved in out.
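One possible way to combine the results, sketched here as an assumption rather than something measured: drop the early return and fold every comparison into a single accumulator with a bitwise AND. Without the data-dependent exit the loop is a plain reduction, which the vectorizer may be able to handle; whether g++ 5.4 actually vectorizes it should be checked with -fopt-info-vec. The name compare_all is hypothetical.

```cpp
// Hypothetical variant: true if pX[i] <= w[i] for all eight elements.
bool compare_all(signed int const pX[8]) {
    static const signed int w[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    // Accumulate with bitwise AND instead of && so there is no
    // short-circuit branch inside the loop.
    unsigned int all = 1;
    for (unsigned int i = 0; i < 8; ++i) {
        all &= static_cast<unsigned int>(pX[i] <= w[i]);
    }
    return all != 0;
}
```

The trade-off is that all eight elements are always compared, even when the first one already fails.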

user1228633
  • I also find unrolled loops are problematic for the compiler. An #omp pragma on the fast index should vectorize, and you may need to sum into deep bit depth sum. Another approach is a union where the 2D[n,m] is co-represented as a 1D[n*m] and then that is naturally easy for the compiler. – Holmz Dec 07 '16 at 21:13