4

I have a piece of code which does a comparison on array elements if they are > than a value, in SIMD-ish fashion:

void sse(uint *dst, size_t N)
{
    const __m128i condition = _mm_set1_epi32(2);

    for (uint i = 0; i < N; i += 4)
    {
        __m128i v = _mm_load_si128((__m128i *)&dst[i]);
        __m128i cmp = _mm_cmpgt_epi32(v, condition);
        v = _mm_and_si128(v, cmp);
        _mm_store_si128((__m128i *)&dst[i], v);
    }
}

Now, after the comparison, before anding elements - _mm_and_si128, I want to count elements which passed the condition, i.e. those set to '1', and store the sum in an int variable. How can I do that in SIMD? For instance, if out of four only two passed the condition, have this int var = 2.

Paul R
  • 208,748
  • 37
  • 389
  • 560
Nik Kovac
  • 185
  • 5
  • 12
  • See also [my answer on another question](http://stackoverflow.com/a/35270026/224132) comparing different ways of doing a horizontal sum. – Peter Cordes Apr 23 '16 at 19:09

1 Answers1

7

Typically you would keep a vector count throughout the loop and then just sum the elements of the vector when the loop terminates, e.g.

#include <emmintrin.h>

uint32_t sse(const uint32_t *dst, const size_t N)
{
    const __m128i condition = _mm_set1_epi32(2);
    __m128i vcount = _mm_set1_epi32(0);
    uint32_t count = 0;

    for (size_t i = 0; i < N; i += 4)
    {
        __m128i v = _mm_load_si128((__m128i *)&dst[i]);
        __m128i vcmp = _mm_cmpgt_epi32(v, condition);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i *)&dst[i], v);
        vcount = _mm_add_epi32(vcount, vcmp); // accumulate (negative) counts
    }
    // ... sum vcount here and store in count (see below) ...
    return count;
}

Note that we are treating each mask element as an int, i.e. 0 or -1, and so we are accumulating a sum which is the negative of the actual sum.

Efficiency of the final vcount summation is not normally too important as it is performed only once for the entire loop, so provided N is reasonably large it does not matter how many instructions are required (within reason).

There are several ways of handling the final summation, e.g. you can use _mm_movemask_epi8 (SSE2) to extract a 16 bit mask and work with that, or you can use _mm_hadd_epi32 (SSSE3) to calculate a horizontal sum on the vector and then extract the sum as a scalar, e.g.

SSE2:

#include <emmintrin.h>

int16_t mask = _mm_movemask_epi8(vcount);       // extract 16 bit mask
count = !!(mask & 0x0001) +                     // count non-zero 32 bit elements
        !!(mask & 0x0010) + 
        !!(mask & 0x0100) + 
        !!(mask & 0x1000);

SSSE3:

#include <tmmintrin.h>

vcount = _mm_hadd_epi32(vcount, vcount);        // horizontal sum of 4 elements
vcount = _mm_hadd_epi32(vcount, vcount);
count = - ((_mm_extract_epi16(vcount, 1) << 16) // extract (and negate) sum to
          | _mm_extract_epi16(vcount, 1));      // get total (positive) count

SSE4.2:

#include <smmintrin.h>

vcount = _mm_hadd_epi32(vcount, vcount);        // horizontal sum of 4 elements
vcount = _mm_hadd_epi32(vcount, vcount);
count = - _mm_extract_epi32(vcount, 0);         // extract (and negate) sum to
                                                // get total (positive) count

Here is a complete working version with test harness for the SSE4.2 version:

#include <stdio.h>
#include <stdint.h>
#include <smmintrin.h>

uint32_t sse(const uint32_t *dst, const size_t N)
{
    const __m128i condition = _mm_set1_epi32(2);
    __m128i vcount = _mm_set1_epi32(0);
    uint32_t count = 0;

    for (size_t i = 0; i < N; i += 4)
    {
        __m128i v = _mm_load_si128((__m128i *)&dst[i]);
        __m128i vcmp = _mm_cmpgt_epi32(v, condition);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i *)&dst[i], v);
        vcount = _mm_add_epi32(vcount, vcmp); // accumulate (negative) counts
    }

    vcount = _mm_hadd_epi32(vcount, vcount);    // horizontal sum of 4 elements
    vcount = _mm_hadd_epi32(vcount, vcount);
    count = - _mm_extract_epi32(vcount, 0);     // extract (and negate) sum to
                                                // get total (positive) count

    return count;
}

int main(void)
{
    uint32_t a[4] __attribute__ ((aligned(16))) = { 1, 2, 3, 4 };
    uint32_t count;

    count = sse(a, 4);

    printf("a = %u %u %u %u \n", a[0], a[1], a[2], a[3]);
    printf("count = %u\n", count);

    return 0;
}

$ gcc -Wall -std=c99 -msse4 sse_count.c -o sse_count
$ ./sse_count
a = 0 0 3 4 
count = 2
Paul R
  • 208,748
  • 37
  • 389
  • 560
  • thanks for your detailed response. Why do we have though _hadd two times on the same variable vcount? So the expression 'vcount = _mm_hadd_epi32(vcount, vcount);' appears two times, the same (same args)? this code though works only on icc but not on gcc, or am I failing to include the proper header? Currently I have included and compiling with option -msse4? – Nik Kovac Jun 17 '13 at 09:25
  • If you read the description of `_mm_hadd_epi32` in the [Intel Instrinsics Guide](http://software.intel.com/en-us/articles/intel-intrinsics-guide) you'll see that it adds adjacent *pairs* of elements, so we need to do it twice to get the sum of all 4 elements. – Paul R Jun 17 '13 at 09:28
  • this methond is producing very large meaningless number.. as example I tried it with uint arr[] = {1,2,3,4}; uint c = sse2(arr, 4);, but gives a very large number.. For the gcc I get "_mm_extract_epi32’ was not declared in this scope" – Nik Kovac Jun 17 '13 at 09:34
  • can you also correct the line: vcount = _mm_add_epi32(vcount, vcmp); it should be 'v' and not 'vcmp'. It might be useful for other readers in future. – Nik Kovac Jun 17 '13 at 09:36
  • 1
    I've added a complete working version with test harness above which compiles and runs as expected with gcc. – Paul R Jun 17 '13 at 09:47
  • 1
    thats perfect Paul thanks for your very clean explanation. The problem with g++ went away when I added the flag -msse4.1 – Nik Kovac Jun 17 '13 at 09:54
  • 1
    The SSSE3 version should just use `movd` (`int _mm_cvtsi128_si32 (__m128i a)`), which is what `_mm_extract_epi32(vcount, 0)` compiles to anyway. You absolutely don't want the compiler to emit two `pextrw` instructions and shift/OR them together! Also, see [my answer on another question](http://stackoverflow.com/a/35270026/224132) for some other ways to do a hsum: more insns but fewer uops, which might be a win esp. with a uop cache. Compilers can and do transform one hsum idiom to another, though. – Peter Cordes Apr 23 '16 at 19:09