SSE 4 popcount for 16 8-bit values?

Question

I have the following code which compiles with GCC using the flag -msse4 but the problem is that the pop count only gets the last four 8-bits of the converted __m128i type. Basically what I want is to count all 16 numbers inside the __m128i type but I'm not sure what intrinsic function call to make after creating the variable popA. Somehow popA has to be converted into an integer that contains all the 128-bits of information? I suppose theres _mm_cvtsi128_si64 and using a few shuffle few operations but my OS is 32-bit. Is there only the shuffle method and using _mm_cvtsi128_si32?

EDIT: If the shuffle method is the only option I need help implementing it for my 32-bit OS, please.

Heres the code.

#include <stdio.h>
#include <smmintrin.h>
#include <emmintrin.h>

int main(void)
{
    int A = 1;
    __m128i popA = _mm_set_epi8( A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A);

    unsigned int integer = _mm_cvtsi128_si32(popA);
    //long long LONG = _mm_cvtsi128_si64(popA);//my OS is 32-bits so no luck here

    printf("integer = %d\n", integer);
    int pop = _mm_popcnt_u32(integer);
    //int popLONG = _mm_popcnt_u64(LONG);
    printf("popcount = %d\n", pop);
    //printf("popcount LONG = %d\n", popLONG);

    return 0;
}

EDIT 2: This one finally runs (with GCC compiler flags -msse -msse2 -msse3 -msse4) although I'm not sure if the output for pop_count1() is correct.

Output: pop_count1(): 1799 1799 1799 1799 1799 1799 1799 1799

pop_count2():population count for each byte: 1 1 1 1 1 1 1 1 0 1 2 3 4 5 6 7

  #include <stdio.h>
#include <xmmintrin.h>
#include <emmintrin.h>
#include <mmintrin.h>
#include <stdint.h>
#include <tmmintrin.h>

void print128_num(__m128i var)
{
    uint16_t *val = (uint16_t*) &var;
    printf("pop_count1(): %i %i %i %i %i %i %i %i \n",
           val[0], val[1], val[2], val[3], val[4], val[5],
           val[6], val[7]);
}
static __m128i parallelPopcnt16bytes (__m128i xmm)//for pop_count2
{
    const __m128i mask4 = _mm_set1_epi8 (0x0F);
    const __m128i lookup = _mm_setr_epi8 (0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
   __m128i low, high, count;

   low = _mm_and_si128 (mask4, xmm);
   high = _mm_and_si128 (mask4, _mm_srli_epi16 (xmm, 4));
   count = _mm_add_epi8 (_mm_shuffle_epi8 (lookup, low), _mm_shuffle_epi8 (lookup, high));
   return count;
}
void pop_count1()
{
    int A = 1;
    __m128i in = _mm_set_epi8( A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A);
    __m128i bit0 = _mm_set1_epi8( 0x80 );
    __m128i mask0 = _mm_and_si128( in, bit0 );
    __m128i sum = _mm_cmpeq_epi8( mask0, _mm_setzero_si128() );

/* general pattern */
    __m128i bit1 = _mm_set1_epi8( 0x40 );
    __m128i mask1 = _mm_and_si128( in, bit1 );
    mask1 = _mm_cmpeq_epi8( mask1, _mm_setzero_si128() );
    sum = _mm_add_epi8( sum, mask1 );

/* next bit */
    __m128i bit2 = _mm_set1_epi8( 0x20 );
    __m128i mask2 = _mm_and_si128( in, bit2 );
    mask2 = _mm_cmpeq_epi8( mask2, _mm_setzero_si128() );
    sum = _mm_add_epi8( sum, mask2 );

    __m128i bit3 = _mm_set1_epi8( 0x10 );
    __m128i mask3 = _mm_and_si128( in, bit3 );
    mask3 = _mm_cmpeq_epi8( mask3, _mm_setzero_si128() );
    sum = _mm_add_epi8( sum, mask3 );

    __m128i bit4 = _mm_set1_epi8( 0x08 );
    __m128i mask4 = _mm_and_si128( in, bit4 );
    mask4 = _mm_cmpeq_epi8( mask4, _mm_setzero_si128() );
    sum = _mm_add_epi8( sum, mask4 );

    __m128i bit5 = _mm_set1_epi8( 0x04 );
    __m128i mask5 = _mm_and_si128( in, bit5 );
    mask5 = _mm_cmpeq_epi8( mask5, _mm_setzero_si128() );
    sum = _mm_add_epi8( sum, mask5 );

    __m128i bit6 = _mm_set1_epi8( 0x02 );
    __m128i mask6 = _mm_and_si128( in, bit6 );
    mask6 = _mm_cmpeq_epi8( mask6, _mm_setzero_si128() );
    sum = _mm_add_epi8( sum, mask6 );

    __m128i bit7 = _mm_set1_epi8( 0x01 );
    __m128i mask7 = _mm_and_si128( in, bit7 );
    mask7 = _mm_cmpeq_epi8( mask7, _mm_setzero_si128() );
    sum = _mm_add_epi8( sum, mask7 );

/* finish up */
    sum = _mm_sub_epi8( _mm_setzero_si128(), sum );

    print128_num(sum);
}
void pop_count2()
{
    int index;
    __m128i testVector = _mm_set_epi8 (1, 2, 4, 8, 16, 32, 64, 128, 0, 1, 3, 7, 15, 31, 63, 127);
    __m128i counts = parallelPopcnt16bytes (testVector);

    printf ("pop_count2():population count for each byte:");
    for (index = 15; index >= 0; index--)
        {
        uint8_t *bytes = (void *) &counts;
        printf (" %d", bytes [index]);
        }
    printf ("\n");
}
int main(void)
{
    pop_count1();
    pop_count2();

    return 0;
}

Do you want a single population count for the entire 128 bit vector or do you want 16 population counts, one for each 8 bit element ? — Paul R, Jul 08 '13 at 07:49
Which ever would make more sense to use efficiently. I now figure that using the on chip popcount is not great for a set of `int` and its also useless on a 32-bit OS to use 64-bit data. popcnt is not the swiss army knife of instructions so far in its infancy. Maybe by SSE 5 it will be something great. — pandoragami, Jul 08 '13 at 12:44
@user2555139 Sorry, it's `_mm_and()` not `_mm_andps`. And the sequence is 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01. Actually it doesn't matter what order you do them in, as long as you use each place-value exactly once. — Potatoswatter, Jul 08 '13 at 12:53
I'm still getting the same error with this line `__m128i mask0 = _mm_and( in, bit0 );` `popcount.c|10|error: incompatible types when initializing type '__m128i' using type 'int'`. I added the flags `-msse -msse2 -msse3 -msse4` and I'm using the headers `#include #include #include #include ` What else could it be? — pandoragami, Jul 08 '13 at 13:05
I searched all the related headers for `_mm_and` and they don't have one for `__m128i` So I guess there is no way to mask `int` of that type. — pandoragami, Jul 08 '13 at 13:18
@PaulR @Potatoswatter. I updated EDIT 2 above. It compiles now although the output of the sum is something I'm not sure of. Any suggestions. I also wondered if it was possible to use a shuffle method to swap the lower 32-bits (0-31) with the next one (32-63) and do a popcount and then swap the lower 64-bits (0-63) with the upper and repeat the same for 32-bits? The `popcnt` call can only see the rightmost 32-bits at a time so 96-bits are ignored during each operation. Not very efficient if you ask me. — pandoragami, Jul 08 '13 at 16:02

score 10 · Accepted Answer · 2013-07-09T13:17:44.090

10

SSE 4 popcount for 16 8-bit values can be done in parallel this way:

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

//----------------------------------------------------------------------------
//
// parallelPopcnt16bytes - find population count for 8-bit groups in xmm (16 groups)
//                         each byte of xmm result contains a value ranging from 0 to 8
//
static __m128i parallelPopcnt16bytes (__m128i xmm)
   {
    const __m128i mask4 = _mm_set1_epi8 (0x0F);
    const __m128i lookup = _mm_setr_epi8 (0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
   __m128i low, high, count;

   low = _mm_and_si128 (mask4, xmm);
   high = _mm_and_si128 (mask4, _mm_srli_epi16 (xmm, 4));
   count = _mm_add_epi8 (_mm_shuffle_epi8 (lookup, low), _mm_shuffle_epi8 (lookup, high));
   return count;
   }

//----------------------------------------------------------------------------

int main (void)
    {
    int index;
    __m128i testVector = _mm_set_epi8 (1, 2, 4, 8, 16, 32, 64, 128, 0, 1, 3, 7, 15, 31, 63, 127);
    __m128i counts = parallelPopcnt16bytes (testVector);

    printf ("population count for each byte:");
    for (index = 15; index >= 0; index--)
        {
        uint8_t *bytes = (void *) &counts;
        printf (" %d", bytes [index]);
        }
    printf ("\n");
    return 0;
    }

//----------------------------------------------------------------------------

edited Jul 09 '13 at 13:17

answered Jul 08 '13 at 16:05

This line `count = _mm_add_epi8 (_mm_shuffle_epi8 (lookup, low), _mm_shuffle_epi8 (lookup, high));` is giving me 2 errors. `error: incompatible type for argument 1 of '_mm_add_epi8'| error: incompatible type for argument 2 of '_mm_add_epi8'| `. I had to add `#include ` using GCC. – pandoragami Jul 08 '13 at 16:22
Did adding #include solve the problem? I only tested this code with Microsoft VS2012and mingw + gcc. – Jul 08 '13 at 18:39
1

Very good - I just coded up an almost identical routine but you beat me to it. Note that (on Intel CPUs at least) this only requires SSSE3 (for `PSHUFB`), not SSE4, i.e. `#include `. – Paul R Jul 08 '13 at 21:41
Nice and fast! Just in the comment, "each byte of xmm result contains a value ranging from 0 to *8*." – Potatoswatter Jul 08 '13 at 22:23
Thanks and thanks for the correction. I believe credit for this algirithm goes to Wojciech Mula. An AVX2 adaption is possible (http://notabs.org/blcutil/) and might even be faster that the popcnt instruction in some cases. – Jul 08 '13 at 23:04

Potatoswatter · Answer 2 · 2013-07-08T22:24:41.833

popcnt was introduced simultaneously with the SSE4.2 ISA extension but does not operate on SSE vector registers. You will need a separate instruction for each individual result.

Furthermore it's not defined for 8-bit operands. You will need to pad to 16 bits if you need a count for each individual byte.

You could sum 8 bytes at a time in 64-bit registers, but that doesn't sound like what you're after.

Reference: The SSE4 manual.

SSE2 solution.

I haven't tested this, but you could AND the SSE register with 0x80808080… to get a 16-byte mask of all 1's or all 0's. Repeat for all 8 bits in a byte, and sum the masks. Since all 1's represents -1 in two's complement, negate the 16 bytes, and you have all the results.

The AND and comparison operations should be able to run in parallel. The chain of additions is dependent but it should still run plenty fast, and it fits in 32 instructions. (Only 7 additions needed.)

/* init */
__m128i bit0 = _mm_set1_epi8( 0x80 );
__m128i mask0 = _mm_and_si128( in, bit0 );
__m128i sum = _mm_cmpeq_epi8( mask0, _mm_setzero_si128() );

/* general pattern */
__m128i bit1 = _mm_set1_epi8( 0x40 );
__m128i mask1 = _mm_and_si128( in, bit1 );
mask1 = _mm_cmpeq_epi8( mask1, _mm_setzero_si128() );
sum = _mm_add_epi8( sum, mask1 );

/* next bit */
__m128i bit2 = _mm_set1_epi8( 0x20 );
__m128i mask2 = _mm_and_si128( in, bit2 );
mask2 = _mm_cmpeq_epi8( mask2, _mm_setzero_si128() );
sum = _mm_add_epi8( sum, mask2 );

...

/* finish up */
sum = _mm_sub_epi8( _mm_setzero_si128(), sum );

@user2555139 Looping 8 times with `and`, compare to zero, and add, then negating the final result, should be able to produce all 16 results in 26 instructions, and fewer than 26 cycles since the loop iterations aren't dependent. — Potatoswatter, Jul 08 '13 at 04:01
Is there a possibility you could code that part for me please? I'm not exactly sure what intrinsics to use. Still new to this. — pandoragami, Jul 08 '13 at 04:11
@user2555139 see edit. Actually it's ~32 insns but who's counting. I still didn't even try compiling anything, please post if it works and how fast. — Potatoswatter, Jul 08 '13 at 04:58
I tried my best, see my post above for the error(s). Sorry I'm not competent enough to figure out the reasons for them. — pandoragami, Jul 08 '13 at 12:45

SSE 4 popcount for 16 8-bit values?

2 Answers2

SSE2 solution.

Linked