5

I've written an algorithm that does multiple single precision operations in parallel using Intel intrinsic functions. The result of each iteration of my algorithm is the number of nonzero entries in a single 256 bit vector (__m256).

For example:

 00000000  FFFFFFFF  00000000  00000000  00000000  FFFFFFFF  FFFFFFFF  FFFFFFFF

where the result of the iteration is 4.

What is the fastest way to count the number nonzero entries in the vector?

Currently I'm doing something like this:

float results[8];
_mm256_storeu_ps(results, result_vector);

int count = 0;
for (uint32_t idx = 0; idx < 8; ++idx)
{
    if (results[idx] != 0)
    {            
        ++count;
    }
}

This approach works just fine but I wonder if there is a more efficient way to do it, perhaps one that doesn't involve a store.

Paul R
  • 208,748
  • 37
  • 389
  • 560
Dave
  • 427
  • 4
  • 14
  • Are the non-zero entries guaranteed to be `0xFFFFFFFF`? If so, one idea would be to AND with a mask to isolate the least significant bit in each 32-bit section, then apply sum of absolute differences. – njuffa Nov 14 '17 at 17:23
  • 3
    Or just compare with zero (`_mm256_cmp_ps`), extract the bit mask (`_mm256_movemask_ps`) and use `popcnt` to count the bits ? Three instructions. – Paul R Nov 14 '17 at 17:37
  • 2
    If they're already 0 / 0xFFF... (i.e. the result of a compare) you can skip the `cmpps` step and just movemask / popcnt. – Peter Cordes Nov 14 '17 at 18:17
  • 1
    @PaulR I was unaware of the movemask intrinsic; that + popcnt works perfectly, thanks! – Dave Nov 14 '17 at 18:24
  • 1
    @PeterCordes yes, the results end up either 0 or all 0xF so no need for a compare – Dave Nov 14 '17 at 18:24
  • If the code in the question is itself within a loop, you may want to accumulate several results in vector registers using `vpsubd` into an accumulator (using sub on `0xFFFFFFFF` results in adding 1). This is the cheapest way on a per-vector basis since it's only 1 uop, 1 cycle latency and 3/cycle throughput. Every few iterations you can then use `vmovmsk` to collect the results into an integeger register and use `popcnt` or whatever. – BeeOnRope Nov 14 '17 at 23:50

1 Answers1

9

The hardware popcnt instruction is your best bet here. It's fast, and vmovmskps is also very efficient for giving you the high bit of each element as an integer bitmask. (compare / movemask is a standard way to branch on a vector compare result, or use it to index a lookup table of shuffle masks).

movemask / popcnt can be useful when left-packing, to increment a destination pointer by the number of elements you stored (after shuffling).

#include <immintrin.h>

// use only with compare-results.
// or to count elements with their sign-bit set
unsigned count_true(__m256 v) {
    unsigned mask = _mm256_movemask_ps(v);
    return _mm_popcnt_u32(mask);
}

popcnt has a separate feature-bit from AVX, so in theory there could be a CPU (or virtual machine) with AVX but not hardware popcnt, but in practice I wouldn't worry about it. (popcnt was introduced with SSE4.2, and AVX implies SSE4.2)


Even if you want the result in a vector register for something, vmovmskps / popcnt / movd is probably a better sequence than horizontally adding the 0 / -1 elements with integer adds. That would take 3 shuffle/add steps to reduce 8 elements down to 1, and you'd have a negative sum.

I mostly mention this because treating compare results as integer 0 / -1 is useful in some cases. e.g. to conditionally increment a vector of counters, cmpps / psubd does the trick. (0 + x = x, so the false elements are unchanged.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847