I've written an algorithm that performs multiple single-precision operations in parallel using Intel intrinsic functions. The result of each iteration of my algorithm is the number of nonzero entries in a single 256-bit vector (__m256).
For example:
00000000 FFFFFFFF 00000000 00000000 00000000 FFFFFFFF FFFFFFFF FFFFFFFF
where the result of the iteration is 4.
What is the fastest way to count the number of nonzero entries in the vector?
Currently I'm doing something like this:
    float results[8];
    _mm256_storeu_ps(results, result_vector);
    int count = 0;
    for (uint32_t idx = 0; idx < 8; ++idx)
    {
        if (results[idx] != 0)
        {
            ++count;
        }
    }
This approach works just fine but I wonder if there is a more efficient way to do it, perhaps one that doesn't involve a store.
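One store-free direction I have been considering, as a rough sketch rather than a measured solution: compare each lane against zero, collect the sign bits with _mm256_movemask_ps, and popcount the resulting 8-bit mask. The helper names count_nonzero_lanes and count_nonzero_from_bits are made up for illustration, and the code assumes GCC or Clang (for __attribute__((target("avx"))) and __builtin_popcount) on an AVX-capable CPU.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): counts the lanes of v that
 * compare not-equal to 0.0f. _CMP_NEQ_UQ is an unordered compare, so a
 * NaN bit pattern such as 0xFFFFFFFF counts as nonzero, matching the
 * scalar `results[idx] != 0` test. Assumes GCC/Clang and AVX hardware. */
__attribute__((target("avx")))
static inline int count_nonzero_lanes(__m256 v)
{
    /* Each lane becomes all-ones if it is nonzero, all-zeros otherwise. */
    __m256 nonzero = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_NEQ_UQ);
    /* Gather the sign bit of every lane into the low 8 bits of an int. */
    int mask = _mm256_movemask_ps(nonzero);
    /* The number of set bits is the number of nonzero lanes. */
    return __builtin_popcount((unsigned)mask);
}

/* Hypothetical wrapper for experimenting: loads eight raw 32-bit lane
 * patterns from memory and counts the nonzero ones. */
__attribute__((target("avx")))
int count_nonzero_from_bits(const uint32_t bits[8])
{
    __m256i raw = _mm256_loadu_si256((const __m256i *)bits);
    return count_nonzero_lanes(_mm256_castsi256_ps(raw));
}
```

If every lane is guaranteed to be either 00000000 or FFFFFFFF, as in the example above, the compare could presumably be dropped and _mm256_movemask_ps applied to the vector directly, since the sign bit alone then tells the two cases apart. I have not benchmarked either variant against the store-and-loop version.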