It is easy to find examples online of how to deal with conditional branching in SIMD C++ code by using comparison intrinsics and masking. What I don't quite get, however, is the most efficient way (taking special care with memory allocation and cache misses) to conditionally return only part of the results generated by a SIMD calculation instead of all of them.
For instance, suppose we have two float containers, a and b, each of size N, and, to give a simple example, we just want a function that iterates over their values and sums them together. However, the function has to return only the results that are greater than zero.
Let's start by including what we need:
#include <vector>
#include <random>
#include <ctime>        // time()
#include <immintrin.h>  // SSE/AVX intrinsics
using namespace std;
Now, suppose that these are the two containers a and b of size N:
const int N = 10000;
vector<float> a(N);
vector<float> b(N);
default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(-1.0f, 1.0f);
for(int i = 0; i<N; i++)
{
    a[i] = diceroll(randomGenerator);
    b[i] = diceroll(randomGenerator);
}
Next, a simple SIMD code for our function that just sums a and b together and returns all results could be:
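vector<float> ourfunction( const vector<float>& a, const vector<float>& b )
{
    const int n = static_cast<int>(a.size());
    vector<float> results(n);
    const int alignedN = n - n % 4;
    for (int i = 0; i < alignedN; i+=4) {
        // Unaligned loads/stores: std::vector does not guarantee 16-byte alignment.
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&results[i], _mm_add_ps(va, vb));
    }
    // Scalar tail for the remaining n % 4 elements.
    for (int i = alignedN; i < n; i++)
        results[i] = a[i] + b[i];
    return results;
}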
Now, what if we want the function to return only the results that ended up being greater than zero? Sure, I see that we could just iterate over the vector<float> results and delete its elements that are less than or equal to zero, or copy its elements that are greater than zero to a new temporary container and then return that one.
But these seem to be really cumbersome solutions, and they are certainly quite inefficient. Thus, my question: what is the correct or most efficient way to do that? That is, how do I retrieve only some of the results of a SIMD calculation instead of all of them?
I'm interested in instruction sets from SSE2 to AVX2 (or even future stuff like AVX512), since CPU-dispatching is an option for my use-case.
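For reference, here is a minimal sketch of the "copy only the positive results" baseline described above, fused into the SIMD loop so that no full-size intermediate vector is written. It uses only SSE intrinsics; the function name sum_positive is made up for this illustration, and I'm assuming unaligned loads because std::vector gives no 16-byte alignment guarantee. This is meant as a starting point to compare against, not as the efficient solution I'm asking for.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <vector>

// Hypothetical helper (name chosen for this sketch): adds a and b
// element-wise and keeps only the sums that are strictly greater than zero.
std::vector<float> sum_positive(const std::vector<float>& a,
                                const std::vector<float>& b)
{
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<float> out;
    out.reserve(n);  // single allocation sized for the worst case (all kept)

    const __m128 zero = _mm_setzero_ps();
    const std::size_t alignedN = n - n % 4;
    std::size_t i = 0;
    for (; i < alignedN; i += 4) {
        const __m128 sum = _mm_add_ps(_mm_loadu_ps(&a[i]), _mm_loadu_ps(&b[i]));
        // Bit k of mask is set when lane k of sum is greater than zero.
        const int mask = _mm_movemask_ps(_mm_cmpgt_ps(sum, zero));
        if (mask == 0) continue;  // no lane in this group of four is kept
        alignas(16) float lanes[4];
        _mm_store_ps(lanes, sum);
        for (int k = 0; k < 4; ++k)
            if (mask & (1 << k)) out.push_back(lanes[k]);
    }
    for (; i < n; ++i) {  // scalar tail for the last n % 4 elements
        const float s = a[i] + b[i];
        if (s > 0.0f) out.push_back(s);
    }
    return out;
}
```

The addition and the comparison are vectorized here, but the selected lanes are still written out one at a time, which is exactly the part I would like to do better.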