
It is easy to find examples online of how to handle conditional branching in SIMD C++ code using comparison intrinsics and masking. What I don't quite get, however, is the most efficient way (with special care for memory allocation and cache misses) to conditionally return only part of the results generated by a SIMD calculation, rather than all of them.

For instance, suppose we have two float containers, a and b, each of size N, and, to keep the example simple, we just want a function that iterates over their values and sums them together. However, the function has to return only the results that are greater than zero.

Let's start by including what we need:

#include <vector>
#include <random>
#include <ctime>        // time(), used to seed the generator
#include <immintrin.h>  // SSE/AVX intrinsics
using namespace std;

Now, suppose that these are the two containers a and b of size N:

const int N = 10000;
vector<float> a(N);
vector<float> b(N);

default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(-1.0f, 1.0f);
for (int i = 0; i < N; i++)
{
    a[i] = diceroll(randomGenerator);
    b[i] = diceroll(randomGenerator);
}

Next, a simple SIMD version of our function that just sums a and b together and returns all the results could be:

vector<float> ourfunction( const vector<float>& a, const vector<float>& b )
{
    vector<float> results(N);
    const int alignedN = N - N % 4;
    for (int i = 0; i < alignedN; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);  // loadu: vector data isn't guaranteed 16-byte aligned
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&results[i], _mm_add_ps(va, vb));
    }
    for (int i = alignedN; i < N; i++)        // scalar tail for N not divisible by 4
            results[i] = a[i] + b[i];

    return results;
}

Now, what if we want the function to return only the results that ended up being greater than zero? Sure, I see that we could just iterate over the vector<float> results and erase the elements that are less than or equal to zero, or copy the elements that are greater than zero into a new temporary container and return that one.

But these seem like really cumbersome solutions, and they are certainly quite inefficient. Thus, my question: what is the correct, or most efficient, way to do this? That is, to retrieve only some of the results of a calculation done with SIMD instead of all of them?
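For reference, the copy-into-a-new-container approach I mean would be something like the following (a minimal scalar sketch; the function name is just for illustration):

```cpp
#include <vector>
#include <cstddef>

// Scalar baseline: sum a and b elementwise and keep only the sums
// greater than zero. This is the behaviour a SIMD version should match.
std::vector<float> sum_and_filter_scalar(const std::vector<float>& a,
                                         const std::vector<float>& b)
{
    std::vector<float> out;
    out.reserve(a.size());                 // one allocation up front
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float s = a[i] + b[i];
        if (s > 0.0f)
            out.push_back(s);
    }
    return out;
}
```

This is branchy and processes one element at a time, which is exactly what I'd like to avoid.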

I'm interested in instruction sets from SSE2 to AVX2 (or even future stuff like AVX512), since CPU-dispatching is an option for my use-case.

Peter Cordes
Kim Shutter
  • What you're looking for is called stream compaction. – user703016 Nov 18 '16 at 04:55
  • Err sorry, skimmed too quickly. You want to filter an array into another array. This is called left-packing. It's possible to do it very efficiently with AVX2 (see my answer on the linked duplicate), or reasonably efficiently by looking up a pshufb shuffle mask from a lookup table for SSSE3 or later (see the code in the question on the linked duplicate, and explanation in the very nice PDF it links to). – Peter Cordes Nov 18 '16 at 04:56
  • To apply that to your case, do the left-packing on the vector-results from `_mm_add_ps(_a, _b)`, so you're basically filtering your results into the output array (instead of filtering from a single input array). Oh, and don't use `_`-prefixed names for your own variables. They're [reserved for use by the implementation](http://stackoverflow.com/questions/228783/what-are-the-rules-about-using-an-underscore-in-a-c-identifier). – Peter Cordes Nov 18 '16 at 05:01
  • @PeterCordes Thanks for pointing me in the right direction. Just to clarify: you mention solutions for AVX2 and for SSE3 or later, but are there also solutions for SSE and SSE2? I thought the explanation given in that PDF would work for such cases too. – Kim Shutter Nov 18 '16 at 05:08
  • Note that SSSE3 (Core2 and later) is not the same as SSE3 (Pentium4 and later). Actually yes, I'd forgotten that the PDF shows a technique that uses only SSE2, with O(log2(vector_width)) steps because there's no variable-mask shuffle available until SSSE3. It's significantly slower than looking up a shuffle-control vector from a LUT, though, so if you're doing any kind of CPU-dispatching to take advantage of the instruction-set supported on your target CPU, you should do that here. (And take advantage of POPCNT, which was added along with SSE4.2 in Intel Nehalem). – Peter Cordes Nov 18 '16 at 05:17
  • 1
    @PeterCordes identifiers starting with `_` and not followed by a capital letter are only reserved in global scope. – user703016 Nov 18 '16 at 05:18
  • BTW, this is why it's useful to mention in your question which instruction-sets you care about, and tag with [tag:SSE] or [tag:AVX], not just [tag:simd]. For example, ARM's SIMD instruction set doesn't have an equivalent for MOVMSKPS, so you can't look up a shuffle mask on ARM at all. – Peter Cordes Nov 18 '16 at 05:20
  • 1
    @GundolfGundelfinger: right, I noticed that after reading some more of the link I found. I think it makes a lot more sense to just always avoid leading underscores (and double-underscores internally) in the first place. – Peter Cordes Nov 18 '16 at 05:21
  • 1
    @PeterCordes Thanks! I misread SSE3, but I think I see what you mean: probably a `leftpack` function that CPU-dispatches 4 types of codes would be usefull: one for pre-SSSE3 using Insomaniac's approach; one for SSSE3-to-AVX; one for AVX2 and one for AVX512. Only issue that I'm still confused between your second and third comments above: Insomaniac's solution is for SSSE3 or SSE2/SSE3? In the second comment you said the former, in the third comment you seem to have said the later. – Kim Shutter Nov 18 '16 at 05:27
  • @PeterCordes Oh, good point about being more specific about the instruction-sets I was interested in. In this case I was interested in multiple solutions for CPU-dispatching, but I will make sure to be clearer about that in the future! – Kim Shutter Nov 18 '16 at 05:28
  • I haven't read that PDF in a long time, I forget what's in it. What you said made me remember that it mentioned a fixed-shuffle algo, but still guessing/vaguely remembering here! If it doesn't use `_mm_shuffle_epi8` ([PSHUFB](http://www.felixcloutier.com/x86/PSHUFB.html)) or `_popcnt_u32` ([POPCNT](http://www.felixcloutier.com/x86/POPCNT.html)), there's probably nothing else that requires more than SSE2. You could check by trying to compile it with gcc with no `-march=` or `-mssse3` flags: gcc won't let you use intrinsics for instruction-sets you haven't enabled. – Peter Cordes Nov 18 '16 at 05:38
  • There are too many different versions of SSE/AVX instruction to tag everything you're interested in. IMO, best practice is to tag with SSE and/or AVX, and mention in the text which version of what you're interested in. I don't follow every single SSE tag, so I prob. wouldn't notice a question tagged `[sse4]` but at least one of `[x86]`, `[sse]`, or `[avx]`. I made a quick edit to your question to demonstrate the recommended way to do this :) – Peter Cordes Nov 18 '16 at 05:42
  • 1
    @PeterCordes Amazing job, I appreciate the edit so I can lear how to make better use of the site for me and also how to help other who find my questions by searching. Thanks! And about the details of Insomaniac's and yours implementations, I will be checkin both in more detail tomorrow. – Kim Shutter Nov 18 '16 at 06:01
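To make the left-packing idea from the comments concrete, here is a hedged SSSE3 sketch (all names are my own, not from any of the linked answers). It assumes `out` has room for at least `n` floats, since each vector step stores a full 16 bytes at the current write position before advancing it by the number of kept lanes:

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>
#include <cstddef>

// 16-entry table of pshufb controls: entry m moves the 4-byte lanes whose
// bit is set in m to the front of the vector, preserving their order.
static const std::array<std::array<std::uint8_t, 16>, 16> packTable = [] {
    std::array<std::array<std::uint8_t, 16>, 16> t{};
    for (int m = 0; m < 16; ++m) {
        int dst = 0;
        for (int lane = 0; lane < 4; ++lane) {
            if (m & (1 << lane)) {
                for (int byte = 0; byte < 4; ++byte)
                    t[m][dst * 4 + byte] = std::uint8_t(lane * 4 + byte);
                ++dst;
            }
        }
        // Unused control bytes stay 0; the lanes they produce are garbage
        // that the next store (or the returned count) makes irrelevant.
    }
    return t;
}();

// Sum a[i] + b[i] and left-pack the sums > 0 into out; returns how many
// were kept. out must have room for n floats, because each vector step
// stores a full 16 bytes before the write position is advanced.
__attribute__((target("ssse3")))  // lets this compile without -mssse3 on gcc/clang
std::size_t sum_and_leftpack(const float* a, const float* b,
                             float* out, std::size_t n)
{
    std::size_t o = 0, i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 sum  = _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        int    mask = _mm_movemask_ps(_mm_cmpgt_ps(sum, _mm_setzero_ps()));
        __m128i ctrl   = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(packTable[mask].data()));
        __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(sum), ctrl);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + o), packed);
        // Advance by the number of kept lanes (use _mm_popcnt_u32 on SSE4.2+).
        o += (mask & 1) + ((mask >> 1) & 1) + ((mask >> 2) & 1) + (mask >> 3);
    }
    for (; i < n; ++i) {                     // scalar tail
        const float s = a[i] + b[i];
        if (s > 0.0f) out[o++] = s;
    }
    return o;
}
```

On SSSE3-to-AVX targets this does one table lookup, one shuffle, and one unaligned store per four elements; AVX2 and AVX512 targets can replace the pshufb with vpermps or a compress instruction, as discussed in the comments.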

0 Answers