SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

Question

A piece of C++ code determines the occurances of zero and keeps a binary flag variable for each number that is checked. The value of the flag toggles between 0 and 1 each time a zero is encountered in a 1 dimensional array.

I am attempting to use SSE to speed it up, but I am unsure of how to go about this. Evaluating the individual fields of __m128i is inefficient, I've read.

The code in C++ is:

int flag = 0;
int var_num2[1000];
for(int i = 0; i<1000; i++)
{  
    if (var[i] == 0)
    {
        var_num2[i] = flag;
        flag = !flag;  //toggle value upon encountering a 0
     }
}

How should I go about this using SSE intrinsics?

Trying to get faster than a "not" on a var that is probably already in a register is... optimistic. — Mat, Dec 18 '17 at 09:27
If `var` is the input, what is the role of `var_num2` in the implementation? — Codor, Dec 18 '17 at 09:28
@Mat, how would you do it instead? Without accessing individual fields of an __m128i variable. — Blue, Dec 18 '17 at 09:31
The algorithm is sequential in nature; I doubt that it can be sped up using SSE. — Codor, Dec 18 '17 at 09:43
Can you please elaborate? I am new to SSE. I was attempting to have a for loop `for(int i =0; i <1000; i+=4)`, and I actually have a pointer and not an array for variable named `var`. So I can perform an operation on 4 locations at once. — Blue, Dec 18 '17 at 09:46
@Mat: True, but realistically the only way to go faster than a `NOT` on a single value is to do so in parallel, and SSE is the logical way of doing that. The problem here is that output `var_num2[i]` has a dependency on `var[0]..var[i]` — MSalters, Dec 18 '17 at 10:00
@Codor: It isn't actually sequential, despite the looks of it. It's just another divide-and-conquer O(N log N) problem. — MSalters, Dec 18 '17 at 10:11
Also, in Visual Studio you can enable SSE code generation and see if the compiler actually tried to generate something interesting. — StarShine, Dec 18 '17 at 10:15
@StarShine: We have a decent idea how the SSE code generation works in Visual Studio, due to the `-Qvec-report` option. I expect this to fail with `-Qvec-report` reason 1000; `flag` is a data-dependency between loop iterations. That's why my answer does away with `flag`. — MSalters, Dec 18 '17 at 13:35
@MSalters The partial sum approach is indeed a terrific answer, but I'm sure there are other tricks to count the occurance of zero's in a bitstream, which is what this requires. If latency permits, instead of using SSE, you can use GPU processing to run your partial sum across 128 lanes (and more) in parallel on vastly larger numbers, but I digress. — StarShine, Dec 18 '17 at 17:58

score 3 · Answer 1 · answered Dec 18 '17 at 10:10

3

You'd have to recognize the problem, but this is a variation of a well-known problem. I'll first give a theoretical description

Introduce a temporary array not_var[] which contains 1 if var contains 0 and 0 otherwise. Introduce a temporary array not_var_sum[] which holds the partial sum of not_var. var_num2 is now the LSB of not_var_sum[]

The first and third operation are trivially parallelizable. Parallelizing a partial sum is only a bit harder.

In a practical implementation, you wouldn't construct not_var[], and you'd write the LSB directly to var_num2 in all iterations of step 2. This is valid because you can discard the higher bits. Keeping just the LSB is equivalent to taking the result modulo 2, and (a+b)%2 == ((a%2) + (b%2))%s.

answered Dec 18 '17 at 10:10

MSalters

173,980
10
155
350

Do you mean 'and `1` otherwise'? – Codor Dec 18 '17 at 10:14
@Codor: No. In code, `not_var[i] = (var[i]==0)?:1:0` – MSalters Dec 18 '17 at 10:15
This is a good idea, but the catch is that the initial value of the variable `flag` is set by a condition dependent on a different variable. And once that is set, it needs to toggle. – Blue Dec 18 '17 at 11:32
@Blue: Minor detail, just invert the output. To be clear: this is just the Comp.Sci approach, by the usual process of reducing it to a solved problem. – MSalters Dec 18 '17 at 11:56
@Blue: If zeros in `var[]` are rare, then you might use a one-pass algorithm that optimistically scans through whole vectors of non-zero, and falls back to serialized code when there's a zero. – Peter Cordes Dec 18 '17 at 13:02

score 0 · Answer 2 · answered Dec 21 '17 at 17:40

What type are the elements of var[]? int? Or char? Are zeroes frequent?

A SIMD prefix sum aka partial is possible (with log2(vector_width) work per element, e.g. 2 shuffles and 2 adds for a vector of 4 float), but the conditional-store based on the result is the other major problem. (Your array of 1000 elements is probably too small for multi-threading to be profitable.)

An integer prefix-sum is easier to do efficiently, and the lower latency of integer ops helps. NOT is just adding without carry, i.e. XOR, so use _mm_xor_si128 instead of _mm_add_ps. (You'd be using this on the integer all-zero/all-one compare result vector from _mm_cmpeq_epi32 (or epi8 or whatever, depending on the element size of var[]. You didn't specify, but different choices of strategy are probably optimal for different sizes).

But, just having a SIMD prefix sum actually barely helps: you'd still have to loop through and figure out where to store and where to leave unmodified.

I think your best bet is to generate a list of indices where you need to store, and then

for (size_t j = 0 ; j < scatter_count ; j+=2) {
    var_num2[ scatter_element[j+0] ] = 0;
    var_num2[ scatter_element[j+1] ] = 1;
}

You could generate the whole list if indices up-front, or you could work in small batches to overlap the search work with the store work.

The prefix-sum part of the problem is handled by alternately storing 0 and 1 in an unrolled loop. The real trick is avoiding branch mispredicts, and generating the indices efficiently.

To generate scatter_element[], you've transformed the problem into left-packing (filtering) an (implicit) array of indices based on the corresponding _mm_cmpeq_epi32( var[i..i+3], _mm_setzero_epi32() ). To generate the indices you're filtering, start with a vector of [0,1,2,3] and add [4,4,4,4] to it (_mm_add_epi32). I'm assuming the element size of var[] is 32 bits. If you have smaller elements, this require unpacking.

BTW, AVX512 has scatter instructions which you could use here, otherwise doing the store part with scalar code is your best bet. (But beware of Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake when just storing without loading.)

To overlap the left-packing with the storing, I think you want to left-pack until you have maybe 64 indices in a buffer. Then leave that loop and run another loop that left-packs indices and consumes indices, only stopping if your circular buffer is full (then just store) or empty (then just left-pack). This lets you overlap the vector compare / lookup-table work with the scatter-store work, but without too much unpredictable branching.

If zeros are very frequent, and var_num2[] elements are 32 or 64 bits, and you have AVX or AVX2 available, you could consider doing an standard prefix sum and using AVX masked stores. e.g. vpmaskmovd. Don't use SSE maskmovdqu, though: it has an NT hint, so it bypasses and evicts data from cache, and is quite slow.

Also, because your prefix sum is mod 2, i.e. boolean, you could use a lookup table based on the packed-compare result mask. Instead of horizontal ops with shuffles, use the 4-bit movmskps result of a compare + a 5th bit for the initial state as an index to a lookup table of 32 vectors (assuming 32-bit element size for var[]).

SSE optimisation for a loop that finds zeros in an array and toggles a flag + updates another array

2 Answers2