how to use SSE to process array of ints, using a condition

Question

I'm new to SSE, and limited in knowledge. I'm trying to vectorize my code (C++, using gcc), which is actually quite simple. I have an array of unsigned ints, and I only check for elements that are >=, or <= than some constant. As result, I need an array with elements that passed condition. I'm thinking to use 'mm_cmpge_ps' as a mask, but this construct work over floats not ints!? :(

any suggestion, help is very much appreciated.

See `_mm_cmpgt_epi32`. It is a signed comparison, but you can work around that. But do you want compaction, or just put some other number in the positions that didn't pass the check? — harold, Jun 14 '13 at 09:36
well, if it hurts performance, that I can live without compaction, will simply put '0', or you have a better suggestion? — Nik Kovac, Jun 14 '13 at 09:59
`gcc -O3` with a suitable `-march=...` flag vectorizes `void f(unsigned *p,unsigned l, int n){ for(int i=0;i — Marc Glisse, Jun 14 '13 at 10:36

Paul R · Accepted Answer · 2013-06-14T17:51:59.367

2

It's pretty easy to just mask out (i.e. set to 0) all non-matching ints. e.g.

#include <emmintrin.h>    // SSE2 intrinsics

for (int i = 0; i < N; i += 4)
{
    __m128i v = _mm_load_si128(&a[i]);
    __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
    __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
    __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
    v = _mm_and_si128(v, vcmp);
    _mm_store_si128(&a[i], v);
}

Note that a needs to be 16 byte aligned and N needs to be a multiple of 4 - if these constraints are a problem then it's not too hard to extend the code to cope with this.

edited Jun 14 '13 at 17:51

answered Jun 14 '13 at 10:21

Paul R

208,748
37
389
560

which header should I use in order to try this code (GCC)? __m256i is a new type that I first see it here. Are this instructions AVX? my compiler fails to include gmmintrin.h for AVX. Also, can you explain what does 'MIN_VAL - 1' do? as well as MAX_VAL - 1? – Nik Kovac Jun 14 '13 at 10:58
can you explain, why are you using once __m128i vcmp, and __m256i for comparing vectors? – Nik Kovac Jun 14 '13 at 12:10
Sorry - typos - I've been working with AVX/AVX2 a lot lately so my fingers were automatically typing 256 instead of 128 - fixed now. I've also added the relevant #include for SSE2. – Paul R Jun 14 '13 at 13:05
I just added an answer which used your code for one function. – Jun 14 '13 at 13:09
1

Wouldn't it be better if the set1's were outside the loop, or do compilers do that automatically these days? – harold Jun 14 '13 at 13:25
@harold: I've never found a compiler that fails to optimise this, but yes, if you want to play it safe you could initialise these constant vectors prior to the loop. – Paul R Jun 14 '13 at 13:27
You are assuming that the array is aligned and that N is a multiple of 4. Also, since the OP is using gcc (hopefully a recent one), after `__v4si *b = (__v4si*) a;` the interior of the loop can be rewritten as `__v4si v = *b; *b = v & (v >= MIN_VAL) & (v <= MAX_VAL);`. – Marc Glisse Jun 14 '13 at 17:35
@Marc: true - I should have noted the alignment and array size constraints. I disagree regarding the gcc-specific syntax though on the grounds of portability. – Paul R Jun 14 '13 at 17:50
@PaulR: agreed for portability, but it is so much more readable... Please complain to vendors who don't support a readable syntax. – Marc Glisse Jun 14 '13 at 17:58
@Marc: well the gcc syntax only really works for a small subset of possible SIMD routines - anything beyond a relatively trivial function generally needs to be written with intrinsics anyway, so you have to "bite the bullet" at some point. Also, for trivial functions you can often get auto-vectorization to work. – Paul R Jun 14 '13 at 18:26
@PaulR: it's the same as with scalars, you write `n+1`, not `scalar_plus_int32(n,scalar_load_int32_cst(1))`, for a small subset of operations, and for the others (say factorial) you call a function. Writing vector code with gcc's extension and just a couple intrinsics here and there already helps readability a lot. And it makes it easier to write code that works with SSEx but benefits from SSEy if compiled for it. – Marc Glisse Jun 14 '13 at 21:31
@Marc: yes, I can see the attraction, but the gcc syntax only maps to a subset of SIMD operations - it doesn't handle packing/unpacking, permutations/shuffles, fixed point math, reductions, etc - it's fine for very simple fixed element width stuff, but for most real world SIMD stuff it doesn't really help. – Paul R Jun 15 '13 at 05:22
@PaulR: it does handle permutations via __builtin_shuffle (clang has a slightly different name and syntax). You can even write `vec w={v[1],v[0]};` and gcc will detect the shuffle. – Marc Glisse Jun 15 '13 at 07:31
@Marc: interesting about the shuffle syntax - I didn't know about that. – Paul R Jun 15 '13 at 07:50

score 1 · Answer 2 · edited May 23 '17 at 10:25

Here you go. Here are three functions.

The first function,foo_v1, is based on Paul R's answer.

The second function,foo_v2, is based on a popular question today Fastest way to determine if an integer is between two integers (inclusive) with known sets of values

The third function, foo_v3 uses Agner Fog's vectorclass which I added only to show how much easier and cleaner it is to use his class. If you don't have the class then just comment out the #include "vectorclass.h" line and the foo_v3 function. I used Vec8ui which means it will use AVX2 if available and break it into two Vec4ui otherwise so you don't have to change your code to get the benefit of AVX2.

#include <stdio.h>
#include <nmmintrin.h>                 // SSE4.2
#include "vectorclass.h"

void foo_v1(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i vcmp0 = _mm_cmpgt_epi32(v, _mm_set1_epi32(MIN_VAL - 1));
        __m128i vcmp1 = _mm_cmplt_epi32(v, _mm_set1_epi32(MAX_VAL + 1));
        __m128i vcmp = _mm_and_si128(vcmp0, vcmp1);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v2(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    //if ((unsigned)(number-lower) < (upper-lower))
    for (int i = 0; i < N; i += 4) {
        __m128i v = _mm_load_si128((const __m128i*)&a[i]);
        __m128i dv = _mm_sub_epi32(v, _mm_set1_epi32(MIN_VAL));
        __m128i min_ab = _mm_min_epu32(dv,_mm_set1_epi32(MAX_VAL-MIN_VAL));
        __m128i vcmp = _mm_cmpeq_epi32(dv,min_ab);
        v = _mm_and_si128(v, vcmp);
        _mm_store_si128((__m128i*)&a[i], v);
    }
}

void foo_v3(const int N, int *a, const int MAX_VAL, const int MIN_VAL) {
    //if ((unsigned)(number-lower) < (upper-lower))
    for (int i = 0; i < N; i += 8) {
        Vec8ui va = Vec8ui().load(&a[i]);
        va &= (va - MIN_VAL) <= (MAX_VAL-MIN_VAL);
        va.store(&a[i]);
    }
}

int main() {
    const int N = 16;
    int* a = (int*)_mm_malloc(sizeof(int)*N, 16);
    for(int i=0; i<N; i++) {
        a[i] = i;
    }
    foo_v2(N, a, 7, 3);
    for(int i=0; i<N; i++) {
        printf("%d ", a[i]);
    } printf("\n");
    _mm_free(a);
}

Nice - it would be interesting to benchmark all three - my guess is that there won't be much difference. — Paul R, Jun 14 '13 at 13:29
Probably not, I tried to do foo_v2 in less instructions but was not able to. I wanted to use only _mm_cmplt_epi32 and not _mm_min_epu32 and _mm_cmpeq_epi32. — , Jun 14 '13 at 13:34
No problem, I'm still learning SSE/AVX myself, I suggest looking at the code in the vectorclass. That can answer 90% of your questions. The rest you can find on SO. — , Jun 14 '13 at 13:43
is there a way to count the number of such '0' elements using SSE? — Nik Kovac, Jun 14 '13 at 14:15
@PaulR, I forgot to mention, that if one has XOP, the >= can be done in one instruction with _mm_comge_epu32(a,b). That would be one instruction less. — , Jun 14 '13 at 18:54
@NikKovac, why don't you ask that as a question for others to try and answer. You could probably do it with _mm_movemask_epi8 among other options. — , Jun 14 '13 at 18:56
Yes, there are several neat ways of counting masked (or non-masked) elements - that would make a good (separate) question. — Paul R, Jun 15 '13 at 14:15

score 0 · Answer 3 · answered Jun 14 '13 at 09:34

0

First place to look might be Intel® Intrinsics Guide

answered Jun 14 '13 at 09:34

Kevin MOLCARD

2,168
3
22
35

how to use SSE to process array of ints, using a condition

3 Answers3