AVX2 float compare and get 0.0 or 1.0 instead of all-0 or all-one bits

Question

Basically, in the resulting vector, I want to store 1.0 for every input floating-point value > 1 and 0.0 for every input value <= 1. Here is my code:

float f[8] = {1.2, 0.5, 1.7, 1.9, 0.34, 22.9, 18.6, 0.7};
float r[8]; // Must be {1, 0, 1, 1, 0, 1, 1, 0}

__m256i tmp1 = _mm256_cvttps_epi32(_mm256_loadu_ps(f));
__m256i tmp2 = _mm256_cmpgt_epi32(tmp1, _mm256_set1_epi32(1));
_mm256_store_ps(r, _mm256_cvtepi32_ps(tmp2));

for(int i = 0; i < 8; i++)
    std::cout << f[i] << " : " << r[i] << std::endl;

But I don't get the correct results. This is what I get. Why aren't AVX2 relational operations working properly for me?

1.2 : 0
0.5 : 0
1.7 : 0
1.9 : 0
0.34 : 0
22.9 : -1
18.6 : -1
0.7 : 0

From the comment it seems like you just want to compare floats and get 1.0 if true, that's just `_mm256_cmp_ps` and `_mm256_and_ps`, casting to ints changes the values so you get a completely different result. Eg `(int)1.2 = 1` and therefore not bigger than 1 — harold, Apr 29 '17 at 19:34
I'm certain I saw just this question just last week, but I can't find it. — MSalters, Apr 29 '17 at 22:10

score 5 · Accepted Answer · edited Aug 07 '17 at 08:33

I think it's better to use _mm256_cmp_ps for this. The following program does a bit more than you asked for. With every element of mask set to 1.0, the result holds 1.0 wherever the input is greater than 1.0 and 0.0 elsewhere. If you set mask to another positive value, that value becomes both the threshold of the comparison and the number stored for the elements that pass it.

//gcc 6.2, Linux-mint, Skylake 
#include <stdio.h>
#include <x86intrin.h>

float __attribute__(( aligned(32))) f[8] = {1.2, 0.5, 1.7, 1.9, 0.34, 22.9, 18.6, 1.0};
// float __attribute__(( aligned(32))) r[8]; // Must be {1, 0, 1, 1, 0, 1, 1, 0}
// in C++11, use alignas(32).  Or C11 _Alignas(32), instead of GNU C __attribute__.

void printVecps(__m256 vec)
{
    float tempps[8];
    _mm256_storeu_ps(&tempps[0], vec);   // unaligned store: tempps is not guaranteed to be 32-byte aligned
    printf(" [0]=%3.2f, [1]=%3.2f, [2]=%3.2f, [3]=%3.2f, [4]=%3.2f, [5]=%3.2f, [6]=%3.2f, [7]=%3.2f \n",
    tempps[0],tempps[1],tempps[2],tempps[3],tempps[4],tempps[5],tempps[6],tempps[7]) ;

}

int main()
{

    __m256 mask = _mm256_set1_ps(1.0), vec1, vec2, vec3;

    vec1 = _mm256_load_ps(&f[0]);                   printf("vec1 : ");printVecps(vec1); // load f[0]..f[7]
    vec2 = _mm256_cmp_ps ( mask, vec1, _CMP_LT_OS /*0x1*/);
                                                    printf("vec2 : ");printVecps(vec2); // mask < vec1: all-one bits (prints as -nan) where true, 0.0 where false
    vec3 = _mm256_min_ps (vec2 , mask);             printf("vec3 : ");printVecps(vec3); // min(vec2, mask): NaN lanes become mask, 0.0 lanes stay 0.0

    return 0;
}

The output for mask = {1,1,1,1,1,1,1,1} is :

vec1 :  [0]=1.20, [1]=0.50, [2]=1.70, [3]=1.90, [4]=0.34, [5]=22.90, [6]=18.60, [7]=1.00 
vec2 :  [0]=-nan, [1]=0.00, [2]=-nan, [3]=-nan, [4]=0.00, [5]=-nan, [6]=-nan, [7]=0.00 
vec3 :  [0]=1.00, [1]=0.00, [2]=1.00, [3]=1.00, [4]=0.00, [5]=1.00, [6]=1.00, [7]=0.00

And for mask = {2,2,2,2,2,2,2,2} is :

vec1 :  [0]=1.20, [1]=0.50, [2]=1.70, [3]=1.90, [4]=0.34, [5]=22.90, [6]=18.60, [7]=1.00 
vec2 :  [0]=0.00, [1]=0.00, [2]=0.00, [3]=0.00, [4]=0.00, [5]=-nan, [6]=-nan, [7]=0.00 
vec3 :  [0]=0.00, [1]=0.00, [2]=0.00, [3]=0.00, [4]=0.00, [5]=2.00, [6]=2.00, [7]=0.00

This depends on the non-commutative behaviour of _mm256_min_ps with NaNs to replace the NaN elements with 1.0. Each lane of min(a, b) is computed as a < b ? a : b, so with a = NaN the result is NaN < 1.0 ? NaN : 1.0 = 1.0, because any ordered comparison with NaN is false.
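The NaN handling described above can be illustrated with a scalar model of a single lane. This is only a sketch: `min_ps_lane` and `select_lane` are hypothetical helper names, not intrinsics, and the model assumes the code is built without -ffast-math so that comparisons with NaN keep their IEEE meaning.

```c
#include <math.h>

/* Hypothetical scalar model of one lane of _mm256_min_ps(a, b):
   the result is a only when a < b holds, otherwise it is b. */
float min_ps_lane(float a, float b)
{
    return (a < b) ? a : b;
}

/* One lane of the compare-then-min trick (hypothetical helper).
   A true compare yields all-one bits, which read as a float are a NaN;
   a false compare yields 0.0f. value is assumed to be positive. */
float select_lane(int compare_true, float value)
{
    float cmp_result = compare_true ? NAN : 0.0f;
    return min_ps_lane(cmp_result, value);  /* NaN must be the first operand */
}
```

With these definitions, `select_lane(1, 1.0f)` returns 1.0f and `select_lane(0, 1.0f)` returns 0.0f, while swapping the operands, as in `min_ps_lane(1.0f, NAN)`, returns NaN. That is why the operand order matters.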

Beware that gcc before 7.0 treats the 128b _mm_min_ps intrinsic as commutative even without -ffast-math (even though it knows the minps instruction isn't). Use an up-to-date gcc, or make sure that gcc chooses to compile your code with the operands in the order needed by this algorithm. (Or use clang). It's possible that gcc won't ever swap the operands with AVX, only with SSE (to avoid extra movapd instructions), but the safest thing is to use gcc7 or later.

There are symbolic names for [AVX `cmp_ps` predicates](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5452,4067,4729,94,76,270,3622,3874,3984,2679,1902,1234,3603,3517,701&text=cmpps): 1 is `_CMP_LT_OS` — Peter Cordes, Aug 07 '17 at 07:07
`__attribute__(( aligned(32)))` can be done in portable C++11 with `alignas(32)`. Or in portable C11 with `_Alignas(32)`. Also, I'd suggest using `__m256 mask = _mm256_set1_ps(1.0f)` instead of a braced initializer. — Peter Cordes, Aug 07 '17 at 07:10

Jonas · Answer 2 · 2017-04-29T19:59:23.703

When a float is converted to int using _mm256_cvttps_epi32, the integer returned is a truncated (rounded towards zero) value. That is, the values 1.2, 1.7, and 1.9 are converted to 1, and they are thus not greater than 1.

The output of _mm256_cmpgt_epi32 is not 1 but "all ones", from the docs:

... if the s1 data element is greater than the corresponding element in s2, then the corresponding element in the destination vector is set to all 1s.

Interpreted as a two's-complement integer, "all ones" is -1, which is what your results show.
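To make the two effects concrete, here is a rough scalar sketch of what happens to a single element in the question's code. `question_lane` is a hypothetical name used only for illustration, and the casts stand in for the vector conversions for inputs in the normal int range.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of the question's pipeline. */
float question_lane(float x)
{
    int32_t truncated = (int32_t)x;           /* like _mm256_cvttps_epi32: rounds toward zero */
    int32_t mask = (truncated > 1) ? -1 : 0;  /* like _mm256_cmpgt_epi32: all ones (-1) or zero */
    return (float)mask;                       /* like _mm256_cvtepi32_ps: -1 becomes -1.0f */
}
```

Here `question_lane(1.2f)` gives 0.0f because 1.2 truncates to 1, and `question_lane(22.9f)` gives -1.0f because the all-ones mask is converted as the integer -1.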

Off topic:

Why do you use an unaligned load and an aligned store?
You should take a look at _mm256_cmp_ps
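As a follow-up to the _mm256_cmp_ps pointer, and to the `_mm256_and_ps` suggestion in the comments on the question, here is a minimal sketch of the compare-and-mask approach. `gt_one_to_float` is a hypothetical name. The `target("avx")` attribute is a GCC/clang feature that lets the file build without -mavx. With other compilers, drop the attribute and enable AVX on the command line.

```c
#include <immintrin.h>

/* Writes 1.0f to out[i] where in[i] > 1.0f and 0.0f elsewhere, for i in 0..7.
   Hypothetical helper; uses unaligned loads and stores so that the
   arrays need no special alignment. */
__attribute__((target("avx")))
void gt_one_to_float(const float *in, float *out)
{
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 v    = _mm256_loadu_ps(in);
    /* Each lane becomes all-one bits where v > 1.0f, all-zero bits otherwise. */
    __m256 mask = _mm256_cmp_ps(v, one, _CMP_GT_OQ);
    /* ANDing with the bit pattern of 1.0f keeps 1.0f in the true lanes
       and leaves 0.0f in the false lanes. */
    _mm256_storeu_ps(out, _mm256_and_ps(mask, one));
}
```

Called on the question's array f with r as the output, this should fill r with {1, 0, 1, 1, 0, 1, 1, 0}. Unlike the min-based variant, it does not depend on how NaN operands are ordered.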
