Comparison with NaN using AVX

Question

I am trying to create a fast decoder for BPSK using the AVX intrinsics of Intel. I have a set of complex numbers that are represented as interleaved floats, but due to the BPSK modulation only the real part (or the even indexed floats) are needed. Every float x is mapped to 0, when x < 0 and to 1 if x >= 0. This is accomplished using the following routine:

static inline void
normalize_bpsk_constellation_points(int32_t *out, const complex_t *in, size_t num)
{
    static const __m256             _min_mask = _mm256_set1_ps(-1.0);
    static const __m256             _max_mask = _mm256_set1_ps(1.0);
    static const __m256             _mul_mask = _mm256_set1_ps(0.5);

    __m256                          res;
    __m256i                         int_res;

    size_t i;
    gr_complex                      temp;
    float                           real;

    for(i = 0; i < num; i += COMPLEX_PER_AVX_REG){
            res = _mm256_load_ps((float *)&in[i]);

            /* clamp them to avoid segmentation faults due to indexing */
            res = _mm256_max_ps(_min_mask, _mm256_min_ps(_max_mask, res));

            /* Scale accordingly for proper indexing -1->0, 1->1 */
            res = _mm256_add_ps(res, _max_mask);
            res = _mm256_mul_ps(res, _mul_mask);

            /* And then round to the nearest integer */
            res = _mm256_round_ps(res, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);

            int_res = _mm256_cvtps_epi32(res);

            _mm256_store_si256((__m256i *) &out[2*i], int_res);
    }
}

Firstly, I clamp all the received floats in the range [-1, 1]. Then after some proper scaling, the result is rounded to the nearest integer. That will map all floats above 0.5 to 1 and all floats below 0.5 to 0.

The procedure works fine if the input floats are normal numbers. However, due to some situations at previous stages, there is a possibility that some input floats are NaN or -NaN. At this case, 'NaN' numbers are propagated through the _mm256_max_ps(), _mm256_min_ps() and all other AVX functions resulting to an integer mapping of -2147483648 which of course causes my program to crash due to invalid indexing.

Is there any workaround to avoid this problem, or at least set the NaN to 0 using AVX?

`(float *)&in[i]` casts a `complex *` to `float *`. Invitation to undefined behaviour. — too honest for this site, Aug 04 '15 at 20:34
No, `complex_t` is pointing to a memory region with interleaved floats representing a complex number. — Manos, Aug 04 '15 at 20:36
Still an aliasing issue imo. And the layout of complex_t is implementation defined. — too honest for this site, Aug 04 '15 at 20:42
@Olaf OP is already committed to compiler intrinsics, so I doubt that implementation-defined is a problem for him. — Nicu Stiurca, Aug 04 '15 at 20:48
@SchighSchagh: Not that sure. IIRC, gcc and MSVC use different alignment - thus sizeof - for some floats. But that might just affect `long double`. — too honest for this site, Aug 04 '15 at 20:50
@Olaf, even if I have used the `std::complex` for the complex representation, which is constructed by an `std::vector` of two floats, It wouldn't be a problem, because successive elements of scalar objects are guarantee to be continuous in memory in the `std::vector` class. — Manos, Aug 04 '15 at 20:53
Just to point it out, the `NaN`s are produced due to the equalizing of a false detection of frame and not from a wrong alignment of initialization of a buffer. — Manos, Aug 04 '15 at 20:57
Sorry, I really had absolutely no idea that C now has C++ elements. But even with those, the `sizeof` of a type is not identical with the width. There may very well be padding bits or bytes withtin the size given. So, a complex may very well have padding added, as the `sizeof` is about the whole type, not single elements. — too honest for this site, Aug 04 '15 at 20:58
C has complex types since 1999 and they are guaranteed to have the same layout as a two element array of the corresponding real type. So there should be no padding or alignment problems. — Jens Gustedt, Aug 04 '15 at 21:09
@Olaf yeah I didn't tag it, you are right. The actual project is C++ but my routine and other parts are C. Nevertheless, in SDRs (Software Defined Radios) the above casting from complex to their interleaved floats form is very common and convenient. So I used every day in many cases. — Manos, Aug 04 '15 at 21:09
"complex as a pair of floats" is a well-known area in which C++ might have benefited from more discussion with numerical experts. — MSalters, Aug 05 '15 at 08:40
@MSalters i agree. The majority of operations on complex numbers (addition, subtraction, squared magnitude and several others) is faster if you threat them as a sequence of interleaved floats where even indexed correspond to the real and odd indexed to the imaginary part. This representation speeds-up also SIMD implementations a lot. — Manos, Aug 05 '15 at 08:49
@Manos: Actually, that particular part shouldn't be an issue as it's up to the implementation to handle it properly. The people who write `std::complex::operator+` generally are the same people who decide on the representation of `std::complex`. The biggest issue there is that many compiler vendors have separate library teams. GCC in particular seems to keeps its C++ library at two arms length. For the complex type, this might not be ideal. — MSalters, Aug 05 '15 at 08:59

harold · Accepted Answer · 2015-08-04T20:59:55.187

5

You could do it the simple way to begin with, compare and mask: (not tested)

res = _mm256_cmp_ps(res, _mm256_setzero_ps(), _CMP_NLT_US);
ires = _mm256_srl_epi32(_mm256_castps_si256(res), 31);

Or shift and xor: (also not tested)

ires = _mm256_srl_epi32(_mm256_castps_si256(res), 31);
ires = _mm256_xor_epi32(ires, _mm256_set1_epi32(1));

This version will also care about the sign of NaN (and ignore the NaN-ness).

Alternative for no AVX2 (not tested)

res = _mm256_cmp_ps(res, _mm256_setzero_ps(), _CMP_NLT_US);
res = _mm256_and_ps(res, _mm256_set1_ps(1.0f));
ires = _mm256_cvtps_epi32(res);

edited Aug 04 '15 at 20:59

answered Aug 04 '15 at 20:42

harold

61,398
6
86
164

Hmm I think you are right. I will check the `AVX` version and I will come back for the results ;) – Manos Aug 04 '15 at 20:48
@Manos oh wait, I just noticed, you wanted *integer* 0 and 1? I'll change the code – harold Aug 04 '15 at 20:55
1

You could cheat a bit more and use the float that has the same bitpattern as (int)1 and then write that into memory, no cvt necessary – harold Aug 04 '15 at 21:47

score 4 · Answer 2 · answered Aug 11 '15 at 12:00

Harold posted a good solution for the question you were really asking, but I want to make clear that eliminating NaN values while clamping is totally straightforward. If either argument is a NaN, MINPS and MAXPS simply return the second argument. So all you need to do is swap the argument order and NaNs will be clamped as well. For example, the following would clamp NaNs to _min_mask:

res = _mm256_max_ps(_mm256_min_ps(_max_mask, res), _min_mask);

Comparison with NaN using AVX

2 Answers2

Linked