I want to calculate the RMS with Intel SSE intrinsics, like this:

float rms(float *a, float *b, int l)
{
    int n = 0;
    float r = 0.0f;
    for (int i = 0; i < l; i++)
    {
        if (finitef(a[i]) && finitef(b[i]))
        {
            n++;
            float tmp = a[i] - b[i];
            r += tmp * tmp;
        }
    }
    r /= n;
    return r;
}

But how do I check which elements are NaN? And how do I count n?

    _mm_cmpord_ps(a[i],b[i]) maybe? And you will count n with n+=4. Note that compilers can vectorize this loop themselves. Note also that finitef does not exactly check for NaNs. – Marc Glisse Apr 09 '13 at 21:24
  • Note that this function does not calculate RMS - it calculates mean square difference. – Paul R Apr 10 '13 at 08:55
  • You are right, but the sqrt in SSE is a problem too, I think :-/ – Roby Apr 11 '13 at 16:46
  • Well, you only need the square root of a single value, so you can just call `sqrtf` on the final mean square value, and as long as you are processing a large enough number of points, the cost of this single square root operation should be negligible. – Paul R Apr 11 '13 at 17:37
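A minimal sketch of that final step, assuming the question's `rms` function above is in scope (which, per the comments, really returns a mean square difference):

    #include <math.h>

    float rms(float *a, float *b, int l);  /* the mean-square function above */

    /* One scalar square root on the final mean square value gives the true RMS. */
    float true_rms(float *a, float *b, int l)
    {
        return sqrtf(rms(a, b, l));
    }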

1 Answer


You can test a value for NaN by comparing the value with itself: x == x will return false if x is a NaN. So for an SSE vector of 4 x float values, vx:

    vmask = _mm_cmpeq_ps(vx, vx);

will give you a mask vector with all 0s for NaN elements in vx and all 1s for non-NaN elements. You can use the mask to zero out the NaNs, and you can also use it to count the number of valid data points by treating it as a vector of 32-bit ints and accumulating.
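As a minimal, self-contained illustration of the masking and counting trick (the input values here are made up):

    #include <stdio.h>
    #include <math.h>       /* for the NAN macro */
    #include <emmintrin.h>  /* SSE2 */

    int main(void)
    {
        __m128 vx = _mm_set_ps(1.0f, NAN, 3.0f, NAN);
        __m128 vmask = _mm_cmpeq_ps(vx, vx);      /* all 1s in non-NaN lanes */
        __m128i imask = _mm_castps_si128(vmask);  /* lanes are 0 or -1 as ints */
        __m128i vcount = _mm_sub_epi32(_mm_setzero_si128(), imask); /* 0 or 1 per lane */
        int c[4];
        _mm_storeu_si128((__m128i *)c, vcount);
        printf("valid elements: %d\n", c[0] + c[1] + c[2] + c[3]);  /* prints 2 */
        return 0;
    }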

Here is a working, tested example. Note that it assumes n is a multiple of 4, that it does not require a or b to be 16-byte aligned, and that it requires SSE4:

#include <assert.h>
#include <smmintrin.h>  // SSE4.1 intrinsics (pulls in SSE3/SSSE3 too)

float rms(const float *a, const float *b, int n)
{
    int count;
    float sum;
    __m128i vcount = _mm_set1_epi32(0);
    __m128 vsum = _mm_set1_ps(0.0f);
    assert((n & 3) == 0);                           // n must be a multiple of 4
    for (int i = 0; i < n; i += 4)
    {
        __m128 va = _mm_loadu_ps(&a[i]);            // unaligned loads
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vmaska = _mm_cmpeq_ps(va, va);       // all 1s where a[i] is not NaN
        __m128 vmaskb = _mm_cmpeq_ps(vb, vb);       // all 1s where b[i] is not NaN
        __m128 vmask = _mm_and_ps(vmaska, vmaskb);  // all 1s where both are valid
        __m128 vtmp = _mm_sub_ps(va, vb);
        vtmp = _mm_and_ps(vtmp, vmask);             // zero out lanes with NaNs
        vtmp = _mm_mul_ps(vtmp, vtmp);
        vsum = _mm_add_ps(vsum, vtmp);
        // valid lanes are -1 as 32-bit ints, so subtracting the mask counts them
        vcount = _mm_sub_epi32(vcount, _mm_castps_si128(vmask));
    }
    // horizontal sums, deferred until after the loop
    vsum = _mm_hadd_ps(vsum, vsum);
    vsum = _mm_hadd_ps(vsum, vsum);
    _mm_store_ss(&sum, vsum);
    vcount = _mm_hadd_epi32(vcount, vcount);
    vcount = _mm_hadd_epi32(vcount, vcount);
    count = _mm_extract_epi32(vcount, 0);
    return count > 0 ? sum / (float)count : 0.0f;
}
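A hypothetical call site, assuming the function above is in scope (the data values are made up; buffers must be padded to a multiple of 4 elements):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float a[8] = { 1.0f, 2.0f,  NAN, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f };
        float b[8] = { 1.5f, 2.0f, 3.0f,  NAN, 5.0f, 6.5f, 7.0f, 8.0f };
        float msd = rms(a, b, 8);   /* averages over the 6 valid pairs */
        printf("mean square diff = %f, RMS = %f\n", msd, sqrtf(msd));
        return 0;
    }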
  • Thanks, but if I do it like `vmask = _mm_cmpeq_ps(vx, vx);`, the next step is to add these 4 elements into one, I think with `_mm_hadd_ps` called twice? But then I have the NaN values in my sum, so I first have to delete these NaNs out of the vector? :-/ – Roby Apr 10 '13 at 06:11
    No - use the mask to zero out the NaNs. Also don't do any horizontal addition until after the loop. You can also use the mask to count the number of valid data points. – Paul R Apr 10 '13 at 06:40
  • @Roby: see above - I've now added a full working implementation. – Paul R Apr 10 '13 at 08:54
  • I think you missed Marc Glisse's `_mm_cmpord_ps(a[i],b[i])` suggestion for checking if *either* input is NaN (see the sketch after these comments). That would save one `_mm_cmpeq_ps` and one `_mm_and_ps`. Also, this might require SSE4 to compile, but hopefully `_mm_extract_epi32(vcount, 0)` compiles to just `movd`, not `pextrd`! (You could use `_mm_cvtsi128_si32`.) Then you're left with SSSE3 for the slow horizontal sum with `phaddd` outside the loop (small code-size being [its only advantage vs. efficient shuffles](https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86)) – Peter Cordes Jul 29 '19 at 06:48
    `_mm_store_ss` is also a weird choice vs. `_mm_cvtss_f32` but hopefully that optimizes away. – Peter Cordes Jul 29 '19 at 06:49
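For reference, a sketch of the loop body rewritten along the lines of Marc Glisse's `_mm_cmpord_ps` suggestion, assuming the same surrounding code as in the answer (untested, unlike the answer's version):

    __m128 va = _mm_loadu_ps(&a[i]);
    __m128 vb = _mm_loadu_ps(&b[i]);
    /* all 1s only where neither lane is NaN - replaces the two
       self-compares and the AND */
    __m128 vmask = _mm_cmpord_ps(va, vb);
    __m128 vtmp = _mm_and_ps(_mm_sub_ps(va, vb), vmask);
    vtmp = _mm_mul_ps(vtmp, vtmp);
    vsum = _mm_add_ps(vsum, vtmp);
    vcount = _mm_sub_epi32(vcount, _mm_castps_si128(vmask));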