
I want to speed up my algorithm, which evaluates an objective function f(x). The problem dimension is 5000. I have already introduced a lot of improvements in the code, but the calculation time still does not meet my expectations.

Most of the data sets are allocated dynamically as (float*)_mm_malloc(N_h*sizeof(float),16). In the objective function, where "long" for loops are present, I successfully applied _mm_mul_ps, _mm_rcp_ps, _mm_store_ps, etc. on __m128 variables, and I introduced threading (_beginthreadex) to speed up the slowest code. But there is a part of the code which cannot be vectorized easily... I have attached the most problematic (slowest) part, where I still cannot make an improvement (reminder: this is part of a bigger calculation, but my problem can be seen with this snippet). I am expecting vector calculations, but I get scalar calculations for each row of code (a lot of MOVSS, MULSS, SUBSS, etc. in the assembly). Could someone give me a hint as to what the problem might be?

I am using the MinGW GCC-8.2.0-3 compiler on a Windows machine with -O3 -march=native -ffast-math.

#include <immintrin.h>
#include "math.h"
#define N_h 5000

float* x_vec;   // allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data0; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data1; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data2; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data3; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);

int main() 
{
    float* q_vec = (float*)_mm_malloc(8*sizeof(float),16);
    float* xx_vec = (float*)_mm_malloc(8*sizeof(float),16);
    float* cP_vec = (float*)_mm_malloc(8*sizeof(float),16);
    float* xPtr = x_vec;
    float* f32Ptr;
    float c0;
    int n = N_h;
    int sum = 0;

    while(n > 0)
    {
        int k=1;
        n-=8;

        cP_vec[0] = 1;
        cP_vec[1] = 1;
        cP_vec[2] = 1;
        cP_vec[3] = 1;
        cP_vec[4] = 1;
        cP_vec[5] = 1;
        cP_vec[6] = 1;
        cP_vec[7] = 1;
        //preload of x data should be done with a vector load; currently it is element-by-element MOVSS
        xx_vec[0] = *xPtr++;
        xx_vec[1] = *xPtr++;
        xx_vec[2] = *xPtr++;
        xx_vec[3] = *xPtr++;
        xx_vec[4] = *xPtr++;
        xx_vec[5] = *xPtr++;
        xx_vec[6] = *xPtr++;
        xx_vec[7] = *xPtr++;

        c0 = data0[k];
        //I am expecting a vector subtraction here, but each row generates almost the same scalar assembly code
        q_vec[0] = xx_vec[0] - c0;
        q_vec[1] = xx_vec[1] - c0;
        q_vec[2] = xx_vec[2] - c0;
        q_vec[3] = xx_vec[3] - c0;
        q_vec[4] = xx_vec[4] - c0;
        q_vec[5] = xx_vec[5] - c0;
        q_vec[6] = xx_vec[6] - c0;
        q_vec[7] = xx_vec[7] - c0;
        //if I create more internal variables for all of the multiplications, does that help?
        cP_vec[0] = cP_vec[0] * data1[k] * exp(-pow(q_vec[0], 2.0f) * data2[k]);
        cP_vec[1] = cP_vec[1] * data1[k] * exp(-pow(q_vec[1], 2.0f) * data2[k]);
        cP_vec[2] = cP_vec[2] * data1[k] * exp(-pow(q_vec[2], 2.0f) * data2[k]);
        cP_vec[3] = cP_vec[3] * data1[k] * exp(-pow(q_vec[3], 2.0f) * data2[k]);
        cP_vec[4] = cP_vec[4] * data1[k] * exp(-pow(q_vec[4], 2.0f) * data2[k]);
        cP_vec[5] = cP_vec[5] * data1[k] * exp(-pow(q_vec[5], 2.0f) * data2[k]);
        cP_vec[6] = cP_vec[6] * data1[k] * exp(-pow(q_vec[6], 2.0f) * data2[k]);
        cP_vec[7] = cP_vec[7] * data1[k] * exp(-pow(q_vec[7], 2.0f) * data2[k]);
        k++;
        f32Ptr = &data3[k];
        for (int j =1; j <= 5; j++) //the upper bound of this loop is a variable in my application, so it is not a constant
        {
            c0 = data0[k];
            //here the subtraction and multiplication are not vectorized
            q_vec[0] = (xx_vec[0] - c0) * (*f32Ptr);
            q_vec[1] = (xx_vec[1] - c0) * (*f32Ptr);
            q_vec[2] = (xx_vec[2] - c0) * (*f32Ptr);
            q_vec[3] = (xx_vec[3] - c0) * (*f32Ptr);
            q_vec[4] = (xx_vec[4] - c0) * (*f32Ptr);
            q_vec[5] = (xx_vec[5] - c0) * (*f32Ptr);
            q_vec[6] = (xx_vec[6] - c0) * (*f32Ptr);
            q_vec[7] = (xx_vec[7] - c0) * (*f32Ptr);

            q_vec[0] = (0.5f - 0.5f*erf( q_vec[0] ) );
            q_vec[1] = (0.5f - 0.5f*erf( q_vec[1] ) );
            q_vec[2] = (0.5f - 0.5f*erf( q_vec[2] ) );
            q_vec[3] = (0.5f - 0.5f*erf( q_vec[3] ) );
            q_vec[4] = (0.5f - 0.5f*erf( q_vec[4] ) );
            q_vec[5] = (0.5f - 0.5f*erf( q_vec[5] ) );
            q_vec[6] = (0.5f - 0.5f*erf( q_vec[6] ) );
            q_vec[7] = (0.5f - 0.5f*erf( q_vec[7] ) );
            //here the multiplication is not vectorized...
            cP_vec[0] = cP_vec[0] * q_vec[0];
            cP_vec[1] = cP_vec[1] * q_vec[1];
            cP_vec[2] = cP_vec[2] * q_vec[2];
            cP_vec[3] = cP_vec[3] * q_vec[3];
            cP_vec[4] = cP_vec[4] * q_vec[4];
            cP_vec[5] = cP_vec[5] * q_vec[5];
            cP_vec[6] = cP_vec[6] * q_vec[6];
            cP_vec[7] = cP_vec[7] * q_vec[7];
            f32Ptr++;
            k++;
        }
        sum += cP_vec[0];
        sum += cP_vec[1];
        sum += cP_vec[2];
        sum += cP_vec[3];
        sum += cP_vec[4];
        sum += cP_vec[5];
        sum += cP_vec[6];
        sum += cP_vec[7];
    }
    return 0;
}

You can see the assembly code on Godbolt: https://godbolt.org/z/wbkNAk


UPDATE:

I have implemented some SSE calculations. The speedup is approx. 1.10-1.15x, which is far below what I expected... Am I doing something wrong in main()?

#include <immintrin.h>
#include "math.h"
#define N_h 5000

#define EXP_TABLE_SIZE 10
static const __m128 M128_1 = {1.0, 1.0, 1.0, 1.0};

float* x_vec;   // allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data0; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data1; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data2; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);
float* data3; //allocated as: (float*)_mm_malloc(N_h*sizeof(float),16);

typedef struct ExpVar {
    enum {
        s = EXP_TABLE_SIZE,
        n = 1 << s,
        f88 = 0x42b00000 /* 88.0 */
    };
    float minX[8];
    float maxX[8];
    float a[8];
    float b[8];
    float f1[8];
    unsigned int i127s[8];
    unsigned int mask_s[8];
    unsigned int i7fffffff[8];
    unsigned int tbl[n];
    union fi {
        float f;
        unsigned int i;
    };
    ExpVar()
    {
        float log_2 = ::logf(2.0f);
        for (int i = 0; i < 8; i++) {
            maxX[i] = 88;
            minX[i] = -88;
            a[i] = n / log_2;
            b[i] = log_2 / n;
            f1[i] = 1.0f;
            i127s[i] = 127 << s;
            i7fffffff[i] = 0x7fffffff;
            mask_s[i] = mask(s);
        }

        for (int i = 0; i < n; i++) {
            float y = pow(2.0f, (float)i / n);
            fi fi;
            fi.f = y;
            tbl[i] = fi.i & mask(23);
        }
    }
    inline unsigned int mask(int x)
    {
        return (1U << x) - 1;
    }
};

inline __m128 exp_ps(__m128 x, ExpVar* expVar)
{
    __m128i limit = _mm_castps_si128(_mm_and_ps(x, *(__m128*)expVar->i7fffffff));
    int over = _mm_movemask_epi8(_mm_cmpgt_epi32(limit, *(__m128i*)expVar->maxX));
    if (over) {
        x = _mm_min_ps(x, _mm_load_ps(expVar->maxX));
        x = _mm_max_ps(x, _mm_load_ps(expVar->minX));
    }

    __m128i r = _mm_cvtps_epi32(_mm_mul_ps(x, *(__m128*)(expVar->a)));
    __m128 t = _mm_sub_ps(x, _mm_mul_ps(_mm_cvtepi32_ps(r), *(__m128*)(expVar->b)));
    t = _mm_add_ps(t, *(__m128*)(expVar->f1));

    __m128i v4 = _mm_and_si128(r, *(__m128i*)(expVar->mask_s));
    __m128i u4 = _mm_add_epi32(r, *(__m128i*)(expVar->i127s));
    u4 = _mm_srli_epi32(u4, expVar->s);
    u4 = _mm_slli_epi32(u4, 23);

    unsigned int v0, v1, v2, v3;
    v0 = _mm_cvtsi128_si32(v4);
    v1 = _mm_extract_epi16(v4, 2);
    v2 = _mm_extract_epi16(v4, 4);
    v3 = _mm_extract_epi16(v4, 6);
    __m128 t0, t1, t2, t3;

    t0 = _mm_castsi128_ps(_mm_set1_epi32(expVar->tbl[v0]));
    t1 = _mm_castsi128_ps(_mm_set1_epi32(expVar->tbl[v1]));
    t2 = _mm_castsi128_ps(_mm_set1_epi32(expVar->tbl[v2]));
    t3 = _mm_castsi128_ps(_mm_set1_epi32(expVar->tbl[v3]));

    t1 = _mm_movelh_ps(t1, t3);
    t1 = _mm_castsi128_ps(_mm_slli_epi64(_mm_castps_si128(t1), 32));
    t0 = _mm_movelh_ps(t0, t2);
    t0 = _mm_castsi128_ps(_mm_srli_epi64(_mm_castps_si128(t0), 32));
    t0 = _mm_or_ps(t0, t1);

    t0 = _mm_or_ps(t0, _mm_castsi128_ps(u4));

    t = _mm_mul_ps(t, t0);

    return t;
}

int main() 
{
    float* q_vec = (float*)_mm_malloc(8*sizeof(float),16);
    float* xx_vec = (float*)_mm_malloc(8*sizeof(float),16);
    float* cP_vec = (float*)_mm_malloc(8*sizeof(float),16);
    float* xPtr = x_vec;
    float* f32Ptr;
    __m128 c0,c1;
    __m128* m128Var1;
    __m128* m128Var2;
    float* f32Ptr1;
    float* f32Ptr2;
    int n = N_h;
    int sum = 0;
    ExpVar expVar;

    while(n > 0)
    {
        int k=1;
        n-=8;

        //cP_vec[0] = 1;
        f32Ptr1 = cP_vec;
        _mm_store_ps(f32Ptr1,M128_1);
        f32Ptr1+=4;
        _mm_store_ps(f32Ptr1,M128_1);
        //preload x data
        //xx_vec[0] = *xPtr++;
        f32Ptr1 = xx_vec;
        m128Var1 = (__m128*)xPtr;
        _mm_store_ps(f32Ptr1,*m128Var1);
        m128Var1++;
        xPtr+=4;
        f32Ptr1+=4;
        m128Var1 = (__m128*)xPtr;
        _mm_store_ps(f32Ptr1,*m128Var1);
        xPtr+=4;

        c0 = _mm_set1_ps(data0[k]);
        m128Var1 = (__m128*)xx_vec;
        f32Ptr1 = q_vec;
        _mm_store_ps(f32Ptr1, _mm_sub_ps(*m128Var1, c0) );
        m128Var1++;
        f32Ptr1+=4;
        _mm_store_ps(f32Ptr1, _mm_sub_ps(*m128Var1, c0) );
        //calc -pow(q_vec[0], 2.0f)
        f32Ptr1 = q_vec;
        m128Var1 = (__m128*)q_vec;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, *m128Var1) );
        m128Var1++;
        f32Ptr1+=4;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, *m128Var1) );
        m128Var1 = (__m128*)q_vec;
        *m128Var1 = _mm_xor_ps(*m128Var1, _mm_set1_ps(-0.0));
        m128Var1++;
        *m128Var1 = _mm_xor_ps(*m128Var1, _mm_set1_ps(-0.0));
        //-pow(q_vec[0], 2.0f) * data2[k]

        c0 = _mm_set1_ps(data2[k]);
        f32Ptr1 = q_vec;
        m128Var1 = (__m128*)q_vec;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, c0) );
        m128Var1++;
        f32Ptr1+=4;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, c0) );
        m128Var1 = (__m128*)q_vec;
        //calc exp(x)
        *m128Var1 = exp_ps(*m128Var1,&expVar);
        m128Var1++;
        *m128Var1 = exp_ps(*m128Var1,&expVar);
        //data1[k] * exp(x)
        c0 = _mm_set1_ps(data1[k]);
        f32Ptr1 = q_vec;
        m128Var1 = (__m128*)q_vec;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, c0) );
        m128Var1++;
        f32Ptr1+=4;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, c0) );
        //cP_vec[0] * data1[k] * exp(x)
        f32Ptr1 = cP_vec;
        m128Var1 = (__m128*)cP_vec;
        m128Var2 = (__m128*)q_vec;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, *m128Var2) );
        m128Var1++;
        m128Var2++;
        f32Ptr1+=4;
        _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var1, *m128Var2) );

        k++;
        for (int j =1; j <= 5; j++)
        {
            c0 = _mm_set1_ps(data0[k]);
            c1 = _mm_set1_ps(data3[k]);
            m128Var1 = (__m128*)xx_vec;
            m128Var2 = (__m128*)q_vec;
            f32Ptr1 = q_vec;
            _mm_store_ps(f32Ptr1, _mm_sub_ps(*m128Var1, c0) );
            _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var2, c1) );
            m128Var1++;
            m128Var2++;
            f32Ptr1+=4;
            _mm_store_ps(f32Ptr1, _mm_sub_ps(*m128Var1, c0) );
            _mm_store_ps(f32Ptr1, _mm_mul_ps(*m128Var2, c1) );

            q_vec[0] = (0.5f - 0.5f*erf( q_vec[0] ) );
            q_vec[1] = (0.5f - 0.5f*erf( q_vec[1] ) );
            q_vec[2] = (0.5f - 0.5f*erf( q_vec[2] ) );
            q_vec[3] = (0.5f - 0.5f*erf( q_vec[3] ) );
            q_vec[4] = (0.5f - 0.5f*erf( q_vec[4] ) );
            q_vec[5] = (0.5f - 0.5f*erf( q_vec[5] ) );
            q_vec[6] = (0.5f - 0.5f*erf( q_vec[6] ) );
            q_vec[7] = (0.5f - 0.5f*erf( q_vec[7] ) );

            cP_vec[0] = cP_vec[0] * q_vec[0];
            cP_vec[1] = cP_vec[1] * q_vec[1];
            cP_vec[2] = cP_vec[2] * q_vec[2];
            cP_vec[3] = cP_vec[3] * q_vec[3];
            cP_vec[4] = cP_vec[4] * q_vec[4];
            cP_vec[5] = cP_vec[5] * q_vec[5];
            cP_vec[6] = cP_vec[6] * q_vec[6];
            cP_vec[7] = cP_vec[7] * q_vec[7];
            k++;
        }
        sum += cP_vec[0];
        sum += cP_vec[1];
        sum += cP_vec[2];
        sum += cP_vec[3];
        sum += cP_vec[4];
        sum += cP_vec[5];
        sum += cP_vec[6];
        sum += cP_vec[7];
    }
    return 0;
}

https://godbolt.org/z/N7K6j0

D_Dog
  • How accurate do you want it to be? Specifically the `erf` and `exp` – harold Sep 16 '19 at 18:59
  • I think GCC only knows how to auto-vectorize `exp` if it has glibc's `libmvec` or other vectorized `exp(__m128)` available. Also, `pow(x, 2.0f)` is ridiculously horrible compared to `x*x`. Hopefully gcc optimizes that small constant exponent away, but I wouldn't write it that way in the first place. – Peter Cordes Sep 16 '19 at 19:00
  • What optimization options are you currently using? `-O3 -march=native -ffast-math`? If you don't see any `mulps`, then it's *not* auto-vectorizing. Anyway, the highest-performance way is probably to use your own SIMD implementation of `exp` and `erf` to give the desired tradeoff of accuracy vs. performance, especially if you can assume non-NaN and non-infinite inputs. And if you know that your input to `exp` is always negative, that might help. – Peter Cordes Sep 16 '19 at 19:04
  • I am using exactly these flags: -O3 -march=native -ffast-math. And I am expecting a vector calculation for xx_vec[] - c0 and cP_vec[]*data1[] as well. If I precalculate the exp and erf into an internal variable and then do the multiplication, will it use SSE? – D_Dog Sep 16 '19 at 19:23
  • @D_Dog perhaps, but those fancy functions are by far the slowest thing – harold Sep 16 '19 at 19:27
  • @harold, you are right. I forgot to mention I measured the performance with Intel VTune Amplifier, and the exp and erf got a lot of the calculation time. But I am not an expert with VTune, so I cannot optimize my code based on the logs from this software. – D_Dog Sep 16 '19 at 19:31
  • @Peter Cordes, thanks for the advice. I will try to implement SIMD exp and erf. I have already tried to implement look-up-table based exp and erf functions over a limited input range, but that does not fit SIMD operations well. And unfortunately the exp() part is not constant: the k variable can have different values outside the scope of this code snippet. In the main while loop I set it to 1, but maybe I should not do that... – D_Dog Sep 16 '19 at 19:35
  • @harold: Accuracy: 1e-6 or 1e-9 digits are fine. – D_Dog Sep 16 '19 at 19:39
  • @D_Dog that's quite accurate. You could use [an `exp` from here](https://stackoverflow.com/q/48863719/555045), these are quite accurate. For the `erf` maybe the [version with 1 over a 6th order polynomial to the 16th power](https://en.wikipedia.org/wiki/Error_function#Numerical_approximations) – harold Sep 16 '19 at 19:46
  • Okay, thanks, I will try it tomorrow. And so all of you think that vectorizing the other calculations does not improve anything at all? There are a lot of subtractions and multiplications. – D_Dog Sep 16 '19 at 19:55
  • Perhaps, but there is a dangerously high overhead involved when collecting scalar results into vectors, and a little less (but not zero) for the other way around. So quickly switching between calling scalar functions and doing a couple of SIMD additions is mostly not good. Once the `exp` and `erf` are SIMD though, the rest should be converted also. – harold Sep 16 '19 at 20:03
  • @D_Dog: Shuffling scalars into a vector can cost more than just doing 4x or 8x scalar mul or FMA operations. Clang might autovectorize that way for you, maybe worth trying if you're curious. But anyway, the main point is that vectorizing the exp and erf is where the big speedup lies, and then vectorizing the rest is trivial because the data is already in a vector. So it's pretty much a waste of time to try to get a compiler to auto-vectorize that part without vectorizing the expensive part. You could of course do it with intrinsics instead of tmp arrays. – Peter Cordes Sep 16 '19 at 20:19
  • Okay. Thanks for the comments. Now I fully understand your statements. I will work on the exp and erf. – D_Dog Sep 16 '19 at 20:49
  • Hi all, I added vectorization for exp function and to other calculations. Unfortunately the speedup is not satisfying :( – D_Dog Sep 18 '19 at 17:56
  • Hardly surprising; doing scalar table lookups in a table of 2^23 ints is likely to cause a lot of cache misses! That's why people normally use a polynomial approximation over the 1.0 to 2.0 range to get the mantissa, or something like that. Also, your shuffles for merging 4 floats into a single vector are pretty inefficient; that's what `_mm_unpacklo_ps`/`_mm_unpackhi_ps` are for. But efficient `exp` won't unpack to scalar at all. See for example the implementation in Agner Fog's VCL; he recently switched the license from GPL to MIT. https://agner.org/optimize/ – Peter Cordes Sep 18 '19 at 18:25
  • Thanks for the suggestion. I will try these also. – D_Dog Sep 18 '19 at 18:27
  • OP seems to have some Intel stuff installed, so maybe they have SVML and could use `-mveclibabi=svml`? – Marc Glisse Sep 19 '19 at 06:37

1 Answer


The code is super weird with all those explicit stores into memory instead of plain old variables. I've tried to make it less weird and added a vectorized erf, which is the main computation. Since I don't know what this code is supposed to do, I can't really test it except for performance, which did get better.

while (n > 0)
{
    int k = 1;
    n -= 8;

    //preload x data
    __m128 x_0 = _mm_load_ps(xPtr);
    __m128 x_1 = _mm_load_ps(xPtr + 4);
    xPtr += 8;

    __m128 c0 = _mm_set1_ps(data0[k]);
    __m128 q_0 = _mm_sub_ps(x_0, c0);
    __m128 q_1 = _mm_sub_ps(x_1, c0);
    //pow(q_vec, 2.0f)
    __m128 t_0 = _mm_mul_ps(q_0, q_0);
    __m128 t_1 = _mm_mul_ps(q_1, q_1);
    //-pow(q_vec[0], 2.0f) * data2[k]
    __m128 neg_data2k = _mm_xor_ps(_mm_set1_ps(data2[k]), _mm_set1_ps(-0.0));
    t_0 = _mm_mul_ps(t_0, neg_data2k);
    t_1 = _mm_mul_ps(t_1, neg_data2k);

    //exp(-pow(q_vec[0], 2.0f) * data2[k])
    t_0 = fast_exp_sse(t_0);
    t_1 = fast_exp_sse(t_1);
    //cP = data1[k] * exp(...)
    c0 = _mm_set1_ps(data1[k]);
    __m128 cP_0 = _mm_mul_ps(t_0, c0);
    __m128 cP_1 = _mm_mul_ps(t_1, c0);

    k++;
    for (int j = 1; j <= 5; j++)
    {
        __m128 data0k = _mm_set1_ps(data0[k]);
        __m128 data3k = _mm_set1_ps(data3[k]);
        // q = (x - data0k) * data3k;
        q_0 = _mm_mul_ps(_mm_sub_ps(x_0, data0k), data3k);
        q_1 = _mm_mul_ps(_mm_sub_ps(x_1, data0k), data3k);

        // q = 0.5 - 0.5 * erf(q)
        __m128 half = _mm_set1_ps(0.5);
        q_0 = _mm_sub_ps(half, _mm_mul_ps(half, erf_sse(q_0)));
        q_1 = _mm_sub_ps(half, _mm_mul_ps(half, erf_sse(q_1)));

        // cP = cP * q;
        cP_0 = _mm_mul_ps(cP_0, q_0);
        cP_1 = _mm_mul_ps(cP_1, q_1);
        k++;
    }

    __m128 t = _mm_add_ps(cP_0, cP_1);
    t = _mm_hadd_ps(t, t);
    t = _mm_hadd_ps(t, t);
    sum += _mm_cvtss_f32(t);
}

For erf I used:

__m128 erf_sse(__m128 x)
{
    __m128 a1 = _mm_set1_ps(0.0705230784);
    __m128 a2 = _mm_set1_ps(0.0422820123);
    __m128 a3 = _mm_set1_ps(0.0092705272);
    __m128 a4 = _mm_set1_ps(0.0001520143);
    __m128 a5 = _mm_set1_ps(0.0002765672);
    __m128 a6 = _mm_set1_ps(0.0000430638);
    __m128 one = _mm_set1_ps(1);
    __m128 p = _mm_add_ps(one,
        _mm_mul_ps(x, _mm_add_ps(a1,
            _mm_mul_ps(x, _mm_add_ps(a2,
                _mm_mul_ps(x, _mm_add_ps(a3,
                    _mm_mul_ps(x, _mm_add_ps(a4,
                        _mm_mul_ps(x, _mm_add_ps(a5,
                            _mm_mul_ps(x, a6))))))))))));
    p = _mm_mul_ps(p, p);
    p = _mm_mul_ps(p, p);
    p = _mm_mul_ps(p, p);
    p = _mm_mul_ps(p, p);
    return _mm_sub_ps(one, _mm_div_ps(one, p));
}

I'm not too sure about this one; it's just a formula from Wikipedia transcribed into SSE intrinsics, using Horner's scheme to evaluate the polynomial. There is probably a better way.
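
As a quick sanity check, the approximation can be compared against the standard erff over the non-negative range it targets. The harness below is a minimal sketch added for reference (not part of the original answer); it assumes the erf_sse shown above is in scope.

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

// Standalone accuracy probe: measures the worst absolute error of erf_sse
// against libm's erff for x in [0, 4). The formula behind erf_sse is meant
// for x >= 0, so negative inputs are deliberately not tested here.
int main()
{
    float max_err = 0.0f;
    for (float x = 0.0f; x < 4.0f; x += 0.001f) {
        float approx = _mm_cvtss_f32(erf_sse(_mm_set1_ps(x)));
        float err = fabsf(approx - erff(x));
        if (err > max_err) max_err = err;
    }
    printf("max abs error of erf_sse on [0,4): %g\n", max_err);
    return 0;
}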

For fast_exp_sse I used the usual combination of exponent extraction and a polynomial approximation. Going through a huge lookup table is a good way to ruin the SIMD gains.
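
The answer does not include the fast_exp_sse it calls. The following is a minimal sketch of that idea added for reference, not the actual implementation used; the clamp bounds and the degree-6 Taylor polynomial for exp on [-ln2/2, ln2/2] are choices of this sketch.

#include <immintrin.h>

// exp(x) = 2^n * exp(y), with n = round(x/ln2) and y = x - n*ln2, so |y| <= ln2/2.
// The 2^n factor is built directly from the float exponent field; the clamp
// keeps the biased exponent in the normal range.
static inline __m128 fast_exp_sse(__m128 x)
{
    x = _mm_min_ps(x, _mm_set1_ps( 87.0f));
    x = _mm_max_ps(x, _mm_set1_ps(-87.0f));

    const __m128 log2e = _mm_set1_ps(1.44269504f);
    const __m128 ln2   = _mm_set1_ps(0.69314718f);
    __m128i n = _mm_cvtps_epi32(_mm_mul_ps(x, log2e));               // round to nearest
    __m128  y = _mm_sub_ps(x, _mm_mul_ps(_mm_cvtepi32_ps(n), ln2));  // remainder

    // exp(y) via a degree-6 Taylor polynomial in Horner form.
    __m128 p = _mm_set1_ps(1.0f / 720.0f);
    p = _mm_add_ps(_mm_mul_ps(p, y), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, y), _mm_set1_ps(1.0f / 24.0f));
    p = _mm_add_ps(_mm_mul_ps(p, y), _mm_set1_ps(1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, y), _mm_set1_ps(0.5f));
    p = _mm_add_ps(_mm_mul_ps(p, y), _mm_set1_ps(1.0f));
    p = _mm_add_ps(_mm_mul_ps(p, y), _mm_set1_ps(1.0f));

    // Multiply by 2^n: the bit pattern of 2^n is (n + 127) << 23.
    __m128i pow2n = _mm_slli_epi32(_mm_add_epi32(n, _mm_set1_epi32(127)), 23);
    return _mm_mul_ps(p, _mm_castsi128_ps(pow2n));
}

This should be roughly in the 1e-6 accuracy ballpark discussed in the comments; the implementations linked there, or the one in Agner Fog's VCL, are tighter, tuned alternatives.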

harold
  • Thanks for the suggestion, at the weekend I will try it out. But I can already see a gain in performance just by looking at your implementation. Thanks again. – D_Dog Sep 20 '19 at 07:12
  • I have to work on your erf_sse() function a little bit. What is the best approach to limit/saturate the x input vector values? If x contains values which are not in the range of the fit, we get very inaccurate results. Luckily the erf function can be approximated well with a polynomial in the range of -2 to 2, and beyond those bounds erf(x) shall be 1 or -1. So how shall I do this? How can I saturate a float value above a selected number? Shall I convert my problem to fixed point for better performance and do the arithmetic with bitwise operations? – D_Dog Sep 23 '19 at 08:45
  • @D_Dog `_mm_min_ps` and `_mm_max_ps` are handy for FP saturation, there's also a trick to `abs` away the sign, apply `min`, then put the sign back with bitwise ops. 16 bit fixpoint might be good for speed, for 32 bit take into account that multiplication will be very slow – harold Sep 23 '19 at 09:01 [a sketch of this trick follows the comments]
  • Thanks for the suggestion, yeah it is really simple with these two. And I will try the trick with the sign, because the erf_sse approximation is inaccurate with negative numbers, and the fitting could be better/faster if we fit only to one region. Thanks for the idea! – D_Dog Sep 23 '19 at 09:16
  • Does it make sense to declare the variables beforehand as global/static variables? In your implementation a lot of variables are defined inside the loop, in the scope of the loop. Would it be faster if they were defined as global variables, or is the compiler already doing some kind of optimization on the scope of the variables? – D_Dog Sep 24 '19 at 09:34
  • @D_Dog it's OK like this, most of the variables will be in registers (unless they run out) and the rest would get allocated at function entry. Global variables may force memory operations. [static is bad](https://stackoverflow.com/q/52139380/555045) – harold Sep 24 '19 at 09:36
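
For reference, here is a minimal sketch of the abs/clamp/restore-sign trick described in the comments above. It is an added illustration, not code from either commenter; it assumes the erf_sse from the answer and clamps |x| at 4.0f, where erf is already ~1 to single precision.

#include <immintrin.h>

// erf is odd: erf(x) = sign(x) * erf(|x|). Strip the sign, clamp |x| to the
// fitted range, evaluate the approximation, then OR the sign bit back in.
static inline __m128 erf_sse_signed(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);   // only the sign bit set
    __m128 sign = _mm_and_ps(x, sign_mask);        // remember the signs
    __m128 ax   = _mm_andnot_ps(sign_mask, x);     // |x|
    ax = _mm_min_ps(ax, _mm_set1_ps(4.0f));        // saturate large inputs
    __m128 r = erf_sse(ax);                        // approximation valid for x >= 0
    return _mm_or_ps(r, sign);                     // restore the signs
}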