
The following code is used to compute an FIR filter:

    void Fir(float* pIn, float* pOut, float* pCoeff, float* pStage, uint32_t N, uint32_t FilterLength)
{
    int n, k;
    float* pSrc;
    float* pCoeffSrc = pCoeff;
    float* pDst = pOut;
    float s0, s1, s2, s3;
    __m128 Vec, Mul;
    __m128 Sum0,Sum1,Sum2,Sum3;

    __m128 Zero = _mm_set_ps1(0);
    memcpy(&pStage[FilterLength - 1], pIn, N * sizeof(float));

    for (n = 0; n < N; n+=4)
    {
        //Sum0
        pSrc = &pStage[n];
        Sum0 = _mm_set_ps1(0);
        pCoeffSrc = pCoeff;

        for (k = 0; k < FilterLength >> 2; k++)
        {
            __m128 Coeff = _mm_load_ps(pCoeffSrc);
            Vec = _mm_load_ps(pSrc); 
            Sum0  = _mm_fmadd_ps(Coeff, Vec, Sum0);
            pCoeffSrc += 4;
            pSrc += 4;
        }

        Sum0 = _mm_hadd_ps(Sum0, Zero);
        Sum0 = _mm_hadd_ps(Sum0, Zero);

        //Sum1
        pSrc = &pStage[n+1];
        Sum1 = _mm_set_ps1(0);
        pCoeffSrc = pCoeff;

        for (k = 0; k < FilterLength >> 2; k++)
        {
            __m128 Coeff = _mm_load_ps(pCoeffSrc);
            Vec = _mm_load_ps(pSrc);
            Sum1 = _mm_fmadd_ps(Coeff, Vec, Sum1);
            pCoeffSrc += 4;
            pSrc += 4;
        }

        Sum1 = _mm_hadd_ps(Sum1, Zero);
        Sum1 = _mm_hadd_ps(Sum1, Zero);

        //Sum2
        pSrc = &pStage[n+2];
        Sum2 = _mm_set_ps1(0);
        pCoeffSrc = pCoeff;

        for (k = 0; k < FilterLength >> 2; k++)
        {
            __m128 Coeff = _mm_load_ps(pCoeffSrc);
            Vec = _mm_load_ps(pSrc);
            Sum2 = _mm_fmadd_ps(Coeff, Vec, Sum2);
            pCoeffSrc += 4;
            pSrc += 4;
        }

        Sum2 = _mm_hadd_ps(Sum2, Zero);
        Sum2 = _mm_hadd_ps(Sum2, Zero);

        //Sum3
        pSrc = &pStage[n+3];
        Sum3 = _mm_set_ps1(0);
        pCoeffSrc = pCoeff;

        for (k = 0; k < FilterLength >> 2; k++)
        {
            __m128 Coeff = _mm_load_ps(pCoeffSrc);
            Vec = _mm_load_ps(pSrc);
            Sum3 = _mm_fmadd_ps(Coeff, Vec, Sum3);
            pCoeffSrc += 4;
            pSrc += 4;
        }

        Sum3 = _mm_hadd_ps(Sum3, Zero);
        Sum3 = _mm_hadd_ps(Sum3, Zero);

        Vec = _mm_set_ps(Sum3.m128_f32[0], Sum2.m128_f32[0], Sum1.m128_f32[0], Sum0.m128_f32[0]);
        _mm_store_ps(pDst, Vec);
        pDst+=4;
    }
}

The result of each of the four inner loops is a scalar: the horizontal sum of a vector. Then I build a vector from the 4 scalars with:

Vec = _mm_set_ps(Sum3.m128_f32[0], Sum2.m128_f32[0], Sum1.m128_f32[0], Sum0.m128_f32[0]);

Vec is stored in RAM by: _mm_store_ps(pDst, Vec);

Can I optimize this code?

Thank you, Zvika

  • The part you asked about in the title is a duplicate of [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764). The dot-product in the inner loop is a duplicate of [AVX2: Computing dot product of 512 float arrays](https://stackoverflow.com/q/59494745) and [Unrolling FP loops with multiple accumulators](https://stackoverflow.com/q/45113527) (unless the iteration count is small enough for out-of-order exec to see across outer loop iterations. Otherwise it bottlenecks on FMA latency, not load throughput). – Peter Cordes Apr 14 '23 at 05:50
  • But really, you'd want to look at both parts of your problem together: unroll the *outer* loop, doing maybe 4, 8, or 12 vectors in parallel to hide FMA latency and reuse the same coefficient vector for multiple vectors of data. Then you could usefully use `_mm_hadd_ps` to get four contiguous floats to store into `pDst`. (Or use 256-bit vectors for twice the throughput.) With more FMAs per load of a coefficient vector, you're gaining throughput, getting closer to the bottleneck of 2x 256-bit FMAs per clock on modern x86 CPUs. – Peter Cordes Apr 14 '23 at 05:55
  • So really your question title isn't a useful description of the real problem. Based on the title it's a simple duplicate of [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764). But that would not be an efficient way to deal with your whole function. (Also, aren't there algorithmic optimizations for FIR filters, like an FFT to make it O(n log n) instead of O(N^2), or whatever the complexity of this brute-force way is?) – Peter Cordes Apr 14 '23 at 05:56
  • Hi Peter. Thank you very much for your replies. I added 4 inner loops. The result of each loop is Sum0, Sum1, Sum2, Sum3. Those 4 numbers are constructed into __m128 and then written to RAM. Can I further optimize it ? – Zvi Vered Apr 14 '23 at 06:18
  • /facepalm. I meant `Sum0 = _mm_hadd_ps(Sum0, Sum1);` / `Sum2 = _mm_hadd_ps(Sum2, Sum3)` / `Sum0 = _mm_hadd_ps(Sum0, Sum2)` which should hsum each of 4 vectors into elements of one vector. Also, four separate loops do nothing to hide FP latency or to reuse the same `__m128 coef = _mm_load_ps(pCoeffSrc);` across multiple FMAs. You *just* unrolled without doing any of the optimizations I suggested. – Peter Cordes Apr 14 '23 at 06:34
  • Also, wait a minute; `pSrc = &pStage[n];` with `n++` in the outer loop only increments by one float, so it can't be aligned every time. The only reason this doesn't crash on a misaligned `_mm_load_ps` is that MSVC uses `movups` even for `_mm_load_ps`, not just `_mm_loadu_ps`. (And/or it might make it a memory source operand for FMA, which requires a VEX encoding so doesn't require alignment.) You should be using `_mm_loadu_ps`. Or preferably `_mm256_loadu_ps`, but if you are going to use 128-bit vectors, you can manufacture the sliding window from 2 loads and some shuffles like `palignr` – Peter Cordes Apr 14 '23 at 06:37
  • Is it useful to broadcast-load from one array while doing normal loads from the other? That might let you avoid the hsum at the end, if that can get elements from adjacent dot products into the elements of a Sum vector. (AVX does 32-bit broadcast-loads for free, not costing an extra shuffle uop.) You'd still want to unroll over multiple sum vectors. – Peter Cordes Apr 14 '23 at 09:10
  • How large are `N` and `FilterLength` typically? Is `N` always a multiple of 4, or do you not care about the border? Is (the beginning of) `pStage` pre-filled with zeroes? – chtz Apr 14 '23 at 12:52
  • For example: N=156, FilterLength=133 – Zvi Vered Apr 14 '23 at 15:54
  • Hi Peter. Can you please explain: "unroll the outer loop, doing maybe 4, 8, or 12 vectors in parallel to hide FMA latency and reuse the same coefficient vector for multiple vectors of data" ? Can you please provide a simple demo code that explains how to do "4,8 or 12" vectors in parallel ? Of course it does not need to be a FIR code. – Zvi Vered Apr 14 '23 at 18:30
  • With a Filter that large compared to `N`, you could already consider doing an FFT-based approach, i.e., `pOut = iFFT(FFT(pIn) * FFT(pCoeff))`, where `*` is an element-wise product and `pIn` and `pCoeff` need to be padded with zeroes so the transform length is at least `156+133 - 1 = 288` (to avoid circular overlap), and ideally has a size which is the product of small prime factors (in this case `288 = 32*9` is very good; in other cases, you may have to pad with more zeros). – chtz Apr 14 '23 at 18:51
  • Don't forget to use @username when replying to someone. Anyway, I already linked you an example of unrolling a dot product, as in [AVX2: Computing dot product of 512 float arrays](https://stackoverflow.com/q/59494745). When doing that, it should be obvious that the same coefficients vector can be reused. – Peter Cordes Apr 14 '23 at 19:33
  • If you want to make this question about the whole function, change the title. For now it's still about the haddps part, so posting an optimized version of the whole function wouldn't be a good fit for the question. – Peter Cordes Apr 14 '23 at 19:43
  • it might be better to use Intel IPP library – Pavel Gridin Apr 14 '23 at 20:35
  • Hi Pavel. Thank you for the comment. I tried using ippsFIRSR_32f instead of my code. Unfortunately the run time is almost the same. This is a big surprise because my code is naive. – Zvi Vered Apr 15 '23 at 05:47
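
A minimal sketch of the merge suggested in the comments above (untested; it assumes SSE3 and reuses the `Sum0`..`Sum3` accumulators and `pDst` from the code in the question): the four per-sample reductions can be combined with three `_mm_hadd_ps` calls instead of extracting scalars and rebuilding a vector with `_mm_set_ps`.

    // Sketch: replaces the scalar extraction + _mm_set_ps at the end of the outer loop.
    // _mm_hadd_ps(a, b) returns { a0+a1, a2+a3, b0+b1, b2+b3 }.
    __m128 S01 = _mm_hadd_ps(Sum0, Sum1);  // pair sums of Sum0 and Sum1
    __m128 S23 = _mm_hadd_ps(Sum2, Sum3);  // pair sums of Sum2 and Sum3
    __m128 Res = _mm_hadd_ps(S01, S23);    // { hsum(Sum0), hsum(Sum1), hsum(Sum2), hsum(Sum3) }
    _mm_storeu_ps(pDst, Res);              // four contiguous output samples
    pDst += 4;

This only removes the scalar extraction; it does not address the other points raised in the comments (unaligned `_mm_load_ps`, and no reuse of the coefficient vector across the four inner loops).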

2 Answers


The following version is faster but still not fully optimized. Added value: no memcpy, and the inner loop is shorter; its trip count depends on the index of the outer loop.

    void Fir(float* pIn, float* pOut, float* pCoeff, uint32_t N, uint32_t FilterLength)
{
    int n, k;
    float* pSrc;
    float* pCoeffSrc = pCoeff;
    float* pDst = pOut;
    __m128 Vec, Mul;
    __m128 Sum;
    __m128 Zero = _mm_set_ps1(0);
    uint32_t Offset;

    for (n = 0; n < N; n++)
    {
        pSrc = pIn;
        Sum = _mm_set_ps1(0);
        Offset = FilterLength - 1 - n;
        pCoeffSrc = pCoeff + Offset;

        for (k= Offset; k<FilterLength; k+=4)
        {
            __m128 Coeff = _mm_load_ps(pCoeffSrc);
            Vec = _mm_load_ps(pSrc); 
            Sum  = _mm_fmadd_ps(Coeff, Vec, Sum);
            pCoeffSrc += 4;
            pSrc += 4;
        }

        Sum = _mm_hadd_ps(Sum, Zero);
        Sum = _mm_hadd_ps(Sum, Zero);
        *pDst = Sum.m128_f32[0];
        pDst++;
    }
}

I'm aware _mm_load_ps does not work on unaligned addresses. To be continued.
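
For reference, this is the inner-loop change the comments suggest (a sketch, not benchmarked): use the unaligned-load intrinsic, since `pCoeff + Offset` is not 16-byte aligned in general.

    __m128 Coeff = _mm_loadu_ps(pCoeffSrc);  // unaligned load: pCoeffSrc = pCoeff + Offset
    Vec = _mm_loadu_ps(pSrc);                // also safe if pIn itself is not 16-byte aligned
    Sum = _mm_fmadd_ps(Coeff, Vec, Sum);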

  • I unrolled the outer loop to 12 (from "pSrc=pIn" to "pDst++"). It works 25% faster. Thank you for the tip Peter. – Zvi Vered Apr 15 '23 at 06:44

The following code runs 3x faster than the previous one. Major change: __m256 instead of __m128.

    void Fir(float* pIn, float* pOut, float* pCoeff, uint32_t N, uint32_t FilterLength)
{
    int n, k;
    float* pSrc;
    float* pCoeffSrc = pCoeff;
    float* pDst = pOut;

    __m256 Vec, Mul;
    __m256 Sum;
    __m256 Zero = _mm256_setr_ps(0, 0, 0, 0, 0, 0, 0, 0);
    uint32_t Offset;

    for (n = 0; n < N; n++)
    {
        pSrc = pIn;
        Sum = _mm256_setr_ps(0,0,0,0,0,0,0,0);
        Offset = FilterLength - 1 - n;
        pCoeffSrc = pCoeff + Offset;

        for (k = Offset; k < FilterLength; k += 8)
        {
            __m256 Coeff = _mm256_load_ps(pCoeffSrc);
            Vec = _mm256_load_ps(pSrc);
            Sum = _mm256_fmadd_ps(Coeff, Vec, Sum);
            pCoeffSrc += 8;
            pSrc += 8;
        }

        *pDst = Sum.m256_f32[0] + Sum.m256_f32[1] + Sum.m256_f32[2] + Sum.m256_f32[3] + Sum.m256_f32[4] + Sum.m256_f32[5] + Sum.m256_f32[6] + Sum.m256_f32[7];
        pDst++;
    }
}

I'm quite sure this code is still not fully optimized, e.g. the horizontal sum of the 8 elements.
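
One way to replace the element-by-element sum is the horizontal-sum reduction linked in the first comment (untested sketch, assumes AVX):

    // Reduce the 8 lanes of Sum to one float: add the upper 128-bit half onto
    // the lower half, then finish the 4-lane reduction with shuffles.
    __m128 lo   = _mm256_castps256_ps128(Sum);            // lanes 0..3
    __m128 hi   = _mm256_extractf128_ps(Sum, 1);          // lanes 4..7
    __m128 s    = _mm_add_ps(lo, hi);                     // 4 partial sums
    __m128 shuf = _mm_movehdup_ps(s);                     // { s1, s1, s3, s3 }
    s           = _mm_add_ps(s, shuf);                    // { s0+s1, -, s2+s3, - }
    s           = _mm_add_ss(s, _mm_movehl_ps(shuf, s));  // total in lane 0
    *pDst       = _mm_cvtss_f32(s);

The bigger win, according to the comments, is unrolling the outer loop so one coefficient load feeds several independent accumulators. A hypothetical sketch (it goes back to the zero-padded pStage buffer from the question, so all four output samples can reuse the same coefficient vector; tail handling for FilterLength % 8 and N % 4 is omitted):

    for (n = 0; n + 3 < (int)N; n += 4)                   // 4 output samples per outer iteration
    {
        __m256 Acc0 = _mm256_setzero_ps();
        __m256 Acc1 = _mm256_setzero_ps();
        __m256 Acc2 = _mm256_setzero_ps();
        __m256 Acc3 = _mm256_setzero_ps();

        for (k = 0; k + 7 < (int)FilterLength; k += 8)
        {
            __m256 Coeff = _mm256_loadu_ps(pCoeff + k);   // loaded once, used four times
            Acc0 = _mm256_fmadd_ps(Coeff, _mm256_loadu_ps(pStage + n     + k), Acc0);
            Acc1 = _mm256_fmadd_ps(Coeff, _mm256_loadu_ps(pStage + n + 1 + k), Acc1);
            Acc2 = _mm256_fmadd_ps(Coeff, _mm256_loadu_ps(pStage + n + 2 + k), Acc2);
            Acc3 = _mm256_fmadd_ps(Coeff, _mm256_loadu_ps(pStage + n + 3 + k), Acc3);
        }
        // horizontal-sum Acc0..Acc3 into pOut[n .. n+3] as shown above
    }

Four independent accumulators keep several FMAs in flight, hiding FMA latency, and each coefficient vector is loaded once instead of four times; the comments suggest 8 or 12 accumulators may be even better if registers allow.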

  • Are you sure this works and does the same thing as your original? Your original code did `pSrc = &pStage[n];` and `pCoeffSrc = pCoeff;` in the outer loop, so it used the same coefficients every time for a different starting point. Now you're starting from the start of `pIn` every time, but with a different offset into the coefficients. Did you get that backwards, swapping pIn and pCoeff in the caller? – Peter Cordes Apr 28 '23 at 12:35