If there's a closed-form formula for n steps ahead, you can use it to sidestep the serial dependency. If the n-step form can be computed with the same operations as a single step, just with different coefficients, then a broadcast of those coefficients is all you need.
In this case, `z1 = c + z1 * b`, so applying that twice we get:
```
# I'm using Z0..Zn for the elements of the series your loop calculates
Z2 = c + (c + Z0*b)*b
   = c + c*b + Z0*b^2
```
`c + c*b` and `b^2` are both constants, if I'm understanding your code correctly that the C variables really are scalar constants, not pseudocode for array references. (So everything except your `z1` is loop-invariant.)
So if we have a SIMD vector of 2 elements, starting with Z0 and Z1, we can step each of them forward by 2 to get Z2 and Z3.
```c++
#include <immintrin.h>

// mSmoothedValues: assumed to be a member (or global) double array
void SmoothBlock(int blockSize, double b, double c, double z_init) {
    // z1 = inputA0 + z1 * b1;
    __m128d zv = _mm_setr_pd(z_init, z_init*b + c);
    __m128d step2_mul = _mm_set1_pd(b*b);
    __m128d step2_add = _mm_set1_pd(c + c*b);
    for (int i = 0; i < blockSize-1; i += 2) {
        _mm_storeu_pd(mSmoothedValues + i, zv);
        zv = _mm_mul_pd(zv, step2_mul);
        zv = _mm_add_pd(zv, step2_add);
        // compile with FMA + fast-math for the compiler to fold the mul/add into one FMA
    }
    // handle the final odd element, if any
    if (blockSize % 2 != 0)
        _mm_store_sd(mSmoothedValues + blockSize - 1, zv);
}
```
With `float` + AVX (8 elements per vector), you'd have
```c++
__m256 zv = _mm256_setr_ps(z_init, c + z_init*b,
                           c + c*b + z_init*b*b,
                           c + c*b + c*b*b + z_init*b*b*b, ...);
// Z2 = c + c*b + Z0*b^2
// Z3 = c + c*b + (c + Z0*b) * b^2
//    = c + c*b + c*b^2 + Z0*b^3
```
and the add/mul factors would be for 8 steps.
Normally people use `float` for SIMD because you get twice as many elements per vector, and half the memory bandwidth / cache footprint, so you typically get a factor-of-2 speedup over `double`. (Same number of vectors / bytes per clock.)
The above loop, on a Haswell or Sandybridge CPU for example, will run at one vector per 8 cycles, bottlenecked on the latency of `mulpd` (5 cycles) + `addpd` (3 cycles). We generate 2 `double` results per vector, but that's still a huge bottleneck compared to the 1 mul and 1 add per clock those CPUs can sustain in throughput. We're missing out on a factor of 8 in throughput. (Or, if compiled with one FMA instead of mul->add, we have 5-cycle latency.)
Sidestepping the serial dependency is useful not just for SIMD: avoiding bottlenecks on FP add/mul (or FMA) latency gives further speedups, up to the ratio of FP add+mul latency to add+mul throughput.
Simply unroll more, and use multiple vectors, like `zv0`, `zv1`, `zv2`, `zv3`. This also increases the number of steps you make at once: for example, 16-byte vectors of `float`, with 4 vectors, would be 4x4 = 16 steps.