
I'm summing a bunch of harmonics together, each with a different phase/magnitude, using vectorization (SSE2 at most as the SIMD target).

Here's my current attempt:

float output = 0.0f;
simd::float_4 freqFundamentalNormalized = freq * (1.0f / sampleRate);
simd::float_4 harmonicIndex{1.0f, 2.0f, 3.0f, 4.0f};
simd::float_4 harmonicIncrement{4.0f, 4.0f, 4.0f, 4.0f};

// harmonics
const int numHarmonicsV4 = numHarmonics / 4;
const int numHarmonicsRemainder = numHarmonics - (numHarmonicsV4 * 4);

// v4
for (int i = 0; i < numHarmonicsV4; i++) {
    // signal
    simd::float_4 sineOutput4 = simd::sin(mPhases4[i] * g2PIf) * mMagnitudes4[i];

    for (int v = 0; v < 4; v++) {
        output += sineOutput4[v];
    }

    // increments
    mPhases4[i] += harmonicIndex * freqFundamentalNormalized;
    mPhases4[i] -= simd::floor(mPhases4[i]);

    harmonicIndex += harmonicIncrement;
}

// remainder
if (numHarmonicsRemainder > 0) {
    // signal
    simd::float_4 sineOutput4 = simd::sin(mPhases4[numHarmonicsV4] * g2PIf) * mMagnitudes4[numHarmonicsV4];

    for (int v = 0; v < numHarmonicsRemainder; v++) {
        output += sineOutput4[v];
    }

    // increments
    mPhases4[numHarmonicsV4] += harmonicIndex * freqFundamentalNormalized;
    mPhases4[numHarmonicsV4] -= simd::floor(mPhases4[numHarmonicsV4]);
}

but:

  1. I think I can optimize it more, maybe with some math tricks, or by saving work on some of the increments.
  2. I don't like repeating the "same code" once for the V4 part and once for the remainder (when the number of harmonics is not divisible by 4): is there a way to apply a sort of "mask" to the last vector (for example, setting its magnitudes to 0), so that it does the same operation in the same block, but doesn't contribute to the final output?
markzzz
  • It's expensive to horizontal-sum into `output` *inside* the loop. Instead, accumulate into a sum vector and hsum once at the end. [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) – Peter Cordes Oct 02 '20 at 07:06
  • Hi mister @PeterCordes (I was really hoping for your input on this question :)) `hsum` with SSE2? How would you do this? – markzzz Oct 02 '20 at 07:10
  • The calculation is dominated by the `sin` computation. There exist tricks to calculate the sine iteratively (sin(a+b) = ...), without calling the `sin` function. In that case, you have to pay attention to the accumulation of rounding errors, and recalculate the exact sine from time to time. – Damien Oct 02 '20 at 07:11
  • Also, `sin` is very expensive compared to the rest of your code, depending on the precision vs. performance tradeoff you choose (or your library chose). `floor` is also non-trivial without SSE4.1; a general version that handles possibly large numbers can be expensive: [Calculating floor & ceil of vector2 double using pre-SSE4](https://stackoverflow.com/q/61233195). If you don't need to handle huge numbers, rolling your own floor with add/subtract of large constants can work, although IIRC that trick only directly gives you the current rounding mode (to nearest, not toward -Inf). – Peter Cordes Oct 02 '20 at 07:16
  • @markzzz: I already linked the canonical Q&A about SSE1/SSE2 horizontal sums in the comment you're replying to. My answer there is how I'd do that part. That's why I linked it. – Peter Cordes Oct 02 '20 at 07:17
  • @Damien what do you mean with sin(a+b)? sin(kx) + sin(zx) is not equal to sin(kx + zx) :O Can you show an example of what you mean? – markzzz Oct 02 '20 at 07:19
  • If you have a constant phase increment dphi, then `sin((n+1)dphi) = sin(n dphi) cos(dphi) + cos(n dphi) sin(dphi)`. Same for `cos((n+1)dphi)`. You calculate `cos(dphi)` and `sin(dphi)` only once. The iterative formula is still valid if you have a phase offset at the origin. I have used this trick billions of times for Doppler generation. – Damien Oct 02 '20 at 07:22
  • Your code looks somewhat like a naïve implementation of the inner loop of a [DST](https://en.wikipedia.org/wiki/Discrete_sine_transform) (if you actually want to calculate `output` for different offsets). You don't show how you initialize `mPhases4` in your snippet -- please check how to post a [mre]. – chtz Oct 02 '20 at 08:34
  • @Damien not sure I'm following your math. I can cache `cos(dphi)` and `sin(dphi)`, but I still need to calculate `sin(n dphi)` and `cos(n dphi)`, right? Why would this be convenient? – markzzz Oct 02 '20 at 10:40
  • You can calculate them iteratively. Look at @MSalters' answer. A complex multiplication is cheaper than a sine calculation. – Damien Oct 02 '20 at 10:53
  • @Damien: I really don't get what you are suggesting here :) Complex multiplication? In any case, I'd need to use `e^(-ki)` via Euler's formula :O – markzzz Oct 02 '20 at 12:32
  • `e^(-ki)` is a complex number. I don't really understand your point. All the needed operations can be expressed with complex numbers. It's not absolutely necessary; it's a cosmetic simplification, same operations. – Damien Oct 02 '20 at 12:37
  • @Damien but I still need to calculate exp for each phase multiplier: http://coliru.stacked-crooked.com/a/8c1c3a334e3e03b8 Are you saying that calculating exp instead of sin is cheaper? – markzzz Oct 02 '20 at 12:39
  • Again, look at MSalters' answer. You don't need to directly call `exp(i*)`; implement the iterative formula and use a simple complex multiplication for it. – Damien Oct 02 '20 at 12:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/222401/discussion-between-markzzz-and-damien). – markzzz Oct 02 '20 at 13:23
  • @Damien: notice that (as said in the question) each harmonic has its own phase. Thus, I don't have a constant phase increment. I don't think I'll take your approach here (even if I learned something new :)) – markzzz Oct 03 '20 at 07:45

2 Answers


The second part of the question is the easiest. Any harmonic with magnitude 0 does not affect the output, so you can simply zero-pad mMagnitudes4 up to a multiple of 4.

As Damien points out, sin(x) is expensive. But by Euler's formula, exp(ix) = cos(x) + i sin(x), and exp(i(x+dx)) = exp(ix) * exp(i dx). Each step is then just one complex multiplication.

MSalters
  • second part: how would you pad it faster? mMagnitudes4 is a `std::array`: let's say I have 13 harmonics, should I iterate from 13 to 15 and set 0.0f? Not sure that's efficient :) – markzzz Oct 02 '20 at 07:40
  • I mean: would you do something like this? `for (int i = 0; i < numHarmonicsRemainder; i++) { mMagnitudes4[numHarmonicsV4][i] = 0.0f; }` – markzzz Oct 02 '20 at 07:46
  • @markzzz: You set `mMagnitude[13]` through `[15]` to 0.0f *once*, where you initialize the other mMagnitude members. You eliminate the remainder part from the code shown, which is performance-critical. This means all math is now in the v4 part. Better for cache, better for branch prediction. – MSalters Oct 02 '20 at 07:49
  • about your last statement, I'm not sure I follow the math. Any link/resource about summing harmonics? – markzzz Oct 02 '20 at 10:38
  • @markzzz: It's not about "summing harmonics". It's a basic property of addition. `a+b+c+0` is just `a+b+c`. But `a+b+c+0` has 4 terms, so it can be handled by your "v4" code. – MSalters Oct 02 '20 at 10:41
  • I meant the Euler formula as the "last statement" :) – markzzz Oct 02 '20 at 10:42
  • @markzzz: You correctly noted that "sin(kx) + sin(zx) is not equal to sin (kx + zx)". Damien did have the correct formula for `sin (kx + zx)`, although he used different symbols. I tend to not remember all the sin/cos rules, but use Euler's formula to derive them. E.g. to work out the example `exp(x)*exp(dx)==(cos(x)+i sin(x)) * (cos(dx)+i sin(dx))` = `cos(x)cos(dx)-sin(x)sin(dx) + i sin(x)cos(dx) + i cos(x) sin(dx)` which gets you back to Damien's formula. – MSalters Oct 02 '20 at 10:52
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/222398/discussion-between-markzzz-and-msalters). – markzzz Oct 02 '20 at 12:06

First and foremost, make sure your implementation of simd::sin is fast. See XMVectorSin, and especially XMVectorSinEst, in the DirectXMath library for an example of how to make a fast one; or copy-paste from there, or include the library, since it's header-only. The instruction set is switchable with preprocessor macros: for optimal performance it needs SSE 4.1 and FMA3, but it will work fine with SSE2 only.

As said in comments, you should only do horizontal add once, after all iterations of the loop are complete. Until then, accumulate in a SIMD vector.

Very minor, and it might be optimized by the compiler anyway, but still: you should not access mPhases4 the way you're doing. Load the value into a vector at the start of the loop body, compute the output, increment, compute the fractional part, and store the updated value just once per iteration.

Soonts
  • yes, the simd::sin is very nice and fast ;) (i.e. sse_mathfun_sin_ps, nice for DSP). About the last part: do you mean `simd::float_4 phases4 = mPhases4[i];`, use phases4, then `mPhases4[i] = phases4;`? Well, of course, but I think the compiler already does this for me. I'll write it this way though ;) – markzzz Oct 02 '20 at 08:23
  • @markzzz “sse_mathfun_sin_ps” Did a quick benchmark: `XMVectorSin` is 16% faster with SSE2, 50% faster with SSE4, and 109% faster with FMA3. `XMVectorSinEst` is even faster than that. “about last part” Yep, that's exactly what I mean. “I think compiler already does this for me” usually, but not always; C++ defines a memory model, and compilers are only allowed to do that when they are certain no other thread is going to see that memory. – Soonts Oct 02 '20 at 09:22
  • @markzzz By the way, you're going to have a hard time finding an x86 device in working condition that doesn't support SSE3 or 4.1. SSE3 arrived in later revisions of the Pentium 4, SSE 4.1 was introduced in Core Solo/Duo, and both have been supported universally since then, including on AMD CPUs and Intel Atoms. – Soonts Oct 02 '20 at 09:29
  • Those are the specifics of the environment I'm working in... I can't decide that myself :) Anyway, not all x64 processors support SSE4. I build with -march=nocona; SSE3 is probably supported though... – markzzz Oct 02 '20 at 10:07
  • @markzzz: Even though Nocona dates back to 2004, it indeed has SSE3. But do you really need to support 16-year-old CPUs? – MSalters Oct 02 '20 at 10:15
  • @markzzz SSE3 probably isn't going to change anything; the jump from 16% to 50% is caused by roundps https://www.felixcloutier.com/x86/roundps (sin needs that to normalize the angle into [-pi .. +pi]). The SSE1 workaround takes 9 instructions and up to 4 memory references: https://github.com/microsoft/DirectXMath/blob/aug2020/Inc/DirectXMathVector.inl#L2331-L2340 – Soonts Oct 02 '20 at 10:28