Using % with SSE2?

Question

Here's the code I'm trying to convert to SSE2:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    // some other code (that will use phase)

    phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI);

    while (phase >= TWOPI) { phase -= TWOPI; }
}

Here's what I've achieved:

double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
double *left = audioLeft;
double *right = audioRight;
double phase = 0.0;
double bp0 = mNoteFrequency * mHostPitch;

__m128d v_boundLower = _mm_set1_pd(0.0);
__m128d v_boundUpper = _mm_set1_pd(PI);
__m128d v_bp0 = _mm_set1_pd(bp0);
__m128d v_radiansPerSample = _mm_set1_pd(mRadiansPerSample);

__m128d v_phase = _mm_set1_pd(phase);
__m128d v_pB = _mm_load_pd(pB);
__m128d v_pC = _mm_load_pd(pC);
__m128d v_result = _mm_mul_pd(v_bp0, v_pB);
v_result = _mm_add_pd(v_result, v_pC);
v_result = _mm_mul_pd(v_result, v_radiansPerSample);
v_result = _mm_max_pd(v_result, v_boundLower);
v_result = _mm_min_pd(v_result, v_boundUpper);

for (int sampleIndex = 0; sampleIndex < roundintup8(blockSize); sampleIndex += 8, pB += 8, pC += 8) {
    // some other code (that will use v_phase)

    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 2);
    v_pC = _mm_load_pd(pC + 2);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 4);
    v_pC = _mm_load_pd(pC + 4);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 6);
    v_pC = _mm_load_pd(pC + 6);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);
    v_phase = _mm_add_pd(v_phase, v_result);

    v_pB = _mm_load_pd(pB + 8);
    v_pC = _mm_load_pd(pC + 8);
    v_result = _mm_mul_pd(v_bp0, v_pB);
    v_result = _mm_add_pd(v_result, v_pC);
    v_result = _mm_mul_pd(v_result, v_radiansPerSample);
    v_result = _mm_max_pd(v_result, v_boundLower);
    v_result = _mm_min_pd(v_result, v_boundUpper);

    // ... fmod?
}

But I'm not really sure how to replace while (phase >= TWOPI) { phase -= TWOPI; } (which is basically a classic fmod in C++).

Any fancy intrinsics? Can't find any on this list. Division + some sort of rocket bit-shifting?

How often do you expect to apply this correction though, and how many two pis do you need to subtract when it does happen? Maybe compare and subtract is still the way to go. (Not that I know SSE2.) — Rup, Jan 02 '19 at 10:18
OK looking more closely I can answer my own question: you're adding at most one pi per iteration, so you should only need to subtract one lot of two pi at most every other iteration. However as far as I can see, phase is not an input into the calculation step, it's just something you accumulate, so you may not even need to correct this inside the loop but instead could just do a single mod 2 pi at the end? Assuming I suppose the block size isn't so large that the total phase is much greater than 2pi and so you'd lose precision by postponing the divide. — Rup, Jan 02 '19 at 10:41
@Rup: `// some other code` will use phase, that's why I would need to check it at every phase increment ;) — markzzz, Jan 02 '19 at 11:11
Are you even sure auto-vectorization does not do it for you? Did you check generated code when you enable SSE2 on your original source (eg, with `-msse2` on g++)? — spectras, Jan 02 '19 at 11:15
@spectras: I'm using MSCV with all optimization enabled, but autovector there its terrible, and also I'm learning vectorization ;) — markzzz, Jan 02 '19 at 11:17
That `while` could be changed into an `if` because `phase` will never be more than 3pi at that point. — interjay, Jan 02 '19 at 11:18
@markzzz I see. You can still use godbolt to check what gcc does and understand (and perhaps improve) upon it. — spectras, Jan 02 '19 at 12:05

score 4 · Accepted Answer · edited Jan 02 '19 at 13:55

4

As comments are saying, it looks like in this you can make it just a masked subtract with a compare + andpd. This works as long as you can never be more than one subtract away from getting back into the desired range.

Like

const __m128d v2pi = _mm_set1_pd(TWOPI);


__m128d needs_range_reduction = _mm_cmpge_pd(vphase, v2pi);
__m128d offset = _mm_and_pd(needs_range_reduction, v2pi);  // 0.0 or 2*Pi
vphase = _mm_sub_pd(vphase, offset);

To implement an actual (slow) fmod without worrying too much about the last few bits of the significand, you'd do integer_quotient = floor(x/y) (or maybe rint(x/y) or ceil), then x - y * integer_quotient. floor / rint / ceil are cheap with SSE4.1 _mm_round_pd or _mm_floor_pd(). That will give you the remainder, which can be negative just like with integer division.

I'm sure there are numerical techniques that better avoid rounding error before the catastrophic cancellation from subtracting two nearby numbers. If you care about precision, go check. (Using double vectors when you don't care much about precision is kind of silly; might as well use float and get twice as much work done per vector). If the input is a lot larger than the modulus, there's an unavoidable loss of precision, and minimizing rounding error in the temporary is probably very important. But otherwise precision will only be a problem unless you care about relative error in results very close to zero when x is almost an exact multiple of y. (Near-zero result, the only the bottom few bits of the significand are left for precision.)

Without SSE4.1, there are tricks like adding then subtracting a large enough number. Converting to integer and back is even worse for pd, because the packed-conversion instruction decodes to some shuffle uops as well. Not to mention that a 32-bit integer doesn't cover the full range of double, but you're screwed for range-reduction precision if your input was that huge.

If you have FMA, you can avoid rounding error in the y * integer_quotient part of the multiply and sub. _mm_fmsub_pd.

edited Jan 02 '19 at 13:55

llllllllll

16,169
4
31
54

answered Jan 02 '19 at 12:50

Peter Cordes

328,167
45
605
847

my whole code above I think its totally wrong. `phase` is comulative for each step, instead I start with both value of `__m128d v_phase` with `_mm_set1_pd(phase);` (both equal, which is wrong). – markzzz Jan 02 '19 at 14:55
@markzzz: sounds a lot like your previous questions, where you want to run a vector of sequential counters. start with `phase` and `phase+step`, and increment by `set1(step*2)` each time. If you don't have (or can factor out) a serial dependency between two vector elements, you're all set. Of course, that makes the max increment 2*pi, I think, so you might need to set a tighter bound or have an alternate loop that repeats the conditional subtract in case it needs to be done multiple times. (e.g. if you unroll even more so you can add up to `4*pi`) – Peter Cordes Jan 02 '19 at 15:00
sure! The problem here is that `step` its not fixed, since it can vary on each iteration :O – markzzz Jan 02 '19 at 15:02
@markzzz: ah right, I just looked at your code again. It's not exactly a serial dependency, but stepping by 2 depends on both intervening elements. You could calculate redundantly with overlapping unaligned loads, I think, and use `pC[i+0]+pC[i+1]` in one element and `pC[i+1] + `pC[i+2]` in another. Same for `pB`. (Your formula, `(mRadiansPerSample * bp0) * pB[i] + mRadiansPerSample * pC[i]` is linear, so I think adding the pC and pB inputs before multiplying will give the same result.) – Peter Cordes Jan 02 '19 at 15:19
It will be somethings like this `phase[0] = (radbp * pB[0] + rad * pC[0]) phase[1] = (radbp * pB[1] + rad * pC[1]) + (radbp * pB[0] + rad * pC[0]) phase[2] = (radbp * pB[2] + rad * pC[2]) + (radbp * pB[1] + rad * pC[1]) phase[3] = (radbp * pB[3] + rad * pC[3]) + (radbp * pB[2] + rad * pC[2]) ...` I believe. But how can you add somethigs which is not aligned? I need to store each block to a single variable, can't store two different result at once.... – markzzz Jan 02 '19 at 16:10
*But how can you add somethigs which is not aligned?* `_mm_load_pd(pC+i)` and `_mm_loadu_pd(pC + i+1)`. – Peter Cordes Jan 03 '19 at 00:21
I'm opening a new question, I'm tripping around it :) – markzzz Jan 03 '19 at 11:22

Using % with SSE2?

1 Answers1

Linked