
I have this code:

const rack::simd::float_4 pos = phase * waveTable.mLength;
const rack::simd::int32_4 pos0 = pos;
const rack::simd::float_4 frac = pos - (rack::simd::float_4)pos0;
rack::simd::float_4 v0;
rack::simd::float_4 v1;
for (int v = 0; v < 4; v++) {
    v0[v] = waveTable.mSamples[pos0[v]];
    v1[v] = waveTable.mSamples[pos0[v] + 1]; // mSamples size is waveTable.mLength + 1 for interpolation wraparound
}

oversampleBuffer[i] = v0 + (v1 - v0) * frac;

which takes a normalized phase and linearly interpolates between two adjacent samples stored in waveTable.mSamples (one float each).

The positions are held in rack::simd::float_4, which is basically 4 aligned floats defined as an __m128. According to my benchmarks, that part of the code takes quite some time (I guess due to lots of cache misses).
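
For clarity, this is roughly the scalar equivalent per lane (names are just for illustration):

// one lane v of the SIMD code above
float phaseLane = phase[v];                            // normalized phase, in [0, 1)
float posLane = phaseLane * waveTable.mLength;
int pos0Lane = (int)posLane;                           // integer sample index (truncated)
float fracLane = posLane - (float)pos0Lane;            // fractional part for interpolation
float v0Lane = waveTable.mSamples[pos0Lane];
float v1Lane = waveTable.mSamples[pos0Lane + 1];       // mSamples has mLength + 1 entries
float outLane = v0Lane + (v1Lane - v0Lane) * fracLane; // one lane of oversampleBuffer[i]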

The project is built with -march=nocona, so I can use MMX, SSE, SSE2 and SSE3.

How would you optimize this code? Thanks

  • With the current context there is nothing to optimize. If I were to optimize this I would: 1. Write a performance test (to measure whether changes give any gain). 2. Play with optimization flags. 3. Profile the code to find bottlenecks. – Marek R Apr 13 '21 at 14:41
  • Also, bear in mind that – unless this is embedded hardware – "memory is *virtual*." – Mike Robinson Apr 13 '21 at 15:06
  • But how would you read from an array in memory (probably on the heap, since they are huge)? – markzzz Apr 13 '21 at 15:23
  • Which architecture are you targeting? On mainstream x86 (with AVX-256), the gather is hardly SIMDified, so it should not be much faster than scalar code (possibly even slower). The same applies to ARM (except maybe for the very new SVE instruction set). What are the typical values in `pos`? – Jérôme Richard Apr 13 '21 at 15:56
  • @JérômeRichard updated my question with the target platform – markzzz Apr 13 '21 at 17:13
  • Thank you. Out of curiosity: are you really running your code on the 17-year-old Nocona Intel architecture, or is it just for backward compatibility? Note that you can use `-mtune` if this is just for backward compatibility ;) . – Jérôme Richard Apr 13 '21 at 17:34
  • @JérômeRichard I'm working within an "in box" framework, which is compiled that way. To keep compatibility, the DLL must be compiled that way. – markzzz Apr 13 '21 at 19:22
  • I'd suggest `-march=nocona -mtune=haswell` (or even `-mtune=generic`). It will still run on a Nocona, but code-gen choices will tune for a modern CPU, not for P4 quirks while ignoring things that modern CPUs might care about. This includes setting thresholds for heuristics on when to inline, and (with `-fprofile-use`) when to unroll loops. – Peter Cordes Apr 13 '21 at 20:31
  • @markzzz I don't think the instruction set is part of the DLL ABI. There are a couple of issues with AVX because of the different instruction encoding, false-dependency shenanigans, and warmup latency on some CPUs, but I think SSE 4.1 is fine without any caveats. – Soonts Apr 13 '21 at 22:58
  • Is `phase` increasing (slowly), making `pos0` often constant or differing by just 1? Or is it more or less random? – chtz Apr 15 '21 at 00:49
  • @chtz it increases at freq / samplerate / oversampling, so phase = currentPhase + (i * phaseIncrementUpsampling); – markzzz Apr 15 '21 at 13:30
  • And what are typical values for `phaseIncrementUpsampling * waveTable.mLength`? (The code in your question is far away from a [mre] ...) – chtz Apr 15 '21 at 14:11
  • @chtz length is 4096. phaseIncrementUpsampling is phaseIncrement /= 8 – markzzz Apr 15 '21 at 14:57
  • Can you [edit] this information into your question? – chtz Apr 15 '21 at 15:09

1 Answer


Your code is not too efficient for multiple reasons.

  1. You’re setting individual lanes of SIMD vectors with scalar code. Processors can’t quite do that, but the compiler pretends they can. Unfortunately, these compiler-implemented workarounds are slow; typically they involve a round trip through memory and back.

  2. Generally, you should avoid writing loops with very small trip counts like 2 or 4. Sometimes the compiler unrolls them and you’re OK, but other times it doesn’t, and the CPU mispredicts way too many branches.

  3. Finally, processors can load 64-bit values with a single instruction. Since you’re loading consecutive pairs from the table, you can use a single 64-bit load instead of two 32-bit ones.

Here’s a fixed version (untested). It assumes you’re building for PCs, i.e. using SSE SIMD.

#include <smmintrin.h> // SSE 4.1 intrinsics (_mm_floor_ps, _mm_extract_epi32); pulls in SSE/SSE2/SSE3 as well
#include <cstdint>     // uint32_t

// Load a vector with rsi[ i0 ], rsi[ i0 + 1 ], rsi[ i1 ], rsi[ i1 + 1 ]
inline __m128 loadFloats( const float* rsi, int i0, int i1 )
{
    // Casting load indices to unsigned, otherwise compiler will emit sign extension instructions
    __m128d res = _mm_load_sd( (const double*)( rsi + (uint32_t)i0 ) );
    res = _mm_loadh_pd( res, (const double*)( rsi + (uint32_t)i1 ) );
    return _mm_castpd_ps( res );
}

__m128 interpolate4( const float* waveTableData, uint32_t waveTableLength, __m128 phase )
{
    // Convert wave table length into floats.
    // Consider doing that outside of the inner loop, and passing the __m128.
    const __m128 length = _mm_set1_ps( (float)waveTableLength );

    // Compute integer indices, and the fraction
    const __m128 pos = _mm_mul_ps( phase, length );
    const __m128 posFloor = _mm_floor_ps( pos );    // BTW this one needs SSE 4.1, workarounds are expensive
    const __m128 frac = _mm_sub_ps( pos, posFloor );
    const __m128i posInt = _mm_cvtps_epi32( posFloor );

    // Abuse 64-bit load instructions to load pairs of values from the table.
    // If you have AVX2, can use _mm256_i32gather_pd instead, will load all 8 floats with 1 (slow) instruction.
    const __m128 s01 = loadFloats( waveTableData, _mm_cvtsi128_si32( posInt ), _mm_extract_epi32( posInt, 1 ) );
    const __m128 s23 = loadFloats( waveTableData, _mm_extract_epi32( posInt, 2 ), _mm_extract_epi32( posInt, 3 ) );

    // Shuffle into the correct order, i.e. gather even/odd elements from the vectors
    const __m128 v0 = _mm_shuffle_ps( s01, s23, _MM_SHUFFLE( 2, 0, 2, 0 ) );
    const __m128 v1 = _mm_shuffle_ps( s01, s23, _MM_SHUFFLE( 3, 1, 3, 1 ) );

    // Finally, linear interpolation between these vectors.
    const __m128 diff = _mm_sub_ps( v1, v0 );
    return _mm_add_ps( v0, _mm_mul_ps( frac, diff ) );
}

The assembly looks good. Modern compilers even use FMA automatically when it’s available (GCC by default; clang needs -ffp-contract=fast to contract across C statements, not just within one expression).
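
To plug this into the rack::simd loop from the question, the glue could look roughly like this (untested; it assumes float_4 stores its raw __m128 in a public member `v` and is constructible from one, as VCV Rack’s simd::Vector is, and that mSamples decays to a const float*):

// Inside the oversampling loop (hypothetical glue between the question's types and interpolate4):
const __m128 result = interpolate4( waveTable.mSamples, waveTable.mLength, phase.v );
oversampleBuffer[i] = rack::simd::float_4( result );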


Just saw the update. Consider switching your target to SSE 4.1. The Steam hardware survey says its market penetration is 98.76%. If you’re still supporting prehistoric CPUs like the Pentium 4, a workaround for _mm_floor_ps is available in DirectXMath, and instead of _mm_extract_epi32 you can use _mm_srli_si128 + _mm_cvtsi128_si32.
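
For SSE2/SSE3 only, something along these lines should do (an untested sketch; DirectXMath’s XMVectorFloor is the more careful reference):

// _mm_floor_ps replacement: truncate toward zero, then subtract 1 where truncation rounded up.
// Only valid for values that fit in int32; with a non-negative `pos` the correction never fires.
inline __m128 floorSse2( __m128 x )
{
    const __m128 t = _mm_cvtepi32_ps( _mm_cvttps_epi32( x ) );  // round toward zero
    const __m128 correction = _mm_and_ps( _mm_cmpgt_ps( t, x ), _mm_set1_ps( 1.0f ) );
    return _mm_sub_ps( t, correction );
}

// _mm_extract_epi32( v, N ) replacement: _mm_srli_si128 shifts by bytes, so lane N needs a shift of 4 * N.
inline int extractLane1Sse2( __m128i v ) { return _mm_cvtsi128_si32( _mm_srli_si128( v, 4 ) ); }
inline int extractLane2Sse2( __m128i v ) { return _mm_cvtsi128_si32( _mm_srli_si128( v, 8 ) ); }
inline int extractLane3Sse2( __m128i v ) { return _mm_cvtsi128_si32( _mm_srli_si128( v, 12 ) ); }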

Even if you need to support an old baseline like only SSE3, -mtune=generic or even -mtune=haswell might be a good idea along with -march=nocona, to still make inlining and other code-gen choices that are good for a range of CPUs, not just Pentium 4.

  • Thanks for the reply :) not sure about s23: why should I extract elements at pos + 2 and + 3? Mmm... – markzzz Apr 13 '21 at 17:28
  • @markzzz The first `loadFloats` call loads a vector with table entries at `posInt[0]`, `posInt[0]+1`, `posInt[1]`, `posInt[1]+1`. The second `loadFloats` loads a vector with entries at `posInt[2]`, `posInt[2]+1`, `posInt[3]`, `posInt[3]+1`. Afterwards, the first shuffle instruction is shuffling them into `posInt[0]`, `posInt[1]`, `posInt[2]`, `posInt[3]`, the second one makes `posInt[0]+1`, `posInt[1]+1`, `posInt[2]+1`, `posInt[3]+1`. – Soonts Apr 13 '21 at 17:34
  • @Soonts Great answer. I guess the `loadFloats`+`_mm_shuffle_ps` part is clearly the most expensive. I am wondering if a simple SIMD store of `posInt` followed by several scalar loads storing data in a contiguous memory location then reloaded using a simple `_mm_load_ps` would be better for this SSE code. Maybe the instruction-level parallelism of scalar instructions could be better (or better pipelined). What do you think about this? Is this a good idea? – Jérôme Richard Apr 13 '21 at 17:44
  • 1
    @JérômeRichard You can try, but I doubt it’s gonna be faster. These things aren’t too expensive. `_mm_shuffle_ps` has 1 cycle of latency, the throughput is even better, my CPU can run 2 of them every cycle. `_mm_extract_epi32` has 3 cycles of latency. Even L1D cache is slower, I tend to avoid RAM access whenever possible. – Soonts Apr 13 '21 at 17:54
  • @Soonts Ok. Thank you. – Jérôme Richard Apr 13 '21 at 18:15
  • Awesome man! That's magic :) Got it! Now I'll try to implement it; in case, I'll come back :P Thanks for now! – markzzz Apr 13 '21 at 19:21
  • @markzzz Make 100% sure all elements of your `phase` vector are non-negative and less than `1.0f`. Otherwise, the code is going to read outside of the table, just like your original one does. – Soonts Apr 13 '21 at 19:25
  • @JérômeRichard: `movd` + 3x `pextrd` is not great (1 + 3x2 uops on Intel SnB-family). Vector store + 3x scalar reload is worth considering for the high 3 elements, after a `movd` for the low element which has lower latency, giving OoO exec something to get started on sooner. OTOH, with all the "gather" loads there's already a lot of load-port uops when you look at throughput, but also a lot of port5 uops with the shuffles. – Peter Cordes Apr 13 '21 at 20:50
  • 1
    @JérômeRichard: A good tradeoff might be `pextrq` for elements 2 and 3, to get a `uint64_t` that you split up with scalar. (`mov r32, r32` to zero extend the low half, and `shr reg, 32`). Without SSE4.1 especially, maybe `movq` + `pextrq` and then just scalar to split up those two halves. (Simple scalar instructions can run on more ports than the xmm->int transfer unit on only port 0 on Intel, so yes this can be better for back-end throughput, and I think we can pull it off with similar total front-end uops.) – Peter Cordes Apr 13 '21 at 20:53
  • 1
    @JérômeRichard: you mentioned reloading something with `_mm_load_ps`, possibly after scalar stores if that's what you meant to write? That would cause a store-forwarding stall, unlike vector store / scalar reload. It seems on Intel CPUs, multiple store-forwarding stalls can't pipeline with each other (https://easyperf.net/blog/2018/03/09/Store-forwarding#one-more-interesting-experiment), so that would be a poor choice, not just extra latency for OoO exec to hide. For the gathers, `movsd` / `movhps` to feed 2x `shufps` is 3 total shuffle uops, so 3c throughput on Haswell/SKL. – Peter Cordes Apr 13 '21 at 20:57
  • 2
    @PeterCordes Shuffles can probably be improved on Intel. If the OP will settle on SSE 4.1, can replace `_mm_loadh_pd` with `_mm_loaddup_pd` + `_mm_blend_pd`, that’s an extra instruction but I think better port usage on Skylake. – Soonts Apr 13 '21 at 22:49
  • 1
    @Soonts: Yeah, good point. It's too bad that `movhps` / `movhpd` don't decode that way in the first place on modern Intel where the load ports take care of broadcasts. But `blendpd xmm, [mem], imm` (or `blendvpd`) can't micro-fuse the memory source operand so maybe it was a choice between single-uop in the front-end vs. better back-end port distribution. (And if the opcode can sometimes need 2 uops, [that might mean it always needs the "complex" legacy decoder](https://stackoverflow.com/questions/61980149/can-the-simple-decoder#comment115735214_64036743), so Intel would want to avoid that) – Peter Cordes Apr 13 '21 at 22:56
  • @Soonts is that a typo: const __m128i posInt = _mm_cvtps_epi32( posFloor );? Shouldn't it be const __m128i posInt = _mm_cvtps_epi32( pos );? Also: shouldn't _mm_cvttps_epi32 be used instead of _mm_cvtps_epi32 (i.e. truncate towards zero)? – markzzz Apr 14 '21 at 07:29
  • @Soonts I also tried to convert _mm_extract_epi32 into _mm_srli_si128 + _mm_cvtsi128_si32, but I think I'm doing it wrong: _mm_cvtsi128_si32(_mm_srli_si128(posInt, 1)). What's wrong? It crashes at runtime... – markzzz Apr 14 '21 at 10:30
  • @markzzz Not a typo. `_mm_cvtps_epi32` doesn't necessarily round towards zero; the behavior depends on the MXCSR register, i.e. it can be anything. That's why the code calls _mm_floor_ps first, and converts that to integers. – Soonts Apr 14 '21 at 12:47
  • The second argument of _mm_srli_si128 is the shift amount in bytes, not in 32-bit lanes. To get the second lane pass 4; to get the last lane pass 12. – Soonts Apr 14 '21 at 12:48
  • Got it. Thanks! Anyway, I gained less than 1% with your code, not that much unfortunately :( – markzzz Apr 14 '21 at 13:10
  • @markzzz My code, or SSE2 port of that? :-) – Soonts Apr 14 '21 at 13:51
  • @Soonts I tried just the port :) But I don't think one API call instead of 2 will give a 10x increase :P I'll try anyway... – markzzz Apr 14 '21 at 14:50
  • @markzzz You should try the SSE 4.1 version. The rounding workaround is rather slow; it almost doubles the instruction count in that function. – Soonts Apr 14 '21 at 14:54
  • @Soonts what should I do to test those SSE 4.1 instructions? Change -march to what? And do I need to include the header for each required function (i.e. )? So I can compare the gain... – markzzz Apr 14 '21 at 17:24
  • @markzzz I would do it like that: `-O3 -march=core2 -msse4.1 -mtune=skylake` – Soonts Apr 14 '21 at 20:06
  • @Soonts tried it. 1% less; I hoped for something more :) – markzzz Apr 15 '21 at 08:20