
Here's a snippet of code I have:

for (int oscIndex = 0; oscIndex < kNumOscs; oscIndex++) {
    for (int voiceIndex = 0; voiceIndex < numVoices; voiceIndex += 4) {
        const int v = voiceIndex / 4;

        // vol
        osc[oscIndex][v] = _mm_mul_ps(osc[oscIndex][v], vol[oscIndex][v]);

        // prev output
        mPrevOutput[oscIndex][v] = osc[oscIndex][v];

        // out
        osc[oscIndex][v] = _mm_mul_ps(osc[oscIndex][v], out[oscIndex][v]);
    }
}

Is it correct to copy the values into mPrevOutput this way, or would a single memcpy be faster?

mPrevOutput and osc have the same size (in this case kNumOscs = 4 oscillators x numVoices = 16 voices, stored as 4 x 4 __m128).

I'm on a 64-bit Windows machine, compiling with FLAGS += -O3 -march=nocona -funsafe-math-optimizations

This is how they are defined:

alignas(16) std::array<std::array<__m128, 4>, kNumOscs> mPrevOutput; // member of a class
__m128 osc[4][4]; // local, declared each time the class's function is executed
markzzz
  • What's the declaration of `osc` and `mPrevOutput`? – Sebastian Redl Sep 02 '21 at 07:59
  • @SebastianRedl added the details. – markzzz Sep 02 '21 at 08:03
  • So generally speaking, the answer to the question "is X faster than Y" is *always* "write both and use a profiler to find out". – Sebastian Redl Sep 02 '21 at 08:10
  • You're tuning for 64-bit Pentium 4? (`-march=nocona` implies `-mtune=nocona`). If you want nocona as a feature baseline, you could use `-march=nocona -mtune=haswell`, or even `-mtune=generic`, since arch=nocona doesn't exclude any old CPUs that tune=generic cares about. (By contrast, `-march=haswell -mtune=generic` would be [a less-good choice](https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd) because generic tuning options care about a lot of CPUs that don't have AVX2 / BMI2.) – Peter Cordes Sep 02 '21 at 18:50

1 Answer


It should not matter. Assigning a __m128 should compile to plain SSE load/store instructions, so the assignment is fast. memcpy is normally handled as a compiler intrinsic, so for a fixed size like this it should end up as the same kind of moves.

But it all depends on the compiler and the compilation options. Profile both versions and inspect the disassembly.
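To make "write both" concrete, here is a minimal sketch of the two variants side by side, so that each can be timed and its disassembly inspected. It is not the asker's actual class: `Bank`, `kNumGroups`, `processAssign`, `processMemcpy` and `variantsAgree` are hypothetical names, and plain arrays stand in for the `std::array` member (with `std::array` the destination would be `mPrevOutput.data()`). Note that a single memcpy can only run once the volume multiply has finished for every element, so that variant needs two passes over `osc` instead of one.

```cpp
#include <cassert>     // only needed if you assert on variantsAgree()
#include <cstring>     // std::memcpy, std::memcmp
#include <xmmintrin.h> // SSE: __m128, _mm_mul_ps, _mm_set1_ps

constexpr int kNumOscs = 4;   // as in the question
constexpr int kNumGroups = 4; // hypothetical name: 16 voices / 4 lanes per __m128

using Bank = __m128[kNumOscs][kNumGroups]; // assumption: plain arrays for brevity

// Variant A: store the intermediate value inside the loop, as in the question.
void processAssign(Bank& osc, const Bank& vol, const Bank& out, Bank& prev) {
    for (int o = 0; o < kNumOscs; o++) {
        for (int v = 0; v < kNumGroups; v++) {
            osc[o][v] = _mm_mul_ps(osc[o][v], vol[o][v]);
            prev[o][v] = osc[o][v]; // one 16-byte vector store
            osc[o][v] = _mm_mul_ps(osc[o][v], out[o][v]);
        }
    }
}

// Variant B: one memcpy of the whole bank, which forces two passes over osc.
void processMemcpy(Bank& osc, const Bank& vol, const Bank& out, Bank& prev) {
    for (int o = 0; o < kNumOscs; o++)
        for (int v = 0; v < kNumGroups; v++)
            osc[o][v] = _mm_mul_ps(osc[o][v], vol[o][v]);

    std::memcpy(prev, osc, sizeof(Bank)); // 4 * 4 * 16 = 256 bytes

    for (int o = 0; o < kNumOscs; o++)
        for (int v = 0; v < kNumGroups; v++)
            osc[o][v] = _mm_mul_ps(osc[o][v], out[o][v]);
}

// Sanity check: both variants must produce bit-identical prev and osc.
bool variantsAgree() {
    Bank oscA, oscB, vol, out, prevA, prevB;
    for (int o = 0; o < kNumOscs; o++) {
        for (int v = 0; v < kNumGroups; v++) {
            const float x = 1.0f + static_cast<float>(o * kNumGroups + v);
            oscA[o][v] = oscB[o][v] = _mm_set1_ps(x);
            vol[o][v] = _mm_set1_ps(0.5f);
            out[o][v] = _mm_set1_ps(0.25f);
        }
    }
    processAssign(oscA, vol, out, prevA);
    processMemcpy(oscB, vol, out, prevB);
    return std::memcmp(prevA, prevB, sizeof(Bank)) == 0 &&
           std::memcmp(oscA, oscB, sizeof(Bank)) == 0;
}
```

Once `variantsAgree()` returns true, compile both functions with the real flags and compare the generated code (for example with `-S` or on Compiler Explorer) before timing them in the actual audio callback.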

Alex Guteniev