I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million element-wise multiplications of 4-float vectors and adds the results to a cumulative total. For my classic, non-SIMD variation I made a struct with 4 floats and wrote my own function `multiplyTwo` that multiplies two such structs element-wise and returns another struct. For my SIMD variation I used `immintrin.h` along with `__m128`, `_mm_set_ps`, and `_mm_mul_ps`. I'm running on an i7-8565U processor (Whiskey Lake) and compiling with `g++ main.cpp -mavx -o test.exe` to enable the AVX extension instructions in GCC.

The weird thing is that the SIMD version takes about 1.4 seconds, while the non-SIMD version takes only 1 second. I feel as though I'm doing something wrong, as I thought the SIMD version should run about 4 times faster. Any help is appreciated; the code is below. I've placed the non-SIMD code in comments; the code in its current form is the SIMD version.

#include "immintrin.h" // for AVX 
#include <iostream>

struct NonSIMDVec {
    float x, y, z, w;
};

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b);

int main() {
    union { __m128 result; float res[4]; };
    // union { NonSIMDVec result; float res[4]; };

    float total = 0; 
    for(unsigned i = 0; i < 100000000; ++i) {
        __m128 a4 = _mm_set_ps(0.0000002f, 1.23f, 2.0f, (float)i);
        __m128 b4 = _mm_set_ps((float)i, 1.3f, 2.0f, 0.000001f);
        // NonSIMDVec a4 = {0.0000002f, 1.23f, 2.0f, (float)i}; 
        // NonSIMDVec b4 = {(float)i, 1.3f, 2.0f, 0.000001f};

        result = _mm_mul_ps(a4, b4); 
        // result = multiplyTwo(a4, b4);

        total += res[0];
        total += res[1];
        total += res[2];
        total += res[3];
    }

    std::cout << total << '\n';
}

NonSIMDVec multiplyTwo(const NonSIMDVec& a, const NonSIMDVec& b)
{ return {a.x*b.x, a.y*b.y, a.z*b.z, a.w*b.w}; } // element-wise product
```

Montana
  • Compilers these days are *very* good at optimizations. Perhaps it's simply able to create more optimal code than you think? You could take a look at the generated code of the two different executables, and compare to see what the compiler does on its own compared to your code. – Some programmer dude Oct 13 '19 at 17:03
  • 2
    You forgot to enable optimization. Manual vectorization takes more statements and is often worse at `-O0`. Oh, holy pixels Batman, you also put your horizontal sum *inside* the loop. – Peter Cordes Oct 13 '19 at 17:05
  • 2
    That's also an abuse of `_mm_set_ps` by the way, good codegen cannot be expected if you use it like that – harold Oct 13 '19 at 17:14
  • I also tried running both with `-O3` and they took the same amount of time to run. Does this mean that in many simple cases (like my benchmark), it's not even worth it to write explicit SIMD vectorization? – Montana Oct 13 '19 at 17:15
  • Building the two examples (with `-O2` enabled) does indeed generate some different code (as seen on e.g. https://godbolt.org/z/z7abDp). These differences could explain the differences in run-time. And as can be seen, the compiler does generate SIMD instructions itself. General hand-made optimizations are in most cases not better than what a modern compiler can create, so please leave it to the compiler. Then benchmark and profile to find the few top bottlenecks and concentrate your optimization efforts on those (if it's not already "good enough"). – Some programmer dude Oct 13 '19 at 17:19
  • **You still bottleneck on the same FP add latency because you're doing the horizontal add inside the loop even in the SIMD version**. (Possibly `-ffast-math` would let the compiler reorder the FP operations and use `vaddps` inside the loop, but most likely you just need to fix your source code.) It's very possible to get at least a 4x speedup here, but only if you think about the bottlenecks in the resulting asm. (Loop-carried dependency chains, and the throughput cost of `_mm_set_ps`; most compilers are not good at optimizing SIMD shuffles. clang might.) – Peter Cordes Oct 13 '19 at 17:19
  • 1
    @Montana this benchmark is accidentally complicated, by mixing constants and variables into the same vector. The slow sum can be fixed easily. Likely if you had chosen some real code to test SIMD on, it would have worked out better, because it wouldn't have needed the artificially-odd usage of `_mm_set_ps` – harold Oct 13 '19 at 17:24

1 Answer

With optimization disabled (the gcc default is `-O0`), intrinsics are often terrible. Anti-optimized `-O0` code-gen for intrinsics usually hurts a lot (even more than for scalar), and some of the function-like intrinsics introduce extra store/reload overhead. Plus, the extra store-forwarding latency of `-O0` tends to hurt more because there's less ILP when you do things with 1 vector instead of 4 scalars.

Use `gcc -march=native -O3`.
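
Applied to the question's compile line, that would look something like this (keeping the question's file and output names):

```
g++ -O3 -march=native main.cpp -o test.exe
```

`-march=native` lets GCC use every ISA extension the host CPU supports (on Whiskey Lake that includes AVX2 and FMA, not just AVX) and also tunes code-gen for the local CPU.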

But even with optimization enabled, your code is still written to destroy the performance of SIMD by doing a horizontal add of each vector inside the loop. See How to Calculate Vector Dot Product Using SSE Intrinsic Functions in C for how not to do that: use `_mm_add_ps` to accumulate a `__m128` total vector, and only horizontal-sum it outside the loop.
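
A minimal sketch of that fix applied to the question's loop (`hsum_ps` and `sum_products` are hypothetical helper names used here for illustration; the linked Q&A covers efficient horizontal sums):

```cpp
#include <immintrin.h>

// One common SSE3 horizontal-sum idiom.
static float hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // duplicate odd elements: [v1, v1, v3, v3]
    __m128 sums = _mm_add_ps(v, shuf);       // [v0+v1, 2*v1, v2+v3, 2*v3]
    shuf        = _mm_movehl_ps(shuf, sums); // move v2+v3 down to element 0
    sums        = _mm_add_ss(sums, shuf);    // (v0+v1) + (v2+v3) in element 0
    return _mm_cvtss_f32(sums);
}

float sum_products() {
    __m128 vtotal = _mm_setzero_ps();        // vector accumulator
    for (unsigned i = 0; i < 100000000; ++i) {
        __m128 a4 = _mm_set_ps(0.0000002f, 1.23f, 2.0f, (float)i);
        __m128 b4 = _mm_set_ps((float)i, 1.3f, 2.0f, 0.000001f);
        vtotal = _mm_add_ps(vtotal, _mm_mul_ps(a4, b4)); // vertical add only
    }
    return hsum_ps(vtotal);                  // one horizontal sum, outside the loop
}
```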

You bottleneck your loop on FP-add latency by doing a scalar `total +=` inside the loop. That loop-carried dependency chain means your loop can't run any faster than 1 float per 4 cycles on your Skylake-derived microarchitecture, where `addss` latency is 4 cycles. (https://agner.org/optimize/)

Even better than a single `__m128 total`, use 4 or 8 vector accumulators to hide FP-add latency, so your SIMD loop can bottleneck on mul/add (or FMA) throughput instead of latency.
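
As a sketch of the multiple-accumulator idea, with hypothetical input arrays `a` and `b` of length `n` (a multiple of 16 floats), since real code would normally be loading vectors from memory anyway:

```cpp
#include <immintrin.h>
#include <cstddef>

// Four independent accumulators let four vaddps dependency chains
// run in parallel, hiding the 4-cycle FP-add latency.
float sum_products(const float* a, const float* b, std::size_t n) {
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps(), acc3 = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(a+i),    _mm_loadu_ps(b+i)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(a+i+4),  _mm_loadu_ps(b+i+4)));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(a+i+8),  _mm_loadu_ps(b+i+8)));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(a+i+12), _mm_loadu_ps(b+i+12)));
    }
    // Combine accumulators with vertical adds, then do one horizontal sum.
    __m128 vtotal = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    return hsum_ps(vtotal); // hsum_ps from the earlier sketch
}
```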


Once you fix that, then as @harold points out, the way you're using `_mm_set_ps` inside the loop will result in pretty bad asm from the compiler. It's not a good choice inside a loop when the operands aren't constants, or at least loop-invariant.

Your example here is clearly artificial; normally you'd be loading SIMD vectors from memory. But if you did need to update a loop counter in a `__m128` vector, you might use `tmp = _mm_add_ps(tmp, _mm_set_ps(1.0, 0, 0, 0))`. Or unroll with adding 1.0, 2.0, 3.0, and 4.0 so the loop-carried dependency is only the `+= 4.0` in the one element.

`x + 0.0` is the identity operation even for FP (except maybe for signed zero), so you can do it to the other elements without changing them.

Or, for the low element of a vector, you can use `_mm_add_ss` (scalar) to modify only it.
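
A sketch of that counter trick; the non-counter elements are just the question's constants, and the loop body is left as a placeholder:

```cpp
#include <immintrin.h>

// Hypothetical illustration: keep (float)i in the low element of a
// vector and bump it with one add per iteration, instead of
// rebuilding the whole vector with _mm_set_ps from a variable.
void counter_in_vector_demo() {
    // _mm_set_ps takes elements high-to-low, so the last argument is
    // the low element. Adding 0.0f to the others leaves them unchanged.
    __m128 a4   = _mm_set_ps(0.0000002f, 1.23f, 2.0f, 0.0f); // low element = counter
    __m128 step = _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f);        // bumps only the low element
    for (unsigned i = 0; i < 1000; ++i) {
        // ... use a4 here; its low element equals (float)i ...
        a4 = _mm_add_ps(a4, step);
        // Alternative that touches only the low element:
        // a4 = _mm_add_ss(a4, _mm_set_ss(1.0f));
    }
}
```

The unrolled version the answer describes would keep counters offset by 1.0, 2.0, and 3.0 off one main vector that steps by 4.0, so the only loop-carried dependency is a single `+= 4.0`. (Bear in mind a float counter stops being exact above 2^24, so this is a sketch of the dependency-chain point, not of a full 100M-iteration loop.)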

Peter Cordes