I am trying to use SIMD (x86 `immintrin.h`) to speed up my math. The code looks like this:

#include <float.h>
#include <immintrin.h>
#include <cmath>
#include <iostream>

class Point2 {
public:
    Point2() = default;
    Point2(double xx, double yy):x_(xx), y_(yy) {}
    // SIMD version: packed multiply, then a horizontal add
    inline double CrossSIMD(Point2 other) {
        __m128d a = _mm_load_pd(&x_);  // [x_, y_]; 16-byte aligned thanks to the class attribute
        // _mm_set_pd takes (high, low), so this builds [other.y_, -other.x_]
        __m128d _other = _mm_set_pd(-other.x_, other.y_);
        __m128d c = _mm_mul_pd(a, _other);
        alignas(16) double temp[2];  // _mm_store_pd requires a 16-byte-aligned address
        _mm_store_pd(&temp[0], c);
        return temp[0] + temp[1];    // x_*other.y_ - y_*other.x_
    }
    // Non-SIMD (scalar) version
    inline double Cross(Point2 other) {
        return x_ * other.y_ - y_ * other.x_;
    }
private:
    double x_ = 0.;
    double y_ = 0.;
} __attribute__((aligned(16)));
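
// The question omits the definitions of Timer, kLoop, and array (see the
// comments below about it not being a minimal reproducible example).
// The following are hypothetical stand-ins, only so the listing builds:
#include <chrono>
#include <string>

struct Timer {  // prints elapsed wall-clock time when it goes out of scope
    std::string name_;
    std::chrono::steady_clock::time_point start_ = std::chrono::steady_clock::now();
    explicit Timer(std::string name) : name_(std::move(name)) {}
    ~Timer() {
        std::chrono::duration<double, std::milli> ms =
            std::chrono::steady_clock::now() - start_;
        std::cout << name_ << ": " << ms.count() << " ms\n";
    }
};

constexpr int kLoop = 100000000;  // assumed iteration count
double array[1024] = {1.0};       // assumed input data; 1024 matches the & 1023 index masks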

int main() {
    double sum_cross = 0.;
    {
        Timer test_1("Cross");
        for(int i = 0; i < kLoop; ++i) {
            int index_1x = i & 1023;
            int index_1y = (i + 1) & 1023;
            int index_2x = (i + 2) & 1023;
            int index_2y = (i + 3) & 1023;
            sum_cross += Point2(array[index_1x], array[index_1y]).Cross(Point2(array[index_2x], array[index_2y]));
        }
    }
    std::cout << sum_cross << std::endl;
    double sum_simd = 0.;
    {
        Timer test_1("SIMD");
        for(int i = 0; i < kLoop; ++i) {
            int index_1x = i & 1023;
            int index_1y = (i + 1) & 1023;
            int index_2x = (i + 2) & 1023;
            int index_2y = (i + 3) & 1023;
            sum_simd += Point2(array[index_1x], array[index_1y]).CrossSIMD(Point2(array[index_2x], array[index_2y]));
        }
    }
    std::cout << sum_simd << std::endl;
    std::cout << sum_simd - sum_cross << std::endl;
    return 0;    
}

Compiled with GCC 7.5.0 on Linux with these options:

g++ -o cross_simd cross_simd.cpp -O3 -march=native

But profiling shows that CrossSIMD is 3 times slower than the scalar Cross function. I tried reading the compiler's assembly output (.s files), but I still can't find the reason for the slowness. [profiling screenshot]

  • Intel Haswell/Skylake CPUs have 2/clock `mulsd` throughput but 1/clock shuffle throughput (`unpcklpd` / `unpckhpd`). https://uops.info/ / https://agner.org/optimize/ and [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391). If you can enable `-march=x86-64-v3` (or haswell / znver1), FMA can nicely combine the 2nd mulsd with the subsd. Oh, you did use `-march=native` locally, but not `-march=whatever` on Godbolt. So your shuffles are competing against FMA if your CPU has it [compile command sketched after the comments]. – Peter Cordes Nov 14 '22 at 11:38
  • @PeterCordes Looks like it indeed. So removed my comment. – Pepijn Kramer Nov 14 '22 at 11:40
  • Not quite a duplicate of [Add+Mul become slower with Intrinsics - where am I wrong?](https://stackoverflow.com/q/53834135) despite similar question; the answer there optimized differently, avoiding shuffles. – Peter Cordes Nov 14 '22 at 11:50
  • Instead of vectorizing each cross product separately, do 2 or 4 cross products in parallel, ideally by storing your data in an array of x[] values and a separate array of y[] values [sketched after the comments]. Oh, except you're mixing and matching x and y values to try to defeat normal vectorization. But see also [Need some constructive criticism on my SSE/Assembly attempt](https://stackoverflow.com/q/2923458). – Peter Cordes Nov 14 '22 at 12:03
  • What CPU are you optimizing for? And in your real problem, is there any possibility to work on multiple cross products at once, from multiple points? I'm pretty sure these loops aren't your real problem. – Peter Cordes Nov 14 '22 at 12:08
  • I was going to look at how these functions inlined into your loop, but it's not a [mcve]; multiple missing definitions place even after commenting out a couple things like `Timer`. https://godbolt.org/z/ac8W9vYW8 . Basically a duplicate of an instruction-cost Q&A: shuffles cost more than multiplies on many CPUs, in terms of throughput. Also not a [mcve] because you only show asm for the stand-alone version, not how it inlines into the loop, or with `-march=native`. Could get reopened if some edits change that and there's more to say about details here. – Peter Cordes Nov 14 '22 at 12:10
  • Why `-other.x_` when you could for free replace the addition with a subtraction later [sketched after the comments]? Were you trying to set things up for `_mm_dp_pd`? – Marc Glisse Nov 14 '22 at 12:37
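
The Godbolt link in the first comment was compiled without any -march option, so FMA was not available there. To reproduce the comparison Peter Cordes describes, the flag can be added to the compile command from the question (note: -march=x86-64-v3 requires GCC 11 or newer, so with GCC 7.5 the closest spelling is -march=haswell):

g++ -o cross_simd cross_simd.cpp -O3 -march=haswell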
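
A minimal sketch of the structure-of-arrays approach from Peter Cordes's comment, assuming the points can be stored as separate x[] and y[] arrays. All names here (CrossBatch, x1, y1, x2, y2, out, n) are illustrative, not from the question; with SSE2 this computes two cross products per iteration and needs no shuffles at all:

void CrossBatch(const double* x1, const double* y1,
                const double* x2, const double* y2,
                double* out, int n) {
    // out[i] = x1[i]*y2[i] - y1[i]*x2[i]; assumes n is a multiple of 2
    for (int i = 0; i < n; i += 2) {
        __m128d a = _mm_loadu_pd(&x1[i]);
        __m128d b = _mm_loadu_pd(&y2[i]);
        __m128d c = _mm_loadu_pd(&y1[i]);
        __m128d d = _mm_loadu_pd(&x2[i]);
        _mm_storeu_pd(&out[i], _mm_sub_pd(_mm_mul_pd(a, b), _mm_mul_pd(c, d)));
    }
}

With AVX available, the same loop widens naturally to __m256d for four cross products per iteration.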
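
And a sketch of Marc Glisse's suggestion: skip the -other.x_ negation and replace the final addition with a subtraction. CrossSIMD_Sub is a hypothetical extra member of the Point2 class above, not code from the question:

    inline double CrossSIMD_Sub(Point2 other) {
        __m128d a = _mm_load_pd(&x_);                // [x_, y_]
        __m128d b = _mm_set_pd(other.x_, other.y_);  // [other.y_, other.x_], no negation
        __m128d c = _mm_mul_pd(a, b);                // [x_*other.y_, y_*other.x_]
        __m128d hi = _mm_unpackhi_pd(c, c);          // broadcast the high lane
        return _mm_cvtsd_f64(_mm_sub_sd(c, hi));     // x_*other.y_ - y_*other.x_
    }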

0 Answers