
I needed cross-platform SIMD instructions for ARM and x86, so I found a library called libsimdpp and ran this example.

I modified it slightly to compare it against a plain C++ approach for adding two arrays, but the libsimdpp version always performed worse.

Results

  • 23 milliseconds - libsimdpp
  • 1 millisecond - normal C++ addition

Is something wrong with the way I'm using the library, or with how it's built?

The changes I made to the example:

https://pastebin.com/L14DCrky

#define SIMDPP_ARCH_X86_SSE4_1 true
#include <simdpp/simd.h>
#include <iostream>
#include <chrono>
//example where i got this from
//https://github.com/p12tic/libsimdpp/tree/2e5c0464a8069310d7eb3048e1afa0e96e08f344

// Initializes vector to store values
void init_vector(float* a, float* b, size_t size) {
    for (int i=0; i<size; i++) {
        a[i] = i * 1.0;
        b[i] = (size * 1.0) - i - 1;
    }
}



using namespace simdpp;
int main() {
    //1048576
    const unsigned long SIZE = 4 * 150000;

    float vec_a[SIZE];
    float vec_b[SIZE];
    float result[SIZE];

    ///////////////////////////*/
    //LibSIMDpp
    //*
    auto t1 = std::chrono::high_resolution_clock::now();

    init_vector(vec_a, vec_b, SIZE);
    for (int i=0; i<SIZE; i+=4) {
        float32<4> xmmA = load(vec_a + i);  //loads 4 floats into xmmA
        float32<4> xmmB = load(vec_b + i);  //loads 4 floats into xmmB
        float32<4> xmmC = add(xmmA, xmmB);  //Vector add of xmmA and xmmB
        store(result + i, xmmC);            //Store result into the vector
    }

    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
              << " milliseconds\n";
    //*/


    ///////////////////////////*/
    //standard
    //*
    init_vector(vec_a, vec_b, SIZE);
    t1 = std::chrono::high_resolution_clock::now();

    for (auto i = 0; i < SIZE; i++) {
        result[i] = vec_a[i]  + vec_b[i];
    }

    t2 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
              << " milliseconds\n";
    //*/


    int i = 0;
    return 0;
}
  • Recommend adding your build commands. It's mostly for completeness, but someone may spot an "Ooops!". – user4581301 Mar 01 '19 at 16:34
  • In the [example](https://github.com/p12tic/libsimdpp/tree/2e5c0464a8069310d7eb3048e1afa0e96e08f344/examples) there's a make file that I used. The above code is just modified code from that example. – used_up_user Mar 01 '19 at 16:38
  • E.g., an "Ooops" like not enabling optimization ^^. Also for gcc: sometimes the main function does not get fully optimized. Try putting every performance critical code into a separate function. And have a look at the generated assembly (compile with `-O2 -S`) – chtz Mar 01 '19 at 16:39
  • Besides enabling optimization: In the first case you have `init_vector` inside the timing. This likely takes up more time than the addition afterwards. – chtz Mar 01 '19 at 17:06
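
To make the two suggestions above concrete, here is a minimal sketch of the benchmark restructured so that init_vector runs outside the timed regions and each timed loop lives in its own function. The helper names add_simdpp and add_scalar are mine, the buffers are moved into std::vector to keep them off the stack, and the timing uses microseconds since the optimized loops may finish in well under a millisecond; treat it as an illustration under those assumptions, not the library's own benchmark.

#define SIMDPP_ARCH_X86_SSE4_1 true
#include <simdpp/simd.h>
#include <chrono>
#include <iostream>
#include <vector>

// Vectorized add, 4 floats per iteration (assumes size is a multiple of 4
// and the buffers are suitably aligned, as with the question's load/store).
void add_simdpp(const float* a, const float* b, float* out, size_t size) {
    for (size_t i = 0; i < size; i += 4) {
        simdpp::float32<4> va = simdpp::load(a + i);
        simdpp::float32<4> vb = simdpp::load(b + i);
        simdpp::float32<4> vc = simdpp::add(va, vb);
        simdpp::store(out + i, vc);
    }
}

// Plain scalar add for comparison.
void add_scalar(const float* a, const float* b, float* out, size_t size) {
    for (size_t i = 0; i < size; i++)
        out[i] = a[i] + b[i];
}

int main() {
    const size_t SIZE = 4 * 150000;
    std::vector<float> a(SIZE), b(SIZE), result(SIZE);

    // Initialization happens once, outside both timed regions.
    for (size_t i = 0; i < SIZE; i++) {
        a[i] = float(i);
        b[i] = float(SIZE - i - 1);
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    add_simdpp(a.data(), b.data(), result.data(), SIZE);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count()
              << " us (libsimdpp)\n";

    t1 = std::chrono::high_resolution_clock::now();
    add_scalar(a.data(), b.data(), result.data(), SIZE);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count()
              << " us (scalar)\n";
}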

1 Answer


It's normal that a debug build slows down manually-vectorized code more than it slows down scalar even if you use _mm_add_ps intrinsics directly. (Usually because you tend to use more separate statements, and debug code-gen compiles each statement separately.)
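
For reference, a direct-intrinsics version of the same loop (the explicit SSE _mm_add_ps form mentioned above) would look roughly like this; the function name add_sse is only for illustration, not code from the question or from libsimdpp:

#include <immintrin.h>

// Direct SSE version of the add loop (assumes size is a multiple of 4).
void add_sse(const float* a, const float* b, float* out, size_t size) {
    for (size_t i = 0; i < size; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // load 4 floats (unaligned load)
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // add and store 4 results
    }
}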

You're using a C++ wrapper library, so in debug mode that's a significant extra layer of stuff that won't optimize away because you told the compiler not to. So it's not surprising that it slows things down enough to be worse than scalar. See "Why is this C++ wrapper class not being inlined away?" for example. (Even __attribute__((always_inline)) doesn't help performance much; passing args still results in a reload/store to make another copy.)

Don't benchmark debug builds; it's useless and tells you very little about -O3 performance. (You might also want to use -O3 -march=native -ffast-math, depending on your use-case.)
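
For example (the include path and file names here are just placeholders), an optimized build of the question's file might look something like:

    g++ -O3 -march=native -ffast-math -I /path/to/libsimdpp main.cpp -o bench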

Peter Cordes