
I'm starting out with SIMD programming, but I don't know what to do at this point. I'm trying to reduce the runtime, but it's going the other way.

This is my basic code: https://codepaste.net/a8ut89

void blurr2(double * u, double * r) {

    int i;
    double dos[2] = { 2.0, 2.0 };

    for (i = 0; i < SIZE - 1; i++) {
        r[i] = u[i] + u[i + 1];
    }
}

blurr2: 0.43s

int contarNegativos(double * u) {

    int i;
    int contador = 0;

    for (i = 0; i < SIZE; i++) {
        if (u[i] < 0) {
            contador++;
        }
    }
    return contador;
}

negativeCount: 1.38s

void ord(double * v, double * u, double * r) {

    int i;

    for (i = 0; i < SIZE; i += 2) {
        r[i] = *(__int64*)&(v[i]) | *(__int64*)&(u[i]);
    }
}

ord: 0.33s


And this is my SIMD code:

https://codepaste.net/fbg1g5

void blurr2(double * u, double * r) {

    __m128d rp2;
    __m128d rdos;
    __m128d rr;
    int i;
    int sizeAux = SIZE % 2 == 1 ? SIZE : SIZE - 1;
    double dos[2] = { 2.0, 2.0 };

    rdos = *(__m128d*)dos;

    for (i = 0; i < sizeAux; i += 2) {
        rp2 = *(__m128d*)&u[i + 1];
        rr = _mm_add_pd(*(__m128d*)&u[i], rp2);
        *((__m128d*)&r[i]) = _mm_div_pd(rr, rdos);
    }
}

blurr2: 0.42s


int contarNegativos(double * u) {

    __m128d rcero;
    __m128d rr;
    int i;
    double cero[2] = { 0.0, 0.0 };
    int contador = 0;

    rcero = *(__m128d*)cero;

    for (i = 0; i < SIZE; i += 2) {
        rr = _mm_cmplt_pd(*(__m128d*)&u[i], rcero);
        if (((__int64 *)&rr)[0]) {
            contador++;
        };
        if (((__int64 *)&rr)[1]) {
            contador++;
        };
    }
    return contador;
}

negativeCount: 1.42s


void ord(double * v, double * u, double * r) {

    __m128d rr;
    int i;

    for (i = 0; i < SIZE; i += 2) {
        *((__m128d*)&r[i]) = _mm_or_pd(*(__m128d*)&v[i], *(__m128d*)&u[i]);
    }
}

ord: 0.35s

**These are different solutions.**


Can you explain what I'm doing wrong? I'm a bit lost...

  • Please post the [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) that shows the problem. Show the input, the expected output, and the actual output as text *in the question*. I am not going to click on your links. – Weather Vane Dec 04 '17 at 20:50
  • Have you checked whether the compiler already auto-vectorized the code for you in the "basic code" case? – geza Dec 04 '17 at 21:18
  • @geza They are different solutions – JoniValles Dec 04 '17 at 21:41
  • blurr2: multiply by `0.5` instead of dividing by 2. It will be *much* faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?) – Peter Cordes Dec 04 '17 at 23:18
  • @PeterCordes Oops, old code, I changed it and it's faster! (yes, but they deleted my question, don't know why...) – JoniValles Dec 04 '17 at 23:36

2 Answers

Use _mm_loadu_pd instead of pointer-casting and dereferencing a __m128d. Your code is guaranteed to fault on gcc/clang, where dereferencing a __m128d* is assumed to be aligned, whenever the pointer isn't actually 16-byte aligned.


blurr2: multiply by 0.5 instead of dividing by 2. It will be much faster. (I commented the same thing on a question with the exact same code in the last day or two, was that also you?)
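
Putting those two fixes together, blurr2 could look something like this (an untested sketch, assuming SIZE is the same compile-time constant as in your code; the vector loop stops early enough that the unaligned u[i+1] load never reads past the end of the array, and a scalar loop finishes the last element):

#include <immintrin.h>

#define SIZE 1024   // assumption for this sketch; use your own constant

void blurr2_fixed(double * u, double * r) {
    const __m128d half = _mm_set1_pd(0.5);      // multiply by 0.5, don't divide by 2
    int i;
    for (i = 0; i + 2 < SIZE; i += 2) {
        __m128d lo = _mm_loadu_pd(&u[i]);       // u[i], u[i+1]: unaligned-safe load
        __m128d hi = _mm_loadu_pd(&u[i + 1]);   // u[i+1], u[i+2]
        _mm_storeu_pd(&r[i], _mm_mul_pd(_mm_add_pd(lo, hi), half));
    }
    for (; i < SIZE - 1; i++) {                 // scalar cleanup for the final element
        r[i] = (u[i] + u[i + 1]) * 0.5;
    }
}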


negativeCount: _mm_castpd_si128 the compare result to integer, and accumulate it with _mm_sub_epi64. (The bit pattern is all-zero or all-one, i.e. 2's complement 0 / -1).

#include <immintrin.h>
#include <stdint.h>

static const size_t SIZE = 1024;

uint64_t countNegative(double * u) {
    __m128i counts = _mm_setzero_si128();
    for (size_t i = 0; i < SIZE; i += 2) {
        __m128d cmp = _mm_cmplt_pd(_mm_loadu_pd(&u[i]), _mm_setzero_pd());
        counts = _mm_sub_epi64(counts, _mm_castpd_si128(cmp));
    }

    //return counts[0] + counts[1];  // GNU C only, and less efficient
    // horizontal sum
    __m128i hi64  = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
    counts = _mm_add_epi64(counts, hi64);

    uint64_t scalarcount = _mm_cvtsi128_si64(counts);
    return scalarcount;
}

To learn more about efficient vector horizontal sums, see Fastest way to do horizontal float vector sum on x86. But the first rule is to do it outside the loop.

(source + asm on the Godbolt compiler explorer)

From MSVC (which I'm guessing you're using, or you'd get segfaults from *(__m128d*)foo), the inner loop is:

$LL4@countNegat:
    movups   xmm0, XMMWORD PTR [rcx]
    lea      rcx, QWORD PTR [rcx+16]
    cmpltpd  xmm0, xmm2
    psubq    xmm1, xmm0
    sub      rax, 1
    jne      SHORT $LL4@countNegat

It could maybe go faster with unrolling (and maybe two vector accumulators), but this is fairly good and might go close to 1.25 clocks per 16 bytes on Sandybridge/Haswell. (Bottleneck on 5 fused-domain uops).
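
For example, unrolling by two with two accumulators might look like this (a sketch only, untested; I'm taking the size as a parameter here and assuming it's a multiple of 4):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Two independent accumulators let the psubq dependency chains overlap.
uint64_t countNegative_unrolled(const double * u, size_t n) {
    __m128i c0 = _mm_setzero_si128();
    __m128i c1 = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {         // n assumed to be a multiple of 4
        __m128d a = _mm_cmplt_pd(_mm_loadu_pd(&u[i]),     _mm_setzero_pd());
        __m128d b = _mm_cmplt_pd(_mm_loadu_pd(&u[i + 2]), _mm_setzero_pd());
        c0 = _mm_sub_epi64(c0, _mm_castpd_si128(a));
        c1 = _mm_sub_epi64(c1, _mm_castpd_si128(b));
    }
    __m128i counts = _mm_add_epi64(c0, c1);     // combine the accumulators
    __m128i hi64 = _mm_shuffle_epi32(counts, _MM_SHUFFLE(1, 0, 3, 2));
    counts = _mm_add_epi64(counts, hi64);       // horizontal sum, outside the loop
    return (uint64_t)_mm_cvtsi128_si64(counts);
}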

Your version was actually unpacking to integer inside the inner loop! And if you were using MSVC -Ox, it was branching instead of using a branchless compare + conditional add. I'm surprised it wasn't slower than the scalar version.

Also, (int64_t *)&rr violates strict aliasing. char* can alias anything, but it's not safe to cast other pointers onto SIMD vectors and expect it to work; if it does, you got lucky. Compilers usually generate similar code whether you pun the pointer or use intrinsics, but proper intrinsics are never worse and are actually well-defined.
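
If you ever do need a scalar view of a compare result, _mm_movemask_pd extracts the sign bit of each lane into an integer bitmask with no pointer-casting; a minimal sketch:

#include <immintrin.h>

// Count the all-ones lanes of a double-precision compare result.
// _mm_movemask_pd packs each lane's sign bit into bits 0..1 of an int.
static inline int negative_lanes(__m128d cmp) {
    int mask = _mm_movemask_pd(cmp);   // 0..3
    return (mask & 1) + (mask >> 1);   // number of set lanes
}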

Peter Cordes

Do you know that your ord function with SIMD is not 1:1 with the ord function that doesn't use SIMD instructions?

In the ord function without SIMD, the result of the OR operation is calculated only for even indexes:

r[0] = v[0] | u[0], 
r[2] = v[2] | u[2], 
r[4] = v[4] | u[4]

What about the odd indexes? If the OR operations were calculated for all indexes, it would probably take more time than it does now. A scalar loop matching what your SIMD version actually computes is sketched below.
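
For comparison, here is a scalar loop that matches what the SIMD version actually computes, ORing the bit pattern of every element (a sketch, reusing the question's SIZE constant; memcpy does the type-punning so it doesn't hit the strict-aliasing problem mentioned in the other answer):

#include <stdint.h>
#include <string.h>

// Scalar equivalent of the SIMD ord: OR the bit patterns of all elements,
// even and odd indexes alike.
void ord_all(double * v, double * u, double * r) {
    int i;
    for (i = 0; i < SIZE; i++) {
        uint64_t a, b, c;
        memcpy(&a, &v[i], sizeof a);
        memcpy(&b, &u[i], sizeof b);
        c = a | b;
        memcpy(&r[i], &c, sizeof c);
    }
}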

rafix07