
I am new to SSE intrinsics and am trying to optimise my code with them. Here is my program, which counts the array elements that are equal to a given value.

I changed my code to an SSE version, but the speed barely changes. I am wondering whether I am using SSE the wrong way...

This code is for an assignment where we're not allowed to enable compiler optimization options.

Non-SSE version:

#include <math.h>
#include <float.h>
#include <sys/types.h>

int get_freq(const float* matrix, float value, ssize_t g_elements) {

    int freq = 0;

    for (ssize_t i = 0; i < g_elements; i++) {
        if (fabsf(matrix[i] - value) <= FLT_EPSILON) {
            freq++;
        }
    }

    return freq;
}

SSE version:

#include <immintrin.h>
#include <math.h>
#include <float.h>
#include <sys/types.h>

#define GETLOAD(n)  __m128 load##n = _mm_load_ps(&matrix[i + 4 * n])
#define GETEQU(n)   __m128 check##n = _mm_and_ps(_mm_cmpeq_ps(load##n, value), and_value)
#define GETCOUNT(n) count = _mm_add_ps(count, check##n)

int get_freq(const float* matrix, float givenValue, ssize_t g_elements) {

    int freq = 0;
    ssize_t i;

    __m128 value = _mm_set1_ps(givenValue);
    __m128 count = _mm_setzero_ps();
    /* ANDing the all-ones compare mask with the bit pattern of 1.0f
       turns each matching lane into 1.0f, so matches can be summed
       as floats. */
    __m128 and_value = _mm_set1_ps(1.0f);

    for (i = 0; i + 15 < g_elements; i += 16) {
        GETLOAD(0);  GETLOAD(1);  GETLOAD(2);  GETLOAD(3);
        GETEQU(0);   GETEQU(1);   GETEQU(2);   GETEQU(3);
        GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
    }

    /* horizontal sum of the four per-lane counts */
    __m128 shuffle_a = _mm_shuffle_ps(count, count, _MM_SHUFFLE(1, 0, 3, 2));
    count = _mm_add_ps(count, shuffle_a);
    __m128 shuffle_b = _mm_shuffle_ps(count, count, _MM_SHUFFLE(2, 3, 0, 1));
    count = _mm_add_ps(count, shuffle_b);
    freq = _mm_cvtss_si32(count);

    /* scalar cleanup loop for the leftover elements */
    for (; i < g_elements; i++) {
        if (fabsf(matrix[i] - givenValue) <= FLT_EPSILON) {
            freq++;
        }
    }

    return freq;
}
Peter Cordes
Jennifer Q
    Why do you care about the performance, the vector code isn't even *correct* yet. – EOF May 14 '16 at 08:34
  • Can you tell me what the error is? I will correct it. – Jennifer Q May 14 '16 at 08:46
    You don't compare `fabs(x) <= FLT_EPSILON`, you compare `x != 0.0f`, which is not at all equivalent to the non-vectorized version. – EOF May 14 '16 at 08:51
  • I do `fabs(x) <= FLT_EPSILON` due to floating-point precision error. But when I changed to the SSE version, I found two floating-point values could be compared exactly (still not sure, but in my tests it works well). And in `_mm_cmpeq_ps(load##n, value)`, I compare the array elements with the given value, which means comparing `x == 0.0f`. I don't understand why you get `x != 0.0f`; could you explain more? – Jennifer Q May 14 '16 at 09:07
    Whether you compare `x != 0.0f` or `x == 0.0f` is not the point. Neither is equivalent to `fabs(x) <= FLT_EPSILON`. And no, SSE doesn't magically make floating-point math infinitely precise. I'd say that your tests are bad and don't exercise the case where `FLT_EPSILON` is relevant. Of course that test is of questionable merit anyway, but I'm not the right person to give lectures about this topic. – EOF May 14 '16 at 09:11
  • ok, I get the point. The precision is really a problem in my code. I will try to fix it. Thank u for your suggestion. – Jennifer Q May 14 '16 at 09:21
    Well, I'd also suggest you look into what `FLT_EPSILON` actually *means*, and how that relates to floating-point inaccuracies. Hint: It's not suitable as an absolute error in the difference between two values in general. – EOF May 14 '16 at 09:24
  • Now it compiles, [and the horrible asm from the SSE version can be seen](https://godbolt.org/g/nSBYQK). I put it on godbolt along with a version of the scalar code that actually compiles, too. – Peter Cordes May 14 '16 at 10:27

2 Answers


If you need to compile with `-O0`, then do as much as possible in a single statement. In normal code, `int a = foo(); bar(a);` will compile to the same asm as `bar(foo());`, but in `-O0` code, the second version will probably be faster, because it doesn't store the result to memory and then reload it for the next statement.

`-O0` is designed to give the most predictable results when debugging, which is why everything is stored to memory after every statement. This is obviously horrible for performance.
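To make that concrete, here's a hypothetical pair of functions (`foo`/`bar` are placeholders, not code from the question) showing the two styles. They compute the same result, but at `-O0` only the two-statement version stores and reloads `a` between statements:

```c
#include <assert.h>

int foo(int x) { return x * 2; }
int bar(int x) { return x + 1; }

/* Two statements: at -O0 the compiler spills `a` to the stack
   after the first statement and reloads it for the second. */
int two_statements(int x) {
    int a = foo(x);
    return bar(a);
}

/* One expression: no named temporary, so -O0 emits no extra
   store/reload between the calls. With -O2 both versions
   compile to the same asm. */
int one_expression(int x) {
    return bar(foo(x));
}
```

The difference only shows up in the generated asm (compare both at `-O0` on godbolt); the return values are identical.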

I wrote a big answer a while ago for a different question from someone else with a stupid assignment like yours that required them to optimize for `-O0`. Some of that may help.


Don't try too hard on this assignment. Probably most of the "tricks" you figure out that make your code run faster with `-O0` will only matter for `-O0`, and make no difference with optimization enabled.

In real life, code is typically compiled with `clang` or `gcc -O2` at least, and sometimes `-O3 -march=haswell` or whatever to auto-vectorize (once it's debugged and you're ready to optimize).


Re: your update:

Now it compiles, [and the horrible asm from the SSE version can be seen](https://godbolt.org/g/nSBYQK). I put it on godbolt along with a version of the scalar code that actually compiles, too. Intrinsics usually compile very badly with optimization disabled, with the inline wrapper functions still having args and return values that result in actual load/store round trips (store-forwarding latency), even with `__attribute__((always_inline))`. See Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled for example.

The scalar version comes out a lot less bad. Its source does everything in one expression, so temporaries stay in registers. The loop counter is still in memory, though, bottlenecking it to at best one iteration per 6 cycles on Haswell, for example. (See the tag wiki for optimization resources.)


BTW, a vectorized `fabsf()` is easy; see Fastest way to compute absolute value using SSE. That plus an SSE compare for less-than-or-equal should do the trick to give you the same semantics as your scalar code. (But it makes it even harder to get `-O0` not to suck.)
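A sketch of what that could look like (assumptions: unaligned loads are fine, `__builtin_popcount` from GCC/Clang is available, and `get_freq_sse` is a made-up name): clear the sign bit with `_mm_andnot_ps` to get a vectorized `fabsf`, compare with `_mm_cmple_ps`, then count matching lanes via `_mm_movemask_ps`:

```c
#include <immintrin.h>
#include <math.h>
#include <float.h>
#include <stddef.h>

/* Counts elements with |matrix[i] - value| <= FLT_EPSILON,
   matching the scalar version's semantics. */
int get_freq_sse(const float* matrix, float value, size_t n) {
    const __m128 vval = _mm_set1_ps(value);
    const __m128 veps = _mm_set1_ps(FLT_EPSILON);
    /* sign-bit-only mask: andnot clears it, i.e. vectorized fabsf */
    const __m128 sign = _mm_set1_ps(-0.0f);

    int freq = 0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(&matrix[i]), vval);
        __m128 absd = _mm_andnot_ps(sign, diff);          /* |x - value|       */
        __m128 le   = _mm_cmple_ps(absd, veps);           /* all-ones on match */
        freq += __builtin_popcount(_mm_movemask_ps(le));  /* count set lanes   */
    }
    for (; i < n; i++) {                                  /* scalar cleanup    */
        if (fabsf(matrix[i] - value) <= FLT_EPSILON)
            freq++;
    }
    return freq;
}
```

Counting with `movemask` + `popcount` also avoids the float accumulator and the shuffle-based horizontal sum entirely.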

You might do better just manually unrolling your scalar version one or two times, because `-O0` sucks too much.
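For instance, a minimal sketch of that idea (`get_freq_unrolled` and the counter names are illustrative, not code from the question): unroll 4x with independent counters, so consecutive iterations don't all serialize through one memory-resident `freq` at `-O0`:

```c
#include <math.h>
#include <float.h>
#include <stddef.h>

int get_freq_unrolled(const float* matrix, float value, size_t n) {
    /* four independent accumulators break the store/reload chain
       that a single counter would create at -O0 */
    int f0 = 0, f1 = 0, f2 = 0, f3 = 0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        f0 += (fabsf(matrix[i + 0] - value) <= FLT_EPSILON);
        f1 += (fabsf(matrix[i + 1] - value) <= FLT_EPSILON);
        f2 += (fabsf(matrix[i + 2] - value) <= FLT_EPSILON);
        f3 += (fabsf(matrix[i + 3] - value) <= FLT_EPSILON);
    }
    for (; i < n; i++)                       /* leftover elements */
        f0 += (fabsf(matrix[i] - value) <= FLT_EPSILON);
    return f0 + f1 + f2 + f3;
}
```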

Peter Cordes

Some compilers are pretty good at optimizing vector code. Did you check the generated assembly of optimized builds of both versions? Isn't the "naive" version actually using SIMD or other optimization techniques?

MirekE
  • Yep... But because of our homework requirements, all the optimisation flags are switched off in the compiler. – Jennifer Q May 14 '16 at 06:58
    @JenniferQ: /facepalm. The `__m128` variables get stored/reloaded between statements in `-O0` compiler output. Optimizing code for `-O0` has significant differences from actually optimizing code to compile to good asm with `-O2` or `-O3`. I was going to look at the asm myself [on godbolt](https://godbolt.org/g/6aAfWu), but the code in your question doesn't even compile. e.g. conflicting variable names, see the error messages in that link. – Peter Cordes May 14 '16 at 09:55
  • It's my bad. I cut the code from a whole program but forgot to change a few things... It should compile fine now. – Jennifer Q May 14 '16 at 10:17