gcc optimization produces slower code

Question

I am trying to compile the following code with gcc 4.8.2. If I compile it with `g++ -mavx2 -O0 10bit.cpp`, the `time` command reports:

real 0m0.117s

user 0m0.116s

sys 0m0.000s

But when I enable optimization with `g++ -mavx2 -O3 10bit.cpp`, the `time` command shows a longer execution time:

real 0m0.164s

user 0m0.164s

sys 0m0.000s

My CPU is an Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz, which supports AVX2. Also, if I use SSE4.1 instructions instead of AVX2, my program completes much faster. Can someone please explain this?

#include <stdint.h>
#include "immintrin.h"

uint32_t int_values[8] __attribute__ ((__aligned__(32))); 
unsigned char buf[]    __attribute__ ((__aligned__(32))) = {
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21,
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21,
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21,
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21,
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21,
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21,
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21,
    0xFF, 0x9E, 0x8D, 0xCC, 0xBB, 0xAA, 0x99, 0x88, 0x77, 0x66, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x20, 0x21
};
unsigned char out[180] __attribute__ ((__aligned__(32)));
__m256i *__m_int_vals   = (__m256i *)int_values; 

int main() {
    for(int c = 0; c < 204800; c++) {
        for(int i = 0, j=0; i < sizeof(buf); i+=40, j+=32) {

            uint8_t *b = &buf[i];
            (* __m_int_vals) = _mm256_set_epi8(
                    b[35], b[36], b[37], b[38],
                    b[31], b[32], b[33], b[34],
                    b[26], b[27], b[28], b[29],
                    b[21], b[22], b[23], b[24],
                    b[16], b[17], b[18], b[19],
                    b[11], b[12], b[13], b[14],
                    b[6] , b[7] , b[8] , b[9],
                    b[1] , b[2] , b[3] , b[4]
                    );
            out[j]    = b[0];
            out[j+4]  = b[5];
            out[j+8]  = b[10];
            out[j+12] = b[15];
            out[j+16] = b[20];
            out[j+20] = b[25];
            out[j+24] = b[30];
            out[j+28] = b[35];
            (* __m_int_vals)  = _mm256_srli_epi32((*__m_int_vals), 2);
            out[j+3]  = int_values[0];
            out[j+7]  = int_values[1];
            out[j+11] = int_values[2];
            out[j+15] = int_values[3];
            out[j+19] = int_values[4];
            out[j+23] = int_values[5];
            out[j+27] = int_values[6];
            out[j+31] = int_values[7];
            (* __m_int_vals)  = _mm256_srli_epi32((*__m_int_vals), 10);
            out[j+2]  = int_values[0];
            out[j+6]  = int_values[1];
            out[j+10] = int_values[2];
            out[j+14] = int_values[3];
            out[j+18] = int_values[4];
            out[j+22] = int_values[5];
            out[j+26] = int_values[6];
            out[j+30] = int_values[7];
            (* __m_int_vals) = _mm256_srli_epi32((*__m_int_vals), 10);
            out[j+1]  = int_values[0];
            out[j+5]  = int_values[1];
            out[j+9]  = int_values[2];
            out[j+13] = int_values[3];
            out[j+17] = int_values[4];
            out[j+21] = int_values[5];
            out[j+25] = int_values[6];
            out[j+29] = int_values[7];
        }
    }
}
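
For reference, the inner loop above appears to convert packed 10-bit samples to 8 bits: every 5 input bytes hold four 10-bit values, and the top 8 bits of each are written to `out`. A minimal scalar sketch of that computation is shown here as a correctness baseline. It assumes the last lane was meant to read `b[36]`..`b[39]` (the posted code reads `b[35]`..`b[38]`), and `unpack10to8_scalar` is a hypothetical name, not something from the question. The cumulative shifts of 2, 12 and 22 match the three `_mm256_srli_epi32` steps.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the question): scalar version of the unpacking.
// Each group of 5 input bytes yields 4 output bytes.
void unpack10to8_scalar(const std::uint8_t* in, std::size_t in_len, std::uint8_t* out) {
    for (std::size_t i = 0, j = 0; i + 5 <= in_len; i += 5, j += 4) {
        const std::uint8_t* b = in + i;
        // Big-endian 32-bit word from bytes 1..4, the same byte order that
        // _mm256_set_epi8 builds in each 32-bit lane.
        std::uint32_t v = (std::uint32_t(b[1]) << 24) | (std::uint32_t(b[2]) << 16) |
                          (std::uint32_t(b[3]) << 8)  |  std::uint32_t(b[4]);
        out[j]     = b[0];                   // top 8 bits of the first sample
        out[j + 1] = std::uint8_t(v >> 22);  // after shifts of 2 + 10 + 10
        out[j + 2] = std::uint8_t(v >> 12);  // after shifts of 2 + 10
        out[j + 3] = std::uint8_t(v >> 2);   // after the first shift of 2
    }
}
```

Because it uses no pointer casts or intrinsics, a loop like this could serve to check the output of the SSE4.1 and AVX2 versions before comparing their timings.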

Your code violates the strict aliasing rule; so use the gcc switch `-fno-strict-aliasing` in both compilations — M.M, Nov 27 '15 at 21:59
That's a pretty short benchmark interval. Did you try profiling (e.g. with `perf`) to look for stalls? Also, @harold's suggestion of `_mm256_zeroupper()` is unlikely to help, since there are no function calls to legacy-SSE code in the loop. — Peter Cordes, Nov 27 '15 at 22:05
The x265 library used AVX2 optimizations in some parts of its code, but found that they made things run slower. The reason was that AVX2 instructions require a lot of power, and using them caused the CPU to exceed its power budget. In turn, the CPU had to lower its clock speed so that it could provide the needed power. I doubt you're exceeding your CPU's power budget, but it's a useful thing to keep in mind when using AVX2. — Cornstalks, Nov 27 '15 at 22:05
That `_mm256_set_epi8` is compiled to something absolutely atrocious btw. — harold, Nov 27 '15 at 22:10
Hey, changing `i < 180` to `i < sizeof(buf)` completely changes the issue. Which case did you test against? Changing the question like this is considered poor form, especially since it was about 12 minutes after I answered the question. — Shafik Yaghmour, Nov 27 '15 at 22:20
@ShafikYaghmour I changed the size of out, but still I see the same behaviour — Masoud, Nov 27 '15 at 22:26
@ShafikYaghmour That edit didn't change the issue. With `i < 180` it just ran even further out of bounds than it does with `i < sizeof(buf)`. Although the more recent edit to increase the size of `buf` and change the type and dimension of `out` does change the issue... — M.M, Nov 27 '15 at 22:36
@Masoud it seems pretty unlikely that you would make those code changes but still have the benchmark produce exactly the same output... downvoting — M.M, Nov 27 '15 at 22:37
Going out of bounds isn't the issue anyway. The code doesn't change if you make the buffer big enough. — harold, Nov 27 '15 at 22:39
@harold Huh? going out of bounds is always an issue, and the code did change (see edit history) — M.M, Nov 27 '15 at 22:41
@M.M it is bad, but it is not causing this. And his code changed, but the GCC generated code does not. Like [this](https://www.diffchecker.com/wdbj1mbt) — harold, Nov 27 '15 at 22:42
@harold even if that's true, the original code was benchmarking operations on garbage input, since it read beyond `buf`. We cannot make any definitive statements about what is "causing this" until the problem is reproduced by code that does not cause undefined behaviour. (The strict aliasing bug still has not been fixed either). — M.M, Nov 27 '15 at 22:46
I am deleting my answer; there is still undefined behavior, gcc just does not warn about it anymore. Benchmarks are still meaningless in the face of undefined behavior. — Shafik Yaghmour, Nov 27 '15 at 22:46
@ShafikYaghmour Your answer didn't answer my questions. My questions are why optimizing this code makes it slower and why SSE is much faster than AVX2. Even with these changes, this code produces the same result. Also, I made the change before I saw your answer — Masoud, Nov 27 '15 at 22:56
@FUZxxl, it's more of a habit; compiling with gcc also gives the same result for this code — Masoud, Nov 27 '15 at 22:59
@Masoud Okay, so I'm removing the C tag then as this is not about C: If you compile with g++, your code is C++ code. Please do not compile C code as C++, these two are different languages with slightly different semantics. — fuz, Nov 27 '15 at 23:03
@FUZxxl I'd suggest that OP uses C and changes to use `gcc` instead of `g++`. That way, the strict aliasing bug can be solved more easily. — M.M, Nov 28 '15 at 00:28
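
On the point about the short benchmark interval: one common way to get steadier numbers than a single `time` run is to time the workload in-process several times and keep the fastest run. The following is only a sketch under that assumption. `best_time_ns` is a hypothetical helper, not part of the question's code, and it measures wall-clock time with `std::chrono::steady_clock`.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: calls `work` `repeats` times and returns the fastest
// wall-clock duration in nanoseconds. With repeats <= 0 nothing is run and
// the maximum int64 value is returned.
template <typename F>
std::int64_t best_time_ns(F&& work, int repeats) {
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        best = std::min(best, ns);  // keep the least-disturbed run
    }
    return best;
}
```

The body of `main` from the question could be passed in as a lambda. The compiler may still remove work whose results are never read, so the output buffer should be consumed somewhere (for example, by printing a checksum) for the timing to mean anything.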
