
I'm converting a project to compile with gcc instead of clang, and I've run into an issue with a function that uses SSE intrinsics:

#include <emmintrin.h>  // SSE2 intrinsics

void dodgy_function(
    const short* lows,
    const short* highs,
    short* mins,
    short* maxs,
    int its
)
{
    __m128i v00[2] = { _mm_setzero_si128(), _mm_setzero_si128() }; 
    __m128i v10[2] = { _mm_setzero_si128(), _mm_setzero_si128() };

    for (int i = 0; i < its; ++i) {
        reinterpret_cast<short*>(v00)[i] = lows[i];
        reinterpret_cast<short*>(v10)[i] = highs[i];
    }

    reinterpret_cast<short*>(v00)[its] = reinterpret_cast<short*>(v00)[its - 1];
    reinterpret_cast<short*>(v10)[its] = reinterpret_cast<short*>(v10)[its - 1];

    __m128i v01[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
    __m128i v11[2] = {_mm_setzero_si128(), _mm_setzero_si128()};

    __m128i min[2];
    __m128i max[2];

    min[0] = _mm_min_epi16(_mm_min_epi16(v11[0], v01[0]), _mm_min_epi16(v10[0], v00[0]));
    max[0] = _mm_max_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_max_epi16(v10[0], v00[0]));

    min[1] = _mm_min_epi16(_mm_min_epi16(v11[1], v01[1]), _mm_min_epi16(v10[1], v00[1])); 
    max[1] = _mm_max_epi16(_mm_max_epi16(v11[1], v01[1]), _mm_max_epi16(v10[1], v00[1]));

    reinterpret_cast<__m128i*>(mins)[0] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[0], min[0]);
    reinterpret_cast<__m128i*>(maxs)[0] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[0], max[0]);

    reinterpret_cast<__m128i*>(mins)[1] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[1], min[1]);
    reinterpret_cast<__m128i*>(maxs)[1] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[1], max[1]);
}

Now with clang it gives me the expected output, but with gcc it prints all zeros: godbolt link

Playing around I discovered that gcc gives me the right results when I compile with -O1 but goes wrong with -O2 and -O3, suggesting the optimiser is going awry. Is there something particularly wrong I'm doing that would cause this behavior?

As a workaround I can wrap things up in a union and gcc will then give me the right result, but that feels a little icky: godbolt link 2

Any ideas?

Biggy Smith
  • Yeah, looks like eliminating the aliasing in the write does it: https://godbolt.org/z/bfanne. If you can't force alignment like I did, I think the union is the cleanest way – Steve Cox Mar 12 '21 at 19:58
  • So many reinterpret_cast... – Marc Glisse Mar 12 '21 at 20:09
  • "*the optimiser is going awry*"... what actually happens is you're invoking undefined behaviour due to a strict-aliasing violation, so nothing meaningful can be said about the outcome of the program. – rustyx Mar 12 '21 at 21:03
  • Well, that's part of what I was trying to understand: whether this is a gcc bug or something else, like UB, or both. – Biggy Smith Mar 12 '21 at 21:30
  • Also, should we expect a strict-aliasing violation warning here? – Biggy Smith Mar 12 '21 at 21:37

1 Answer


The problem is that you're using `short*` to access the elements of a `__m128i` object. That violates the strict-aliasing rule. It's only safe to go the other way: use a `__m128i*` dereference or, more normally, `_mm_load_si128( (const __m128i*)ptr )`.

`__m128i*` is exactly like `char*` - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
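
For example, a quick sketch of the two directions (the function names here are just for illustration):

    #include <emmintrin.h>

    // Fine: __m128i* (a may_alias type in GNU C) may point at short data.
    __m128i load8(const short* p) {
        return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    }

    // Not fine: a short* is not allowed to alias a __m128i object.
    void scribble(__m128i& v, short x) {
        reinterpret_cast<short*>(&v)[0] = x;  // strict-aliasing violation (UB)
    }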


The only standard-blessed way to do type punning is with `memcpy`:

    memcpy(v00, lows, its * sizeof(short));
    memcpy(v10, highs, its * sizeof(short));
    memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
    memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1,  sizeof(short));

https://godbolt.org/z/f63q7x
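
If the casts get noisy, the memcpy can be wrapped in a small helper (a rough sketch, and only valid when a full 16 bytes are readable at p); compilers recognize the fixed-size memcpy and emit a single unaligned 16-byte load:

    #include <cstring>
    #include <emmintrin.h>

    // Type-pun 16 bytes into a vector via memcpy (well-defined).
    inline __m128i load_vec(const short* p) {
        __m128i v;
        std::memcpy(&v, p, sizeof(v));
        return v;
    }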

I prefer just using aligned memory of the correct type directly:

    alignas(16) short v00[16];
    alignas(16) short v10[16];
    auto mv00 = reinterpret_cast<__m128i*>(v00);
    auto mv10 = reinterpret_cast<__m128i*>(v10);
    _mm_store_si128(mv00, _mm_setzero_si128());
    _mm_store_si128(mv10, _mm_setzero_si128());
    _mm_store_si128(mv00 + 1, _mm_setzero_si128());
    _mm_store_si128(mv10 + 1, _mm_setzero_si128());

    for (int i = 0; i < its; ++i) {
        v00[i] = lows[i];
        v10[i] = highs[i];
    }

    v00[its] = v00[its - 1];
    v10[its] = v10[its - 1];

https://godbolt.org/z/bfanne

I'm not positive that this setup is actually standard-blessed (it definitely is for `_mm_load_ps`, since you can do it without type punning at all), but it does seem to also fix the issue. I'd guess that any reasonable implementation of the load/store intrinsics is going to have to provide the same sort of aliasing guarantees that `memcpy` does, since it's more or less the kosher way to go from straight-line to vectorized code on x86.

As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre-C++11 contexts. Even in that case, though, I still personally always write the loads and stores explicitly (even if they're just going to/from aligned memory), because issues like this tend to pop up if you don't.
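
For the mins/maxs side of the question's function, that might look something like the rough sketch below (untested; `res_min`/`res_max` stand in for the `min`/`max` arrays computed earlier). Pointing a `__m128i*` at the short arrays is the allowed aliasing direction, and the loads and stores are spelled out:

    #include <emmintrin.h>

    // Fold two result vectors into the mins/maxs arrays with explicit
    // unaligned loads and stores.
    inline void update_minmax(short* mins, short* maxs,
                              const __m128i res_min[2], const __m128i res_max[2]) {
        for (int k = 0; k < 2; ++k) {
            __m128i cur_min = _mm_loadu_si128(reinterpret_cast<const __m128i*>(mins) + k);
            __m128i cur_max = _mm_loadu_si128(reinterpret_cast<const __m128i*>(maxs) + k);
            _mm_storeu_si128(reinterpret_cast<__m128i*>(mins) + k,
                             _mm_min_epi16(cur_min, res_min[k]));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(maxs) + k,
                             _mm_max_epi16(cur_max, res_max[k]));
        }
    }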

Steve Cox
  • `__m128i` is an `__attribute__((may_alias))` type in GNU C; Intel's intrinsics are defined in a way that requires you to be able to point a `__m128i*` at anything else. (But *not* the reverse, which is where the OP went wrong and introduced UB by pointing a `short*` at a `__m128i` object.) [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) – Peter Cordes Mar 13 '21 at 04:49
  • While this is safe, it also looks inefficient vs. SIMD loads from lows and highs (e.g. `_mm_loadu_si128((const __m128i*)lows)`). That's the whole point of using SIMD, after all... Pad the arrays, or check that it won't cross into a page that doesn't contain any of the data you're meant to be loading. You *could* then [look up an AND mask to zero the tail of the array](https://stackoverflow.com/q/34306933), but it might be easiest just to `memset(mins+its, 0, 32 - 2*its)` after doing the SIMD stores: zero the high elements where we stored garbage with SIMD, since we always store 32B – Peter Cordes Mar 13 '21 at 05:01
  • Oh, and forgot to mention: all `_mm_load*` / store intrinsics, even ones that take a `float*`, `int*`, or `__int64*` arg, are strict-aliasing safe. (Or at least supposed to be.) – Peter Cordes Mar 13 '21 at 05:02
  • @PeterCordes I agree, I probably would prefer unaligned loads from lows/highs too, but I wasn't going to rewrite this function since I'm not sure how it's being used and didn't want to deal with partial loads in the case that `its < 16` or god forbid `its < 8`. I think generally the whole function could do with a nice rewrite – Steve Cox Mar 13 '21 at 12:43
  • Well like I said, you only need to check that `((uintptr_t)lows & 4095) <= (4096-32)` before you just go ahead and load 32 bytes, or maybe mix `its` into that somehow. Otherwise use a slow fallback for the unlikely case of page-crossing. But if you want to avoid even that much rewriting, I'd probably do `alignas(16) short v00[16] = {0};` to ask the compiler to zero-init the array instead of writing 4 intrinsics to do what the compiler is already going to do. – Peter Cordes Mar 13 '21 at 12:51
  • @PeterCordes Isn't that assuming you know the page size (which you honestly probably do) and, generally, that you know you're on a paged system? Or does Intel itself provide those guarantees somewhere regardless of OS? (And yeah, I think the compiler can actually vectorize this entire function; it figured out the memcpy.) – Steve Cox Mar 13 '21 at 13:03
  • x86 intrinsics only work on x86, and there's no forseeable way the page size will ever get *smaller* than 4k, which is x86's hardware page size, the granularity with which OSes can do memory protection. Yes it's theoretically possible you'd be on a system using segment limits with finer granularity, although not in 64-bit code. Checking for "not end of page" special cases is something you'll see in hand-written asm `strlen` or `strcmp` implementations like glibc's, where it's more useful to be able to start on vectors right away but with unaligned implicit-length strings you otherwise can't. – Peter Cordes Mar 13 '21 at 13:07
  • Even in a hypothetical x86 OS using segmentation, it's very likely the OS would be setting segment limits with 4k granularity, because it's either that or bytes, and the field is only 20 bits wide. So any program that wants more than a 1MiB segment needs to be using 4k granularity. (Those 1B or 4k are the 2 options for segment-limit granularity in a GDT entry. https://wiki.osdev.org/GDT). Of course in such a system it's possible that that 32-bit segment base is not a multiple of 16, let alone 4k, and seg:off -> linear happens before paging, so a small offset could cross a page boundary... – Peter Cordes Mar 13 '21 at 13:14
  • Basically everyone and their dog assumes that their x86 code is going to run on a system with a flat memory model, especially because 64-bit long mode makes that a requirement. Especially any "high performance" code, it's not a risky assumption. I would at least comment what it's doing, though, checking for not near end of page. There might be a standard header which has a page-size constant, but I wouldn't call `sysconf(_SC_PAGESIZE)` because that probably won't be a compile-time constant. – Peter Cordes Mar 13 '21 at 13:16
  • Oh duh. For some reason I thought the smallest page size used to be 512. It's hard going back and forth between completely hardware-agnostic C++ and SSE/NEON/AltiVec with all of their implications – Steve Cox Mar 13 '21 at 13:20
  • The only standard header I know of that helps with anything like this is [hardware_destructive_interference_size](https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size) in `<new>`, and it still wasn't implemented in most compilers as of late last year. I'm not hoping to get the page size from anywhere for a long time. unistd.h always has getpagesize(), but that's runtime. – Steve Cox Mar 13 '21 at 13:25
  • Right, ISO C/C++ don't standardize the notion of pages at all. IIRC, gcc and clang still don't really implement `hardware_destructive_interference_size` because any constant they pick for it could become part of a library ABI (struct layout and/or size), which would make it *not* possible to actually use correct values on far future hardware where it's different from now. (cache line size has been 64 bytes (= DDR burst transfer size) for a long time but *could* change. And it's only adjacent-line HW prefetch making 128 bytes a current good choice for destructive_interference_size.) – Peter Cordes Mar 13 '21 at 13:29