
I am trying to work with SSE and I ran into some strange behaviour.

I wrote simple code for comparing two strings with SSE intrinsics, ran it, and it worked. But later I realized that in my code one of the pointers is still not aligned, even though I use the _mm_load_si128 instruction, which requires a pointer aligned on a 16-byte boundary.

#include <immintrin.h>   // SSE4.1 / AVX intrinsics
#include <cstdio>        // printf
#include <cstdint>       // uintptr_t
#include <cstddef>       // size_t

//Compare two different, non-overlapping pieces of memory
__attribute__((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
{
    //Compare the leading bytes one-by-one until [head_1] is aligned to 16 bytes
    const char* head_1 = (const char*)src_1;
    const char* head_2 = (const char*)src_2;
    size_t tail_n = 0;
    while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
    {                                
        if (*head_1 != *head_2)
            return 0;
        head_1++, head_2++, tail_n++;
    }

    //Vectorized part: check equality of memory with SSE4.1 instructions
    //src1 - aligned, src2 - NOT aligned
    const __m128i* src1 = (const __m128i*)head_1;
    const __m128i* src2 = (const __m128i*)head_2;
    const size_t n = (size - tail_n) / 32;    
    for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
    {
        printf("src1 align: %d, src2 align: %d\n", align(src1) % 16, align(src2) % 16);
        __m128i mm11 = _mm_load_si128(src1);
        __m128i mm12 = _mm_load_si128(src1 + 1);
        __m128i mm21 = _mm_load_si128(src2);
        __m128i mm22 = _mm_load_si128(src2 + 1);

        __m128i mm1 = _mm_xor_si128(mm11, mm21);
        __m128i mm2 = _mm_xor_si128(mm12, mm22);

        __m128i mm = _mm_or_si128(mm1, mm2);

        if (!_mm_testz_si128(mm, mm))
            return 0;
    }

    //Check tail with scalar instructions
    const size_t rem = (size - tail_n) % 32;
    const char* tail_1 = (const char*)src1;
    const char* tail_2 = (const char*)src2;
    for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
    {
        if (*tail_1 != *tail_2)
            return 0;   
    }
    return 1;
}

I printed the alignment of the two pointers: one of them was aligned, but the second wasn't. Yet the program still ran correctly and fast.

Then I created a synthetic test like this:

//printChars128(...) just prints the 16 byte values from an __m128i
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1);
for (int i = 0; i < 5; i++, A++, B++)
{
    __m128i A1 = _mm_load_si128(A);
    __m128i B1 = _mm_load_si128(B);
    printChars128(A1);
    printChars128(B1);
}

And it crashes, as expected, on the first iteration, when it tries to load through pointer B.

An interesting fact: if I switch the target to sse4.2, then my implementation of is_equal crashes.

Another interesting fact: if I align the second pointer instead of the first (so the first pointer is not aligned and the second is aligned), then is_equal crashes.

So, my question is: "Why does the is_equal function work fine with only the first pointer aligned if I enable AVX instruction generation?"

UPD: This is C++ code. I compile my code with MinGW64/g++, gcc version 4.9.2 under Windows, x86.

Compile string: g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe

Nikita Sivukhin
  • Have you inspected the assembly to make sure sse4.2 is actually sse4.2 and avx is actually avx? Knowing the opcodes involved in the loads will help understand the situation – Daniel Jul 18 '16 at 18:25
  • VEX-encoded instructions that take memory operands (excluding aligned moves) do not need to be aligned. Specifying AVX will make the compiler use VEX-encoded instructions. IOW, you got (un)lucky when it happened to work when you turned on AVX. It can still crash if GCC decides to use any normal (aligned) moves. – Mysticial Jul 18 '16 at 18:26
  • If the load got rolled into an argument, it loses the alignment requirements (except with legacy encoding), disassemble to confirm – harold Jul 18 '16 at 18:28
  • @Mysticial But why does my synthetic test crash? I think I created the same situation as in the `is_equal` function... – Nikita Sivukhin Jul 18 '16 at 18:28
  • But it's not the same; there are no instructions there that the load could be rolled into, unlike the first case. – harold Jul 18 '16 at 18:32
  • Is that C or C++? They are different languages and your code does make use of the differences! – too honest for this site Jul 18 '16 at 18:34
  • @Olaf This question is valid for both C and C++. – Mysticial Jul 18 '16 at 18:36
  • @Mysticial: Casting `void *` is deprecated in C and required in C++. Using C-style casts is deprecated in C++. So no, the code is not good in both languages! – too honest for this site Jul 18 '16 at 18:53
  • @Olaf I don't know if you've noticed already, but this question has absolutely nothing to do with C-style casts. But you're free to pounce on anything you like. – Mysticial Jul 18 '16 at 18:57
  • @Mysticial: So you did not notice the casts ... – too honest for this site Jul 18 '16 at 18:59
  • @Olaf I'm just saying that the casts don't matter for this question. This question is about SSE and alignment, not about proper C++ coding style. But I can't stop anyone from nitpicking on that anyways. – Mysticial Jul 18 '16 at 19:06
  • Can you list your Compiler flags? Do you enable optimization? – Rotem Jul 18 '16 at 19:11
  • Is is possible that in your main test, `src_1 = src_2`? – Rotem Jul 18 '16 at 19:16
  • @Rotem no, these pointers are different. I pointed this out in the first line of the code. Actually, in the main test, I read two `char*` strings and compare some inner substrings. – Nikita Sivukhin Jul 18 '16 at 19:19
  • Try the following: Disable all optimizations. Run step by step using the debugger, and verify `_mm_load_si128(src2);` and `_mm_load_si128(src2 + 1);` are two separate commands. – Rotem Jul 18 '16 at 19:28
  • When AVX is enabled, it might be that two sequential 128-bit load operations are **fused** into a single 256-bit unaligned AVX load operation. (I don't know much about micro-fusion, but it could be related). – Rotem Jul 18 '16 at 19:31
  • @Rotem: micro-fusion is when `vpxor xmm1, xmm0, [mem]` is decoded to a single uop. What's happening here is compile-time **folding** of the `_mm_load_si128` into a memory operand. So you get `vpxor` with a memory operand instead of `vmovdqa xmm1, [mem]` / `vpxor xmm1, xmm1, xmm0`. VMOVDQA will fault on unaligned, but VPXOR won't. Neither of these things will combine two sequential 128bit ops into a single 256bit op. That would require a clever compiler (and only be possible with AVX2). – Peter Cordes Jul 19 '16 at 01:58

1 Answer


TL:DR: Loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands to other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for specifically-aligned load/store instructions like vmovdqa.


In the legacy SSE encoding of vector instructions (like pxor xmm0, [src1]), unaligned 128-bit memory operands will fault, except with the special unaligned load/store instructions (like movdqu / movups).

The VEX encoding of vector instructions (like vpxor xmm1, xmm0, [src1]) doesn't fault on unaligned memory, except with the alignment-required load/store instructions (like vmovdqa or vmovntdq).


The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a stand-alone load instruction. (Or to emit anything at all if it already has the data in a register, just like when dereferencing a scalar pointer.)

The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce the risk of a fault. This is advantageous for code density, and it also means fewer uops for parts of the CPU to track, thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an unoptimized build of your code probably would have faulted with unaligned src1.

(Conversely, this means _mm_loadu_* can only fold into a memory operand with AVX, but not with SSE. So even on CPUs where movdqu is as fast as movdqa when the pointer does happen to be aligned, _mm_loadu can hurt performance, because movdqu xmm1, [rsi] / pxor xmm0, xmm1 is 2 fused-domain uops for the front-end to issue, while pxor xmm0, [rsi] is only 1. And it doesn't need a scratch register. See also Micro fusion and addressing modes).
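For illustration, here is a minimal sketch (the check function and the asm in the comments are mine, not from the question; the exact output depends on compiler version and flags):

#include <immintrin.h>

// Compiled with gcc -O2 -mavx, the _mm_load_si128 below is typically folded
// into the XOR:   vpxor xmm0, xmm0, XMMWORD PTR [rdi]
// so nothing checks the alignment of p, even though the "aligned" intrinsic
// was used.  At -O0, or when targeting legacy SSE, you are more likely to get
// a stand-alone movdqa/vmovdqa load, which does fault on an unaligned pointer.
int check(__m128i v, const __m128i* p)
{
    __m128i x = _mm_load_si128(p);     // "aligned" load intrinsic
    __m128i d = _mm_xor_si128(v, x);   // the load can be folded into this
    return _mm_testz_si128(d, d);      // 1 if every byte matched (SSE4.1)
}

With the legacy SSE encoding, only the aligned-load intrinsic can be folded this way; a _mm_loadu_si128 has to stay a separate movdqu unless the VEX encoding (AVX) is available.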

The interpretation of the as-if rule in this case is that it's ok for the program to not fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build).

This is opposite from the rules for floating-point exceptions, where the compiler-generated code must still raise any and all exceptions that would have occurred on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.


Note that since stores can't fold into memory operands for ALU instructions, store (not storeu) intrinsics will compile into code that faults with unaligned pointers even when compiling for an AVX target.
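A minimal sketch of that point (my own fragment, assuming gcc with -mavx):

#include <immintrin.h>

// Even with -mavx, the aligned store compiles to vmovdqa and faults if
// dst_a is not 16-byte aligned; there is no ALU instruction to fold it into.
// Use _mm_storeu_si128 (vmovdqu) when the destination might be unaligned.
void store_zero(__m128i* dst_a, __m128i* dst_u)
{
    __m128i z = _mm_setzero_si128();
    _mm_store_si128(dst_a, z);    // vmovdqa: requires 16-byte alignment
    _mm_storeu_si128(dst_u, z);   // vmovdqu: no alignment requirement
}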


To be specific: consider this code fragment:

// aligned version:
y = ...;                         // assume it's in xmm1
x = _mm_load_si128(Aptr);        // Aligned pointer
res = _mm_or_si128(y, x);

// unaligned version: the same thing with _mm_loadu_si128(Uptr)

When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might do that too, if the old value of y is still needed after the OR.

When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128 bit) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the 3-operand non-destructive feature means that this actually happens except when the original value loaded is used multiple times).

Only one operand to ALU instructions can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer isn't aligned, but doesn't care about alignment for the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than vice-versa.

You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest. :( clang-3.8 does a significantly better job.

.L36:
        add     rdi, 32
        add     rsi, 32
        cmp     rdi, rcx
        je      .L9
.L10:
        vmovdqa xmm0, XMMWORD PTR [rdi]           # first arg: loads that will fault on unaligned
        xor     eax, eax
        vpxor   xmm1, xmm0, XMMWORD PTR [rsi]     # second arg: loads that don't care about alignment
        vmovdqa xmm0, XMMWORD PTR [rdi+16]        # first arg
        vpxor   xmm0, xmm0, XMMWORD PTR [rsi+16]  # second arg
        vpor    xmm0, xmm1, xmm0
        vptest  xmm0, xmm0
        sete    al                                 # generate a boolean in a reg
        test    eax, eax
        jne     .L36                               # then test&branch on it.  /facepalm

Note that your is_equal is essentially memcmp (restricted to an equality test). I think glibc's memcmp will do better than your implementation in many cases, since it has hand-written asm versions for SSE4.1 and others which handle various cases of the buffers being misaligned relative to each other (e.g. one aligned, one not). Note that glibc code is LGPLed, so you might not be able to just copy it. If your use-case has smaller buffers that are typically aligned, your implementation is probably good. Not needing a VZEROUPPER before calling it from other AVX code is also nice.

The compiler-generated byte loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter that you re-compare some bytes you've already checked.
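A sketch of that idea (the helper name is mine; it assumes size >= 16):

#include <immintrin.h>
#include <cstddef>

// Compare the last 16 bytes of each buffer with one unaligned load per side,
// positioned so the load ends exactly at the last byte.  Re-comparing a few
// bytes the main loop already covered is harmless for an equality test.
static int tail_equal(const char* p1, const char* p2, size_t size)
{
    __m128i a = _mm_loadu_si128((const __m128i*)(p1 + size - 16));
    __m128i b = _mm_loadu_si128((const __m128i*)(p2 + size - 16));
    __m128i d = _mm_xor_si128(a, b);
    return _mm_testz_si128(d, d);   // SSE4.1
}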

Anyway, definitely benchmark your code with the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it can inline code for.
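For the baseline, the memcmp equivalent of your function is simply (sketch):

#include <cstring>
#include <cstddef>

int is_equal_memcmp(const void* src_1, const void* src_2, size_t size)
{
    return memcmp(src_1, src_2, size) == 0;   // gcc can inline this or call the library
}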

Peter Cordes
  • Sorry for this late comment, but why does the program reliably crash if I don't align the first pointer? Is this because I'm unlucky and the compiler decided not to use VEX-encoded instructions? Or is something else wrong? – Nikita Sivukhin Jul 31 '16 at 18:10
  • @NikitaSivukhin: `vmovdqa` still requires alignment. Use `_mm_loadu_...` if you want unaligned pointers to only be a potential performance problem, instead of a potential crash. (The compiler will use `vmovdqu` for loads/stores.) – Peter Cordes Jul 31 '16 at 18:27
  • Okay, I forgot about it... =( But there is a strange thing about your asm from Godbolt: `vmovdqa` is used with `XMMWORD PTR [rsi+16]` and `XMMWORD PTR [rdi]`, which are two different pointers - one into src_1 and one into src_2. One of them is unaligned, so this code must crash. So is it true that slightly different code is generated on my computer, or am I wrong? – Nikita Sivukhin Jul 31 '16 at 19:11
  • @NikitaSivukhin: Yes, it looks like the code generated by gcc 5.3 will fault if either pointer arg is unaligned. Oh, I guess that's not what the text of my answer says. Is that what you were trying to point out? I changed the godbolt link to using gcc 4.9.2 like you were using. It's still targeting the x86-64 SysV ABI, not mingw, but should behave like you described. You can of course just check yourself with `objdump` or `gcc -S`; godbolt is just a text filter to clean up the compiler asm output. – Peter Cordes Jul 31 '16 at 20:52