Wrong gcc generated assembly ordering, results in performance hit

Question

I have got the following code, which copies data from memory to DMA buffer:

for (; likely(l > 0); l-=128)
{
    __m256i m0 = _mm256_load_si256( (__m256i*) (src) );
    __m256i m1 = _mm256_load_si256( (__m256i*) (src+32) );
    __m256i m2 = _mm256_load_si256( (__m256i*) (src+64) );
    __m256i m3 = _mm256_load_si256( (__m256i*) (src+96) );

    _mm256_stream_si256( (__m256i *) (dst), m0 );
    _mm256_stream_si256( (__m256i *) (dst+32), m1 );
    _mm256_stream_si256( (__m256i *) (dst+64), m2 );
    _mm256_stream_si256( (__m256i *) (dst+96), m3 );

    src += 128;
    dst += 128;
}

That is how gcc assembly output looks like:

405280:       c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
405285:       c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
40528a:       c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
40528f:       c5 fd 6f 18             vmovdqa (%rax),%ymm3
405293:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
405297:       c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
40529c:       c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
4052a1:       c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
4052a6:       c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
4052aa:       48 83 ea 80             sub    $0xffffffffffffff80,%rdx
4052ae:       48 39 c8                cmp    %rcx,%rax
4052b1:       75 cd                   jne    405280 <sender_body+0x6e0>

Note the reordering of last vmovdqa and vmovntdq instructions. With the gcc generated code above I am able to reach throughput of ~10 227 571 packets per second in my application.

Next, I reorder that instructions manually in hexeditor. That means now the loop looks the following way:

405280:       c5 fd 6f 18             vmovdqa (%rax),%ymm3
405284:       c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
405289:       c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
40528e:       c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
405293:       48 83 e8 80             sub    $0xffffffffffffff80,%rax
405297:       c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
40529b:       c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
4052a0:       c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
4052a5:       c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
4052aa:       48 83 ea 80             sub    $0xffffffffffffff80,%rdx
4052ae:       48 39 c8                cmp    %rcx,%rax
4052b1:       75 cd                   jne    405280 <sender_body+0x6e0>

With the properly ordered instructions I get ~13 668 313 packets per second. So it is obvious that reordering introduced by gcc reduces performance.

Have you come across that? Is this a known bug or should I fill a bug report?

Compilation flags:

-O3 -pipe -g -msse4.1 -mavx

My gcc version:

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Not directly related to your question, but can `src` and `dest` overlap? If not, using the `restrict` keyword on both would probably allow the compiler to generate code that's more efficient than either version... — R.. GitHub STOP HELPING ICE, Sep 11 '14 at 03:21
Good point, however `restrict` keyword would not change anything in case of simple one-to-one copying like that. — Piotr Jurkiewicz, Sep 11 '14 at 04:09
It doesn't seem like a bug to me, unless this causes the actual behaviour of the program to differ... Just a guess, but have you considered using `volatile __m256i`? — autistic, Dec 29 '15 at 14:26
@Seb: Performance bugs are one class of compiler bugs. reported as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622 — Peter Cordes, Feb 02 '16 at 12:17

score 11 · Answer 1 · edited May 23 '17 at 11:58

11

I find this problem interesting. GCC is known for producing less than optimal code, but I find it fascinating to find ways to "encourage" it to produce better code (for hottest/bottleneck code only, of course), without micro-managing too heavily. In this particular case, I looked at three "tools" I use for such situations:

volatile: If it is important the memory accesses occur in specific order, then volatile is a suitable tool. Note that it can be overkill, and will lead to a separate load every time a volatile pointer is dereferenced.

SSE/AVX load/store intrinsics can't be used with volatile pointers, because they are functions. Using something like _mm256_load_si256((volatile __m256i *)src); implicitly casts it to const __m256i*, losing the volatile qualifier.

We can directly dereference volatile pointers, though. (load/store intrinsics are only needed when we need to tell the compiler that the data might be unaligned, or that we want a streaming store.)
```
m0 = ((volatile __m256i *)src)[0];
m1 = ((volatile __m256i *)src)[1];
m2 = ((volatile __m256i *)src)[2];
m3 = ((volatile __m256i *)src)[3];
```
Unfortunately this doesn't help with the stores, because we want to emit streaming stores. A *(volatile...)dst = tmp; won't give us what we want.
__asm__ __volatile__ (""); as a compiler reordering barrier.

This is the GNU C was of writing a compiler memory-barrier. (Stopping compile-time reordering without emitting an actual barrier instruction like mfence). It stops the compiler from reordering memory accesses across this statement.
Using an index limit for loop structures.

GCC is known for pretty poor register usage. Earlier versions made a lot of unnecessary moves between registers, although that is pretty minimal nowadays. However, testing on x86-64 across many versions of GCC indicate that in loops, it is better to use an index limit, rather than a independent loop variable, for best results.

Combining all the above, I constructed the following function (after a few iterations):

#include <stdlib.h>
#include <immintrin.h>

#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

void copy(void *const destination, const void *const source, const size_t bytes)
{
    __m256i       *dst = (__m256i *)destination;
    const __m256i *src = (const __m256i *)source;
    const __m256i *end = (const __m256i *)source + bytes / sizeof (__m256i);

    while (likely(src < end)) {
        const __m256i m0 = ((volatile const __m256i *)src)[0];
        const __m256i m1 = ((volatile const __m256i *)src)[1];
        const __m256i m2 = ((volatile const __m256i *)src)[2];
        const __m256i m3 = ((volatile const __m256i *)src)[3];

        _mm256_stream_si256( dst,     m0 );
        _mm256_stream_si256( dst + 1, m1 );
        _mm256_stream_si256( dst + 2, m2 );
        _mm256_stream_si256( dst + 3, m3 );

        __asm__ __volatile__ ("");

        src += 4;
        dst += 4;
    }
}

Compiling it (example.c) using GCC-4.8.4 using

gcc -std=c99 -mavx2 -march=x86-64 -mtune=generic -O2 -S example.c

yields (example.s):

        .file   "example.c"
        .text
        .p2align 4,,15
        .globl  copy
        .type   copy, @function
copy:
.LFB993:
        .cfi_startproc
        andq    $-32, %rdx
        leaq    (%rsi,%rdx), %rcx
        cmpq    %rcx, %rsi
        jnb     .L5
        movq    %rsi, %rax
        movq    %rdi, %rdx
        .p2align 4,,10
        .p2align 3
.L4:
        vmovdqa (%rax), %ymm3
        vmovdqa 32(%rax), %ymm2
        vmovdqa 64(%rax), %ymm1
        vmovdqa 96(%rax), %ymm0
        vmovntdq        %ymm3, (%rdx)
        vmovntdq        %ymm2, 32(%rdx)
        vmovntdq        %ymm1, 64(%rdx)
        vmovntdq        %ymm0, 96(%rdx)
        subq    $-128, %rax
        subq    $-128, %rdx
        cmpq    %rax, %rcx
        ja      .L4
        vzeroupper
.L5:
        ret
        .cfi_endproc
.LFE993:
        .size   copy, .-copy
        .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
        .section        .note.GNU-stack,"",@progbits

The disassembly of the actual compiled (-c instead of -S) code is

0000000000000000 <copy>:
   0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
   4:   48 8d 0c 16             lea    (%rsi,%rdx,1),%rcx
   8:   48 39 ce                cmp    %rcx,%rsi
   b:   73 41                   jae    4e <copy+0x4e>
   d:   48 89 f0                mov    %rsi,%rax
  10:   48 89 fa                mov    %rdi,%rdx
  13:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  18:   c5 fd 6f 18             vmovdqa (%rax),%ymm3
  1c:   c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
  21:   c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
  26:   c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
  2b:   c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
  2f:   c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
  34:   c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
  39:   c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
  3e:   48 83 e8 80             sub    $0xffffffffffffff80,%rax
  42:   48 83 ea 80             sub    $0xffffffffffffff80,%rdx
  46:   48 39 c1                cmp    %rax,%rcx
  49:   77 cd                   ja     18 <copy+0x18>
  4b:   c5 f8 77                vzeroupper 
  4e:   c3                      retq

Without any optimizations at all, the code is completely disgusting, full of unnecessary moves, so some optimization is necessary. (The above uses -O2, which is generally the optimization level I use.)

If optimizing for size (-Os), the code looks excellent at first glance,

0000000000000000 <copy>:
   0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
   4:   48 01 f2                add    %rsi,%rdx
   7:   48 39 d6                cmp    %rdx,%rsi
   a:   73 30                   jae    3c <copy+0x3c>
   c:   c5 fd 6f 1e             vmovdqa (%rsi),%ymm3
  10:   c5 fd 6f 56 20          vmovdqa 0x20(%rsi),%ymm2
  15:   c5 fd 6f 4e 40          vmovdqa 0x40(%rsi),%ymm1
  1a:   c5 fd 6f 46 60          vmovdqa 0x60(%rsi),%ymm0
  1f:   c5 fd e7 1f             vmovntdq %ymm3,(%rdi)
  23:   c5 fd e7 57 20          vmovntdq %ymm2,0x20(%rdi)
  28:   c5 fd e7 4f 40          vmovntdq %ymm1,0x40(%rdi)
  2d:   c5 fd e7 47 60          vmovntdq %ymm0,0x60(%rdi)
  32:   48 83 ee 80             sub    $0xffffffffffffff80,%rsi
  36:   48 83 ef 80             sub    $0xffffffffffffff80,%rdi
  3a:   eb cb                   jmp    7 <copy+0x7>
  3c:   c3                      retq

until you notice that the last jmp is to the comparison, essentially doing a jmp, cmp, and a jae at every iteration, which probably yields pretty poor results.

Note: If you do something similar for real-world code, please do add comments (especially for the __asm__ __volatile__ ("");), and remember to periodically check with all compilers available, to make sure the code is not compiled too badly by any.

Looking at Peter Cordes' excellent answer, I decided to iterate the function a bit further, just for fun.

As Ross Ridge mentions in the comments, when using _mm256_load_si256() the pointer is not dereferenced (prior to being re-cast to aligned __m256i * as a parameter to the function), thus volatile won't help when using _mm256_load_si256(). In another comment, Seb suggests a workaround: _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) }), which supplies the function with a pointer to src by accessing the element via a volatile pointer and casting it to an array. For a simple aligned load, I prefer the direct volatile pointer; it matches my intent in the code. (I do aim for KISS, although often I hit only the stupid part of it.)

On x86-64, the start of the inner loop is aligned to 16 bytes, so the number of operations in the function "header" part is not really important. Still, avoiding the superfluous binary AND (masking the five least significant bits of the amount to copy in bytes) is certainly useful in general.

GCC provides two options for this. One is the __builtin_assume_aligned() built-in, which allows a programmer to convey all sorts of alignment information to the compiler. The other is typedef'ing a type that has extra attributes, here __attribute__((aligned (32))), which can be used to convey the alignedness of function parameters for example. Both of these should be available in clang (although support is recent, not in 3.5 yet), and may be available in others such as icc (although ICC, AFAIK, uses __assume_aligned()).

One way to mitigate the register shuffling GCC does, is to use a helper function. After some further iterations, I arrived at this, another.c:

#include <stdlib.h>
#include <immintrin.h>

#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)

#if (__clang_major__+0 >= 3)
#define IS_ALIGNED(x, n) ((void *)(x))
#elif (__GNUC__+0 >= 4)
#define IS_ALIGNED(x, n) __builtin_assume_aligned((x), (n))
#else
#define IS_ALIGNED(x, n) ((void *)(x))
#endif

typedef __m256i __m256i_aligned __attribute__((aligned (32)));


void do_copy(register          __m256i_aligned *dst,
             register volatile __m256i_aligned *src,
             register          __m256i_aligned *end)
{
    do {
        register const __m256i m0 = src[0];
        register const __m256i m1 = src[1];
        register const __m256i m2 = src[2];
        register const __m256i m3 = src[3];

        __asm__ __volatile__ ("");

        _mm256_stream_si256( dst,     m0 );
        _mm256_stream_si256( dst + 1, m1 );
        _mm256_stream_si256( dst + 2, m2 );
        _mm256_stream_si256( dst + 3, m3 );

        __asm__ __volatile__ ("");

        src += 4;
        dst += 4;

    } while (likely(src < end));
}

void copy(void *dst, const void *src, const size_t bytes)
{
    if (bytes < 128)
        return;

    do_copy(IS_ALIGNED(dst, 32),
            IS_ALIGNED(src, 32),
            IS_ALIGNED((void *)((char *)src + bytes), 32));
}

which compiles with gcc -march=x86-64 -mtune=generic -mavx2 -O2 -S another.c to essentially (comments and directives omitted for brevity):

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        vzeroupper
        ret

copy:
        cmpq     $127, %rdx
        ja       .L8
        rep ret
.L8:
        addq     %rsi, %rdx
        jmp      do_copy

Further optimization at -O3 just inlines the helper function,

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        vzeroupper
        ret

copy:
        cmpq     $127, %rdx
        ja       .L10
        rep ret
.L10:
        leaq     (%rsi,%rdx), %rax
.L8:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rsi, %rax
        ja       .L8
        vzeroupper
        ret

and even with -Os the generated code is very nice,

do_copy:
.L3:
        vmovdqa  (%rsi), %ymm3
        vmovdqa  32(%rsi), %ymm2
        vmovdqa  64(%rsi), %ymm1
        vmovdqa  96(%rsi), %ymm0
        vmovntdq %ymm3, (%rdi)
        vmovntdq %ymm2, 32(%rdi)
        vmovntdq %ymm1, 64(%rdi)
        vmovntdq %ymm0, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .L3
        ret

copy:
        cmpq     $127, %rdx
        jbe      .L5
        addq     %rsi, %rdx
        jmp      do_copy
.L5:
        ret

Of course, without optimizations GCC-4.8.4 still produces pretty bad code. With clang-3.5 -march=x86-64 -mtune=generic -mavx2 -O2 and -Os we get essentially

do_copy:
.LBB0_1:
        vmovaps  (%rsi), %ymm0
        vmovaps  32(%rsi), %ymm1
        vmovaps  64(%rsi), %ymm2
        vmovaps  96(%rsi), %ymm3
        vmovntps %ymm0, (%rdi)
        vmovntps %ymm1, 32(%rdi)
        vmovntps %ymm2, 64(%rdi)
        vmovntps %ymm3, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .LBB0_1
        vzeroupper
        retq

copy:
        cmpq     $128, %rdx
        jb       .LBB1_3
        addq     %rsi, %rdx
.LBB1_2:
        vmovaps  (%rsi), %ymm0
        vmovaps  32(%rsi), %ymm1
        vmovaps  64(%rsi), %ymm2
        vmovaps  96(%rsi), %ymm3
        vmovntps %ymm0, (%rdi)
        vmovntps %ymm1, 32(%rdi)
        vmovntps %ymm2, 64(%rdi)
        vmovntps %ymm3, 96(%rdi)
        subq     $-128, %rsi
        subq     $-128, %rdi
        cmpq     %rdx, %rsi
        jb       .LBB1_2
.LBB1_3:
        vzeroupper
        retq

I like the another.c code (it suits my coding style), and I'm happy with the code generated by GCC-4.8.4 and clang-3.5 at -O1, -O2, -O3, and -Os on both, so I think it is good enough for me. (Note, however, that I haven't actually benchmarked any of this, because I don't have the relevant code. We use both temporal and non-temporal (nt) memory accesses, and cache behaviour (and cache interaction with the surrounding code) is paramount for things like this, so it would make no sense to microbenchmark this, I think.)

edited May 23 '17 at 11:58

Community

1
1

answered Feb 02 '16 at 07:41

Nominal Animal

38,216
5
59
86

2

The volatile qualifier probably doesn't work because the qualifier is lost when passed as argument to _mm256_load_si256. – Ross Ridge Feb 02 '16 at 07:56
@RossRidge: Yes, I agree. The part I'm unsure about, is whether it is allowed per the ISO C standards. (Accessing a volatile object being a *side effect* as the standard words it, and function calls being sequence points..) In all cases the rearranging pattern looks suspicious to me, because the initial one is done last, but the rest are kept in the same order; this is the main reason I think this might be a bug in GCC. – Nominal Animal Feb 02 '16 at 08:43
2

There's no actual dereference of the volatile pointer, so there's no actual access of a volatile object. Your volatile cast has no effect because the volatile qualifier is immediately lost as the pointer is converted to `__m256i const *` when passed as an argument to _mm256_load_si256. – Ross Ridge Feb 02 '16 at 09:12
@RossRidge: Ordering of the loads isn't even the important part. It's ordering of the WC stores to minimize flushing partially-full fill buffers that's critical here. I posted an answer (that's probably way longer than it needs to be, xD). AFAICT, just one `asm volatile("");` anywhere in the loop is enough to stop gcc's reordering. It probably takes the simple approach and doesn't try at all if there's a compiler barrier in the block it's optimizing, rather than doing as much as possible while still not violating the barrier. My version should be safe no matter what a compiler tries. – Peter Cordes Feb 02 '16 at 10:48
@Seb: Hmm, I see; very clever! Although, I wonder if some stupid compilers do extra loads and stores with that.. hopefully not. If the function was a nontemporal load or anything other than a standard load, I would use that, but with a plain aligned load, I prefer the volatile access. – Nominal Animal Feb 02 '16 at 17:27
@Seb: What's the point of that? The load instrinsics mainly exist to help you communicate alignment guarantees or lack thereof to the compiler. Compilers will fold an AVX aligned-load into a memory operand for another insn (dropping the alignment requirement), so using one doesn't eve guarantee that your code will detect and fault on unaligned inputs. Anyway, if you've already dereferenced a pointer directly, there's no point at all in feeding it through a load intrinsic. With `-O0`, that might well result in a load / store / load sequence. And Nominal, don't use that for NT loads. – Peter Cordes Feb 03 '16 at 00:10
1

@NominalAnimal Do you understand that your first example doesn't use a `volatile` access? Hence the reason it doesn't work, and hence my suggestion, which *really does* use a `volatile` access... – autistic Feb 03 '16 at 17:21
This explains why gcc continues to rearrange the accesses; it's still permitted to do so. – autistic Feb 03 '16 at 17:45
@seb: The four assignments in the point discussing `volatile`, yes: the `volatile` qualifier is dropped (and the compiler is allowed to do so) because the pointer is converted to non-volatile per the function call prototype. If I can think of a good way to word that, I'll amend the answer. Suggestions? – Nominal Animal Feb 04 '16 at 06:42
@NominalAnimal: you might want to edit out more of your earlier misunderstanding about `volatile`. It might be confusing to some people. It's an attribute of a type (if that's the right terminology), like `const`, so it doesn't survive casts, including function-arg. The `load` intrinsics work exactly like any other function. Spending so much discussing ordering the *loads* is especially bad, IMO, because it's not even important. I notice that you placed your barriers between loads and stores. This is prob. good; I maybe should have put a barrier between the last load and the first store – Peter Cordes Feb 04 '16 at 08:28
@PeterCordes: It wasn't so much a misunderstanding, but more like unfounded hope. I'm not *that* bad at C. (I don't like compiler devs who believe that simply because a standard does not require it, the compiler should not do it, even if it has important practical use cases.) I'm unsure of the importance of the load reordering; I should look into that, too. I'll see if I can come up with better wording for the `volatile` section (later); English isn't my native language, and I'm having difficulty wording this right now -- suggestions would be welcome. – Nominal Animal Feb 04 '16 at 08:50
@PeterCordes That and the attribute is to the left side of the type so `const int *` and `int const *` are pointers to `const int` or `int const`... and `int * const` is a `const` pointer to `int`. Similarly, `volatile __m256i *` is a pointer to `volatile __m256i` and accessing such a pointer is not a `volatile` access; you need to dereference the pointer to perform a `volatile` access with that type... Hence the remark you questioned the point of. – autistic Feb 04 '16 at 10:02
@Seb: I was saying that `*(volatile __m256i *)src` is good. Taking the address of that expression (by casting it to an array type) and feeding it into `_mm256_load` is *not useful*. At best it does nothing beyond the load you want. At worst (like at `-O0`), the compiler might store it back to memory and then load it again. Like I said, you only need the load intrinsics at all if you have something important to communicate to the compiler about alignment. – Peter Cordes Feb 04 '16 at 10:31
I made an edit that expresses it the way I think is sensible. I removed all suggestion that this is in any way a compiler bug. The intrinsic functions aren't really "special", and since there are no intrsinics that take volatile pointers, I don't see how you could reasonably expect the compiler to "guess" your meaning. I agree that compilers shouldn't break useful but slightly non-standard stuff like unions or pointer-casting for type-punning, but this isn't a case like that. Intrinsics can certainly be clunky; try safely using `pmovzx` as a load, when the intrin only has a `m128i` arg. – Peter Cordes Feb 04 '16 at 19:15
@PeterCordes `(__m256i[])` in `(__m256i[]) { *(volatile __m256i *) src }` does not denote "casting it to an array type"; the only cast there is `(volatile __m256i *) src`. Is it possible that you're basing all of your criticism off of an incorrect speculation? – autistic Feb 05 '16 at 04:41
1

@Seb: I read that as constructing an array with a single member, the value of which is obtained via a volatile access, and passing that as a parameter to the function. AIUI, the risk there is compiler literally constructing the array (thus, superfluous copies). – Nominal Animal Feb 05 '16 at 04:56
@Seb: The actual test of whether this has any value is what happens when you try to use a load intrinsic other than the normal aligned load that dereferencing a `__m256i*` already produces, e.g. `_mm256_loadu`. [gcc does an aligned load, then stores that to the stack with an unaligned store (and never reloads that), then does `vmovdqu ymm0, ymm0` (i.e. a no-op)](http://goo.gl/RNpZ2x). Clang optimizes it down to just an **aligned load**, not an unaligned load. (Which is correct, because by the time the load intrinsic sees it, it's just a temporary that's already loaded.) **Useless** – Peter Cordes Feb 05 '16 at 05:07
1

@PeterCordes: Thanks for the edit; I think it works much better than the original. My intent here (and in all my other answers) is not to just state an answer, but show how and why I arrived at the answer. The reason I hoped `volatile` would work, is C11 says in 6.7.3p7 that what constitutes access to an object that has volatile-qualified type is implementation-defined, so GCC *could* tag the AST generated as an access to a volatile object, even if the function prototype did not use such qualifier. On the other hand, 6.7.3p6 says access to a volatile object through non-volatile lvalue is UB.. – Nominal Animal Feb 05 '16 at 05:14
@NominalAnimal: Ok, now I see why you were hoping it would work. I also like the idea of providing background on why it's the right answer, but I try to leave out the wrong turns in the process. Otherwise my answers would be unreadably long. Editing down to just the useful part, in a coherent order, is always the hard part that takes *way* longer than just writing some code. If I typed *everything* I thought of while writing, I'd hit the 30k char limit. I did that once... (http://stackoverflow.com/a/32537772/224132). >. – Peter Cordes Feb 05 '16 at 05:21
Note that I did not just happen to declare the `m0`..`m3` inside the loop body. Any expression referring to a volatile-qualified object is evaluated strictly according to the abstract machine (C11 6.7.3p7). Initializers (initializers not part of a compound literal being full expressions) constitute sequence points, so (according to C11 5.1.2.3p3) these initializers must be initialized in the order specified. The funky/buggy part is, the return from a function call is also a sequence point, so AIUI GCC **should not** reorder the stores in any case at all. – Nominal Animal Feb 05 '16 at 05:24
@PeterCordes: Wording is difficult for me. I am *notorious* for my excess verbosity. I can't seem to help it; I just feel I'm leaving something important out (or, worse, trying to hide my errors!) when I try to cut out unnecessary stuff. Good, focusing edits, like yours to this answer, are very useful for me; thanks. – Nominal Animal Feb 05 '16 at 05:26
@Nominal: Sequence points define the logical order things happen in. The compiler can do anything as long as the code works as if it did what the source says. **Sequence points aren't globally visible**. C's memory model is very weak, meaning that compile-time reordering / hoisting of loads and stores is always allowed unless you prevent it with compiler barriers (atomic_signal_fence). Also, ya, leaving stuff out or simplifying is hard. I hate writing things I know aren't true in *all* cases, so I use "typically" or "almost always" instead of "always" or w/e. – Peter Cordes Feb 05 '16 at 05:34
@NominalAnimal I like that you've referred to the standard. You seem to have missed something, though; you'd need to cast to `__m128i * volatile` for the access you currently use to be volatile. This explains why I dereferenced the `__m128i volatile *` in my suggestion, and why your second example works (you're also dereferencing it, thus causing a `volatile` access). – autistic Feb 05 '16 at 05:50
1

@PeterCordes: Right; that's why I mentioned in my comment to your answer that I still have lots to learn about the C11 memory model. Note that C11 6.7.3p7 says *"**strictly** according to the rules of the abstract machine"*, I thought that makes a difference. Seb: Sure, but I do mean *access to a volatile object*, not *access to an object via a volatile pointer*; I was just hoping that having the volatile qualifier anywhere in the expression at all would make GCC add a "volatile" tag to the entire AST fragment. It would be useful. – Nominal Animal Feb 05 '16 at 06:10
@Seb: I've always been talking about dereferencing a a pointer-to-volatile. That's the *problem*: It gives you load ordering and forbids the compiler from hoisting loads, but it doesn't let specify unaligned loads. By far the most convincing argument you could make would be to modifty this [minimal access-to-volatile function](http://goo.gl/Q7AJnq) so it does a volatile unaligned access. When you have it working, click "permalink" and past the link here in a reply. If you can't actually apply the effect of a load intrinsic to the actual load, your idea is just wasted characters. – Peter Cordes Feb 05 '16 at 06:22
@NominalAnimal: Notice in that godbolt link I just posted, clang warns: `warning: passing 'volatile __m128i *' to parameter of type 'const __m128i *' discards qualifiers`. Consider what you could expect from `m256i load(const m256i*)` if it was separately-compiled: There's no way it could have any clue what the caller wanted. Would it be sane for it to matter whether it was inlined or not? No, that would be *in*sane. As I understand it, function inlining should never change the behaviour of a program (other than performance, or cases where the program had undefined behaviour). – Peter Cordes Feb 05 '16 at 06:27
Point of the prev comment: intrinsics aren't "special". They work exactly like proper functions, with all the same semantics for types in their args and return values. On some implementations, you can actually take the address of `_mm_loadu_si128`, but I think maybe not on others. For C11/C++11 memory model stuff, I found Jeff Preshing's blog was gold: http://preshing.com/20120930/weak-vs-strong-memory-models/ is a good starting point. – Peter Cordes Feb 05 '16 at 06:31
@PeterCordes What do you mean by *unaligned*? Everything is *aligned*, whether that be to your peril, byte-aligned or perhaps bit-aligned. Perhaps you're referring to things that are *unsuitably aligned*, that is not following the alignment requirements for a target type which I see no conclusive evidence of in this question. Nonetheless, you seem to be speculating that invoking undefined behaviour is appropriate (you're encouraging a "volatile *unaligned* access"). Are you aware that even on an x86, this might cause serious errors? Don't we need to talk about "correct code" before optimising? – autistic Feb 05 '16 at 06:56
@Seb: Do you understand the difference between `movdqu` and `movdqa`? If your accesses are 16B-aligned or 32B-aligned, you can use `movdqa` (and dereference pointers to `__m128i` types). If they're *not*, `movdqa` will fault. My whole point the entire time is that if the pointer you cast to `(volatile __m128i*)` is 16B aligned, there's no benefit to using the `_mm_load` intrinsic. If it's not 16B-aligned, you must go through a `_mm_loadu` intrinsics, and must *not* dereference it. If you don't even understand the term "unaligned" in this context, that's a big problem. – Peter Cordes Feb 05 '16 at 07:12
@PeterCordes: My intent for using `volatile` was to ensure the load is actually done, and done at that point. When compiling code for a function call where `volatile` qualifier is discarded, the compiler *could* treat the entire expression as "volatile" in that sense: strictly conforming to the C virtual machine rules, say by marking the AST sub-tree as "do not optimize away or reorder". It does not need to affect the function call itself, and the volatile qualifier gets discarded from the parameters -- function itself is not affected, only code in the caller. That was my hope. – Nominal Animal Feb 05 '16 at 07:46
@PeterCordes If you read my previous comment carefully, this time, you might determine: 1. I understand what you probably mean by "unaligned" as (more correctly) "unsuitably aligned"; 2. There's no evidence in the question to suggest that this happens, for example `src` is (or absolutely *should be*) returned by `malloc`, `calloc` or `realloc`; 3. If `src` isn't suitably aligned, violating alignment requirements is "undefined behaviour" and should not be relied upon let alone encouraged, even on x86. – autistic Feb 05 '16 at 08:01
@Seb: When I said "in this context", I meant in the context of x86 SSE/AVX instructions. There wouldn't be a `_mm_loadu_si128` intrinsic if you weren't supposed to be able to load 16 bytes from an arbitrarily aligned location. The whole reason the load and store intrinsics exist in the first place is so they can be used **instead of dereferencing**. You're right, it is undefined behaviour to dereference a `__m128i *` that isn't aligned to a 16B boundary. So if you *do* deref a `volatile __m128i*`, you gain nothing from a load intrin. If you need a loadu, you can't deref. – Peter Cordes Feb 05 '16 at 08:08
@PeterCordes ["A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined."](http://port70.net/~nsz/c/c11/n1570.html#6.3.2.3p7) Do you see how the fault is at the cast, not the access? Go back to the very top of this page and take a look at the question... Do you see how the question has the same fault? Are you sure you're commenting on the right part of this page? – autistic Feb 05 '16 at 08:10
@Seb: I didn't have space to add that you're right, the OP's code doesn't have this issue. His loads are aligned, so can be done with dereferencing. Anyway, the only point of your construct is so you can run your dereferenced pointer through `_mm_load_si128`. There's no reason to do this. The only benefit you could gain from the construct is when using it with a different load intrinsic, e.g. `_mm_loadu_si128` which can load from an unaligned pointer without undefined behaviour. Since `_mm_load_si128` is the same as a normal dereference, your construct has no benefit. – Peter Cordes Feb 05 '16 at 08:14
@Seb: maybe you haven't understood that the whole point of SIMD is to operate in parallel a buffer of `float`s or something. A `float *` doesn't have to be aligned to 16B, so you often have situations where a SIMD function needs to load its data from an input buffer that might possibly not be 16B-aligned. There is fast hardware support for unaligned loads, and the intrinsics expose this as `_mm_loadu`. *This* is what "unaligned loads" means in **this context**. Remember that SSE intrinsics are already platform-specific, and aren't part of any standard. – Peter Cordes Feb 05 '16 at 08:17
@Nominal: Oh, like don't reorder the function call? I see what you're thinking. No, that's not how it works. It's a pointer-to-volatile. Only dereferencing it invokes the behaviour of a volatile object. Checking it for NULL, assigning it to another variable of the same pointer-to-volatile type, and everything else can be fully optimized the same as any other pointer math. It's only access to the pointed-to volatile object that has to happen "here and now". – Peter Cordes Feb 05 '16 at 08:21
@PeterCordes The interface for `_mm_loadu_si128` would require a conversion which violates alignment requirements, resulting in undefined behaviour; that's not *without undefined behaviour* as you say it is. It may appear to function on your x86 system, but it's not a requirement. If your interface requires that you violate the C standard, that is unfortunate... Since you seem to be speaking from a very platform-specific standpoint with no reference to the standard (let alone an understanding of UB), I'm inclined to believe that you're the one who has misunderstood. Peace. -drops mic- – autistic Feb 05 '16 at 08:24
@Seb: Thanks for pointing out that even creating an unaligned pointer is UB, without dereferencing it. I know what UB in general is, and I misread your quote with what I was expecting to see. Clearly that behaviour is defined for compilers that implement SSE intrinscs, which aren't usable in portable code anyway... I never claimed that Intel's C/C++ intrinsics were wonderful or nicely designed (they're not). I'd much prefer it if `_mm_loadu_si128` took a `char*` or `void*`, or even were overloaded with versions for various integer types. `_mm_loadu_ps(float*)` already works that way. – Peter Cordes Feb 05 '16 at 08:39
@PeterCordes: "Only dereferencing it invokes the behaviour of a volatile object" -- well, C11 6.7.3p7 does say "What constitutes an access to an object that has volatile-qualified type is implementation-defined". As I see it, the standard would allow a compiler to do as I originally hoped/wished, and the behaviour (not reordering the call etc. if `volatile` qualifier is involved, even if dropped per function prototype) would certainly be useful in ordering of side effects. In C, `atomic_signal_fence()` et al. is quite new, and I dislike relying on C++ details for C.. – Nominal Animal Feb 05 '16 at 09:13
@NominalAnimal: Yes, that does sound like it would be a valid way for a C implementation to work, I stand corrected. However, that's not the kind of thing that *will* change for the major compilers at this point. I wish I knew the standard in more detail, but I don't wish badly enough to actually spend a lot of time studying it, xD. C11's `atomic_signal_fence` is pure C11. C++ has an identical function with an identical name, but that's nothing new for these languages. Note that the link in my answer is to the C page for it. (en.cppreference.com hosts C and C++ docs; watch the URL) – Peter Cordes Feb 05 '16 at 09:18
@PeterCordes: Yes, I know; but e.g. [gcc](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins) documents these via C++11. In any case, practice trumps theory; reality wins every time . The standard is basically just a collective agreement between compiler writers. What we have in our answers, are pretty robust (in practice) solutions to the original problem, fixing (GCC) memory access ordering. But.. I don't think the OP cares anymore. – Nominal Animal Feb 05 '16 at 09:29
@NominalAnimal: Questions are at least as important as places to put good / generally useful answers full of knowledge than as ways to directly help the one person that asked them. – Peter Cordes Feb 05 '16 at 09:31
@NominalAnimal Consider an object that has volatile-qualified type, versus an object that's been volatile-qualifier casted. For example, `int x;` x is an object that doesn't have volatile-qualified type... `*(int volatile *)&x` refers to the same object, though volatility has been casted into the lvalue. Does the object have volatile-qualified type? Now consider "What constitutes an access to an object that has volatile-qualified type is implementation-defined" – autistic Feb 05 '16 at 14:55
1

@Seb, I appreciate the effort, but I really am not that into language-lawyerism and determining semantics (== I don't trust my skill in such that far, and thus do not really care that much). It was my reason for an unfounded hope, which I immediately (before I even posted my answer) found to not work in practice. I originally retained in my post only in case others would thing along the same track (and show the track does not lead to a solution). There is no need to keep beating me with a clue bat. But, if you have comments or suggestions wrt. the example solution above, I'm listening. – Nominal Animal Feb 07 '16 at 10:33
@PeterCordes, I fully agree, but I find the OP's title is unlikely to lead others with similar problems (vectorized or MMX/SSE/AVX/AVX2 intrinsics getting reordered) to this question. Answering one person is not interesting, except if the problem itself is intrinsicly interesting. If no-one ever reads an answer, is the answer worth anything? – Nominal Animal Feb 07 '16 at 10:36

score 5 · Answer 2 · edited May 23 '17 at 12:14

First of all, normal people use gcc -O3 -march=native -S and then edit the .s to test small modifications to compiler output. I hope you had fun hex-editing that change. :P You could also use Agner Fog's excellent objconv to make disassembly that can be assembled back into a binary with your choice of NASM, YASM, MASM, or AT&T syntax.

Using some of the same ideas as Nominal Animal, I made a version that compiles to similarly good asm. I'm confident about why it compiles to good code though, and I have a guess about why the ordering matters so much:

CPUs only have a few (~10?) write-combining fill buffers for NT loads / stores.

See this article about copying from video memory with streaming loads, and writing to main memory with streaming stores. It's actually faster to bounce the data through a small buffer (much smaller than L1), to avoid having the streaming loads and streaming stores compete for fill buffers (esp. with out-of-order execution). Note that using "streaming" NT loads from normal memory is not useful. As I understand it, streaming loads are only useful for I/O (including stuff like video RAM, which is mapped into the CPU's address space in an Uncacheable Software-Write-Combining (USWC) region). Main-memory RAM is mapped WB (Writeback), so the CPU is allowed to speculatively pre-fetch it and cache it, unlike USWC. Anyway, so even though I'm linking an article about using streaming loads, I'm not suggesting using streaming loads. It's just to illustrate that contention for fill buffers is almost certainly the reason that gcc's weird code causes a big problem, where it wouldn't with normal non-NT stores.

Also see John McAlpin's comment at the end of this thread, as another source confirming that WC stores to multiple cache lines at once can be a big slowdown.

gcc's output for your original code (for some braindead reason I can't imagine) stored the 2nd half of the first cacheline, then both halves of the second cacheline, then the 1st half of the first cacheline. Probably sometimes the write-combining buffer for the 1st cacheline was getting flushed before both halves were written, resulting in less efficient use of external buses.

clang doesn't do any weird re-ordering with any of our 3 versions (mine, OP's, and Nominal Animal's).

Anyway, using compiler-only barriers that stop compiler reordering but don't emit a barrier instruction is one way to stop it. In this case, it's a way of hitting the compiler over the head and saying "stupid compiler, don't do that". I don't think you should normally need to do this everywhere, but clearly you can't trust gcc with write-combining stores (where ordering really matters). So it's probably a good idea to look at the asm at least with the compiler you're developing with when using NT loads and/or stores. I've reported this for gcc. Richard Biener points out that -fno-schedule-insns2 is a sort-of workaround.

Linux (the kernel) already has a barrier() macro that acts as a compiler memory barrier. It's almost certainly just a GNU asm volatile(""). Outside of Linux, you can keep using that GNU extension, or you can use the C11 stdatomic.h facilities. They're basically the same as the C++11 std::atomic facilities, with AFAIK identical semantics (thank goodness).

I put a barrier between every store, because they're free when there's no useful reordering possible anyway. It turns out just one barrier inside the loop keeps everything nicely in order, which is what Nominal Animal's answer is doing. It doesn't actually disallow the compiler from reordering stores that don't have a barrier separating them; the compiler just chose not to. This is why I barriered between every store.

I only asked the compiler for a write-barrier, because I expect only the ordering of the NT stores matters, not the loads. Even alternating load and store instructions probably wouldn't matter, since OOO execution pipelines everything anyway. (Note that the Intel copy-from-video-mem article even used mfence to avoid overlap between doing streaming stores and streaming loads.)

atomic_signal_fence doesn't directly document what all the different memory ordering options do with it. The C++ page for atomic_thread_fence is the one place on cppreference where there are examples and more on this.

This is the reason I didn't use Nominal Animal's idea of declaring src as pointer-to-volatile. gcc decides to keep the loads in the same order as stores.

Given that, unrolling only by 2 probably won't make any throughput difference in microbenchmarks, and will save uop cache space in production. Each iteration would still do a full cache line, which seems good.

SnB-family CPUs can't micro-fuse 2-reg addressing modes, so the obvious way to minimize loop overhead (get pointers to the end of src and dst, and then count a negative index up towards zero) doesn't work. The stores wouldn't micro-fuse. You'd very quickly fill up the fill-buffers to the point where the extra uops don't matter anyway, though. That loop probably runs nowhere near 4 uops per cycle.

Still, there is a way to reduce loop overhead: with my ridiculously ugly-and-unreadable-in-C hack to get the compiler to only do one sub (and a cmp/jcc) as loop overhead, no unrolling at all would make a 4-uop loop that should issue at one iteration per clock even on SnB. (Note that vmovntdq is AVX2, while vmovntps is only AVX1. Clang already uses vmovaps / vmovntps for the si256 intrinsics in this code! They have the same alignment requirement, and don't care what bits they store. It doesn't save any insn bytes, only compatibility.)

See the first paragraph for a godbolt link to this.

I guessed you were doing this inside the Linux kernel, so I put in appropriate #ifdefs so this should be correct as kernel code or when compiled for user-space.

#include <stdint.h>
#include <immintrin.h>

#ifdef __KERNEL__  // linux has it's own macro
//#define compiler_writebarrier()   __asm__ __volatile__ ("")
#define compiler_writebarrier()   barrier()
#else
// Use C11 instead of a GNU extension, for portability to other compilers
#include <stdatomic.h>
// unlike a single store-release, a release barrier is a StoreStore barrier.
// It stops all earlier writes from being delayed past all following stores
// Note that this is still only a compiler barrier, so no SFENCE is emitted,
// even though we're using NT stores.  So from another core's perpsective, our
// stores can become globally out of order.
#define compiler_writebarrier()   atomic_signal_fence(memory_order_release)
// this purposely *doesn't* stop load reordering.  
// In this case gcc loads in the same order it stores, regardless.  load ordering prob. makes much less difference
#endif

void copy_pjc(void *const destination, const void *const source, const size_t bytes)
{
          __m256i *dst  = destination;
    const __m256i *src  = source;
    const __m256i *dst_endp = (destination + bytes); // clang 3.7 goes berserk with intro code with this end condition
        // but with gcc it saves an AND compared to Nominal's bytes/32:

    // const __m256i *dst_endp = dst + bytes/sizeof(*dst); // force the compiler to mask to a round number


    #ifdef __KERNEL__
    kernel_fpu_begin();  // or preferably higher in the call tree, so lots of calls are inside one pair
    #endif

    // bludgeon the compiler into generating loads with two-register addressing modes like [rdi+reg], and stores to [rdi]
    // saves one sub instruction in the loop.
    //#define ADDRESSING_MODE_HACK
    //intptr_t src_offset_from_dst = (src - dst);
    // generates clunky intro code because gcc can't assume void pointers differ by a multiple of 32

    while (dst < dst_endp)  { 
#ifdef ADDRESSING_MODE_HACK
      __m256i m0 = _mm256_load_si256( (dst + src_offset_from_dst) + 0 );
      __m256i m1 = _mm256_load_si256( (dst + src_offset_from_dst) + 1 );
      __m256i m2 = _mm256_load_si256( (dst + src_offset_from_dst) + 2 );
      __m256i m3 = _mm256_load_si256( (dst + src_offset_from_dst) + 3 );
#else
      __m256i m0 = _mm256_load_si256( src + 0 );
      __m256i m1 = _mm256_load_si256( src + 1 );
      __m256i m2 = _mm256_load_si256( src + 2 );
      __m256i m3 = _mm256_load_si256( src + 3 );
#endif

      _mm256_stream_si256( dst+0, m0 );
      compiler_writebarrier();   // even one barrier is enough to stop gcc 5.3 reordering anything
      _mm256_stream_si256( dst+1, m1 );
      compiler_writebarrier();   // but they're completely free because we are sure this store ordering is already optimal
      _mm256_stream_si256( dst+2, m2 );
      compiler_writebarrier();
      _mm256_stream_si256( dst+3, m3 );
      compiler_writebarrier();

      src += 4;
      dst += 4;
    }

  #ifdef __KERNEL__
  kernel_fpu_end();
  #endif

}

It compiles to (gcc 5.3.0 -O3 -march=haswell):

copy_pjc:
        # one insn shorter than Nominal Animal's: doesn't mask the count to a multiple of 32.
        add     rdx, rdi  # dst_endp, destination
        cmp     rdi, rdx  # dst, dst_endp
        jnb     .L7       #,
.L5:
        vmovdqa ymm3, YMMWORD PTR [rsi]   # MEM[base: src_30, offset: 0B], MEM[base: src_30, offset: 0B]
        vmovdqa ymm2, YMMWORD PTR [rsi+32]        # D.26928, MEM[base: src_30, offset: 32B]
        vmovdqa ymm1, YMMWORD PTR [rsi+64]        # D.26928, MEM[base: src_30, offset: 64B]
        vmovdqa ymm0, YMMWORD PTR [rsi+96]        # D.26928, MEM[base: src_30, offset: 96B]
        vmovntdq        YMMWORD PTR [rdi], ymm3 #* dst, MEM[base: src_30, offset: 0B]
        vmovntdq        YMMWORD PTR [rdi+32], ymm2      #, D.26928
        vmovntdq        YMMWORD PTR [rdi+64], ymm1      #, D.26928
        vmovntdq        YMMWORD PTR [rdi+96], ymm0      #, D.26928
        sub     rdi, -128 # dst,
        sub     rsi, -128 # src,
        cmp     rdx, rdi  # dst_endp, dst
        ja      .L5 #,
        vzeroupper
.L7:

Clang makes a very similar loop, but the intro is much longer: clang doesn't assume that src and dest are actually both aligned. Maybe it doesn't take advantage of the knowledge that the loads and stores will fault if not 32B-aligned? (It knows it can use ...aps instructions instead of ...dqa, so it certainly does more compiler-style optimization of intrinsics that gcc (where they more often always turn into the relevant instruction). clang can turn a pair of left/right vector shifts into a mask from a constant, for example.)

Very interesting! You delve much deeper into the CPU architecture than I do, and your comments are excellent. This spurred me into installing `clang-3.5` just to see if I could rewrite my function (but still keeping to "my style", so to speak) but get good code out of both GCC and clang at all optimization levels (except `-O0`, at which GCC is hopeless). I can see I still have lots to learn about the C11 memory model and atomics. Thanks! — Nominal Animal, Feb 02 '16 at 17:19

Wrong gcc generated assembly ordering, results in performance hit

2 Answers2

Linked