
Why is the assembly output of store_idx_x86() the same as store_idx() and load_idx_x86() the same as load_idx()?

It was my understanding that __atomic_load_n() would flush the core's invalidation queue, and __atomic_store_n() would flush the core's store buffer.

Note -- I compiled with gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)

Update: I understand that x86 will never reorder stores with other stores or loads with other loads -- so is gcc smart enough to emit sfence and lfence only when they are needed, or should using __atomic_ always result in a fence (assuming a memory model stricter than __ATOMIC_RELAXED)?

Code

#include <stdint.h>


inline void store_idx_x86(uint64_t* dest, uint64_t idx)
{
    *dest = idx;
}

inline void store_idx(uint64_t* dest, uint64_t idx)
{
    __atomic_store_n(dest, idx, __ATOMIC_RELEASE);
}

inline uint64_t load_idx_x86(uint64_t* source)
{
    return *source;
}

inline uint64_t load_idx(uint64_t* source)
{
    return __atomic_load_n(source, __ATOMIC_ACQUIRE);
}

Assembly:

.file   "util.c"
    .text
    .globl  store_idx_x86
    .type   store_idx_x86, @function
store_idx_x86:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movq    -8(%rbp), %rax
    movq    -16(%rbp), %rdx
    movq    %rdx, (%rax)
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   store_idx_x86, .-store_idx_x86
    .globl  store_idx
    .type   store_idx, @function
store_idx:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    %rsi, -16(%rbp)
    movq    -8(%rbp), %rax
    movq    -16(%rbp), %rdx
    movq    %rdx, (%rax)
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE1:
    .size   store_idx, .-store_idx
    .globl  load_idx_x86
    .type   load_idx_x86, @function
load_idx_x86:
.LFB2:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    (%rax), %rax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE2:
    .size   load_idx_x86, .-load_idx_x86
    .globl  load_idx
    .type   load_idx, @function
load_idx:
.LFB3:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    %rdi, -8(%rbp)
    movq    -8(%rbp), %rax
    movq    (%rax), %rax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE3:
    .size   load_idx, .-load_idx
    .ident  "GCC: (GNU) 4.8.2 20140120 (Red Hat 4.8.2-16)"
    .section    .note.GNU-stack,"",@progbits
Bigtree
  • For 80x86, aligned loads and stores are guaranteed (by the CPU) to be atomic. If the compiler can guarantee the load/store is always correctly aligned, the code you see above is fine. However... – Brendan Feb 09 '15 at 02:54
  • ...I'm not necessarily convinced that the compiler can guarantee the load/store will always be correctly aligned (e.g. `(uint64_t*)&myArrayOfChar[3]`) and therefore I'm not convinced it is fine; unless you abuse "implementation defined" (with regard to pointer type conversions) as a lame excuse for "unexpected but technically permitted to ruin your entire week" behaviour (which is something GCC developers seem to have become fond of). – Brendan Feb 09 '15 at 02:54
  • It might be interesting to change `*dest` from `uint64_t` to `void`, in which case the compiler probably shouldn't assume `*dest` is aligned. – Timothy Johns Feb 09 '15 at 14:31
  • __atomic_ won't accept void or structs – Bigtree Feb 09 '15 at 17:50
  • The atomicity part is explained by https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic. But it's not really a duplicate because you also ask about fences, and the answer here gets that right. See also http://preshing.com/20120930/weak-vs-strong-memory-models/ – Peter Cordes Sep 01 '17 at 22:38
  • @brendan - compilers have long assumed that memory is correctly aligned and the various aliasing and casting rules are there to ensure it is preserved. So GCC isn't doing anything that other compilers haven't been doing forever in that respect. In particular, if compilers didn't assume alignment you pretty much couldn't generate efficient code at all on platforms that don't allow misaligned memory operations. While they are less common today, they exist and historically there have been many, so assuming alignment has been important pretty much forever. – BeeOnRope Sep 01 '17 at 23:22

1 Answer


Why is the assembly output of store_idx_x86() the same as store_idx() and load_idx_x86() the same as load_idx()?

On x86, assuming compiler-enforced alignment, they are the same operations. Loads and stores of the native word size or smaller, to aligned addresses, are guaranteed to be atomic. See the Intel SDM, vol. 3A, section 8.1.1:

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically: Reading or writing a quadword aligned on a 64-bit boundary [...]

Furthermore, x86 enforces a strongly ordered (TSO) memory model: every store already has release semantics and every load already has acquire semantics, so neither of the orderings you requested needs an extra fence instruction.
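
That said, the builtins are not no-ops even where the hardware provides the ordering for free: they still act as compiler barriers, preventing GCC from reordering, combining, or hoisting the access. A minimal sketch (a hypothetical helper, not from your code) of where that matters:

#include <stdint.h>

/* The acquire load must be re-issued on every iteration; a plain load
 * in an optimized build could legally be hoisted out of the loop. */
void wait_for_flag(uint64_t* flag)
{
    while (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0)
        ;   /* spin until another thread performs a release store of non-zero */
}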

Lastly, the fencing instructions you mention (sfence/lfence) are only required when using Intel's non-temporal SSE instructions (great reference here), or when you need a store-load fence (article here), and for that the instruction is actually mfence or a locked instruction.
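
The one reordering x86 does allow is a store followed by a load of a different location becoming visible out of order, which is exactly what Dekker/Peterson-style mutual exclusion relies on not happening. A sketch (hypothetical flag names) of where __ATOMIC_SEQ_CST makes GCC emit that full barrier (a mov plus mfence with this compiler, an xchg with newer ones):

#include <stdint.h>

/* Announce intent, then check the peer. Without a StoreLoad barrier
 * between the two operations, both threads could see the other's flag
 * as 0 and enter the critical section together. */
int try_enter(uint64_t* my_flag, uint64_t* other_flag)
{
    __atomic_store_n(my_flag, 1, __ATOMIC_SEQ_CST);            /* store + full barrier */
    return __atomic_load_n(other_flag, __ATOMIC_SEQ_CST) == 0; /* plain mov load */
}

Compiling the same two operations with __ATOMIC_RELEASE / __ATOMIC_ACQUIRE produces plain movs, which is not enough for this particular pattern.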

Aside: I was curious about that statement in Intel's manuals, so I devised a test program. Frustratingly, on my computer (2 core i3-4030U), I get this output from it:

unaligned
4265292 / 303932066 | 1.40337%
unaligned, but in same cache line
2373 / 246957659 | 0.000960893%
aligned (8 byte)
0 / 247097496 | 0%

This seems to violate what Intel says. I will investigate. In the meantime, you should clone that demo program and see what it gives you. You just need -std=c++11 ... -pthread on Linux.
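
Not the linked demo, but a minimal sketch of that kind of torn-read test, written in C with pthreads to match the question's code (the pattern constants and iteration count here are made up): one thread flips an aligned 64-bit variable between two bit patterns with plain stores while the main thread counts loads that match neither pattern. Per the manual quote above, the aligned case should report zero on x86. Build with gcc -O2 -pthread.

#include <stdint.h>
#include <stdio.h>
#include <pthread.h>

#define PATTERN_A 0x0000000000000000ULL
#define PATTERN_B 0xFFFFFFFFFFFFFFFFULL

static volatile uint64_t shared = PATTERN_A;   /* naturally 8-byte aligned */
static volatile int stop = 0;

static void* writer(void* arg)
{
    (void)arg;
    while (!stop) {              /* flip between the two patterns with plain stores */
        shared = PATTERN_A;
        shared = PATTERN_B;
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    uint64_t torn = 0, total;

    pthread_create(&t, NULL, writer, NULL);
    for (total = 0; total < 100000000ULL; total++) {
        uint64_t v = shared;     /* plain 64-bit load of an aligned address */
        if (v != PATTERN_A && v != PATTERN_B)
            torn++;              /* neither pattern => the load was torn */
    }
    stop = 1;
    pthread_join(t, NULL);

    printf("aligned (8 byte)\n%llu / %llu torn\n",
           (unsigned long long)torn, (unsigned long long)total);
    return 0;
}

Pointing the same loads and stores at an address that straddles a cache-line boundary is what produces non-zero counts like the "unaligned" figures above.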

Myles Hathcock
  • As a followup to this, check out my blog post: https://relativebit.com/2015/04/07/cpp-unaligned-memory-accesses.html – Myles Hathcock May 21 '15 at 02:39
  • Updated link for the above comment: https://hathcock.sh/2015/06/17/torn_reads.html – Myles Hathcock Apr 10 '16 at 03:13
  • Did you ever sort out what happened with your test for unaligned within a cache line? Your links are down. Presumably Intel's manuals are right and you made a mistake, because they still say that accesses are atomic if you don't cross a cache-line boundary. :P (Applies only to *cached* accesses, i.e. on write-back memory, not uncacheable, but you would have had to go out of your way to mmap some device memory or VGA memory.) – Peter Cordes Sep 01 '17 at 22:35