1

Are 16-byte atomic<> variables automatically aligned on 16-byte boundaries allowing the compiler/runtime libraries to efficiently use the x86 CMPXCHG16B instruction? Or should we as a matter of style always manually specify alignas(16) for all such variables?

Peter Cordes
Swiss Frank

1 Answer

6

Any decent implementation of std::atomic<> will use alignas itself to make lock cmpxchg16b efficient, if the library uses lock cmpxchg16b at all instead of a mutex for 16-byte objects.

Not all implementations do; for example, I think MSVC's standard library makes 16-byte objects fully non-lock-free, using the standard mutex fallback.

You don't need alignas(16) on atomic<T>.
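If you want to see what your own implementation does, a compile-time check along these lines can help. This is only a sketch: `Pair16`, `plain_align`, `atomic_align`, `naturally_aligned` and `always_lock_free` are hypothetical names chosen for illustration, and `is_always_lock_free` requires C++17.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte payload used only for illustration.
struct Pair16 {
    std::uint64_t lo;
    std::uint64_t hi;
};
static_assert(sizeof(Pair16) == 16, "this sketch assumes a 16-byte payload");

// Alignment of the plain struct vs. alignment the library picks for atomic<>.
constexpr std::size_t plain_align  = alignof(Pair16);
constexpr std::size_t atomic_align = alignof(std::atomic<Pair16>);

// The wrapper should never be less aligned than the payload it holds.
static_assert(atomic_align >= plain_align, "atomic<> must not reduce alignment");

// True when the library already over-aligned atomic<Pair16> to its size,
// which is what lock cmpxchg16b wants. Not asserted, because a library that
// falls back to a mutex is free to leave this false.
constexpr bool naturally_aligned = atomic_align >= sizeof(std::atomic<Pair16>);

// Compile-time answer to "does this type ever fall back to a lock?".
constexpr bool always_lock_free = std::atomic<Pair16>::is_always_lock_free;
```

Printing `plain_align`, `atomic_align` and `always_lock_free` on your toolchain shows whether the library over-aligns for you; the results differ between implementations, so none are asserted here beyond the basic sanity check.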

You only need manual alignment for atomics if you have a plain T object that you want to use atomic_ref on. atomic_ref<> has no mechanism to align an already existing T object. The current version of the design exposes a required_alignment member you should use. It's up to you to do that for correctness. (Otherwise you get UB which could mean tearing, or just extremely slow system-wide performance for split lock RMWs.)

 // for atomic_ref<T>
alignas(std::atomic_ref<T>::required_alignment) T sometimes_atomic_var;

 // Alternative declaration (use one or the other, not both):
 // often equivalent, and doesn't require checking that atomic_ref<T> is supported.
 // Uses the same alignment as atomic<T>.
alignas(std::atomic<T>) T sometimes_atomic_var;
Note that a misaligned lock cmpxchg16b split across a cache line boundary would still be atomic but very very slow (same as any locked instruction: the atomicity guarantee for atomic RMW is not contingent on alignment). More like an actual bus lock, instead of just a local-to-this-core cache lock delaying MESI responses.
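Whether an access would be split is simple address arithmetic. The sketch below assumes 64-byte cache lines (typical for current x86 hardware, but an assumption, not something you can query portably this way); `kCacheLine` and `crosses_cache_line` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// True if the byte range [addr, addr + size) touches more than one cache line.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 && (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}

// A 16-byte object at a 16-byte-aligned address stays inside one line...
static_assert(!crosses_cache_line(0, 16), "offset 0 stays in line 0");
static_assert(!crosses_cache_line(48, 16), "offset 48 ends at the last byte of line 0");

// ...but the same object at an 8-byte-aligned address can straddle two lines.
static_assert(crosses_cache_line(56, 16), "offset 56 spills into line 1");
```

This is why 16-byte alignment is enough to avoid the split-lock case for a 16-byte object: 64 is a multiple of 16, so an aligned object can never straddle a line boundary.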

Narrower atomics definitely need to be naturally aligned for correctness because pure-load and pure-store can compile to asm pure load or store where HW guarantees require some alignment.

But 16-byte objects are only guaranteed atomic with lock cmpxchg16b, so .load() and .store() have to be implemented with lock cmpxchg16b. (Load with CAS(0,0) to get the old value, which either replaces 0 with itself or does nothing, and store with a CAS retry loop. This sucks but is somewhat better than a mutex. It doesn't have the read-side scalability you'd expect from a lock-free load, which is one reason GCC7 and later no longer advertise atomic<16-byte-object> as lock-free, even though they still use lock cmpxchg16b in the libatomic functions they call instead of inlining lock cmpxchg16b.)
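The CAS-based load and store described above can be sketched with the portable `compare_exchange` API. This is an illustration of the idea, not how any particular library implements it: `load_via_cas`, `store_via_cas` and `demo_cas_load_store` are hypothetical names, and the demo uses a 64-bit value as a stand-in so it builds without extra flags. Instantiating the templates with a 16-byte type may require something like linking libatomic on GCC.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Load using only CAS: guess a value (all zeros) and try to "replace" it with itself.
// If the guess was right, the object is rewritten with the same value.
// If it was wrong, the CAS fails and copies the current value into `expected`.
// Note this needs a non-const object: even a "load" is a write-capable RMW.
template <class T>
T load_via_cas(std::atomic<T>& obj) {
    T expected{};
    obj.compare_exchange_strong(expected, expected);
    return expected;
}

// Store using a CAS retry loop: each failure refreshes `expected` with the
// current value, so the next attempt can succeed unless another thread raced us.
template <class T>
void store_via_cas(std::atomic<T>& obj, T desired) {
    T expected{};
    while (!obj.compare_exchange_weak(expected, desired)) {
        // retry with the updated `expected`
    }
}

std::uint64_t demo_cas_load_store() {
    std::atomic<std::uint64_t> word{7};
    assert(load_via_cas(word) == 7);         // failed CAS reports the current value
    store_via_cas(word, std::uint64_t{42});  // loop until the CAS succeeds
    return load_via_cas(word);
}
```

`demo_cas_load_store()` returns 42. Because every "load" is really a locked read-modify-write, concurrent readers contend for exclusive ownership of the cache line, which is the read-side scalability problem mentioned above.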

Peter Cordes
  • 2
    Perhaps suggest `static_assert(alignof(std::atomic<...>) >= sizeof(std::atomic<...>));` – Acorn May 25 '20 at 01:21
  • @Acorn: If you're trying to detect libraries that don't internally use `lock cmpxchg16b` at all, then maybe. e.g. if you're rolling your own atomic accesses to the object sometimes, and need to be sure that compiler-generated accesses are using `lock`ed instructions, not relying on some mutex that you'd ignore. – Peter Cordes May 25 '20 at 01:24
  • 3
    Might be worth mentioning in your answer: Linux 5.7 is getting a new option to send `SIGBUS` to processes that try to take a split lock. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6650cdd9a8ccf00555dbbe743d58541ad8feb6a7 – Joseph Sible-Reinstate Monica May 25 '20 at 02:03
  • 2
    @JosephSible-ReinstateMonica: There's also a perf counter for it on most hardware. `sq_misc.split_lock`. But yes, as that kernel patch notes, it's very very bad including for other cores, so much so that other HW features were introduced to detect it. Terminology note: the lock isn't an object that you "take", it's a modifier on the access. This is about instructions with a `lock` prefix that do one atomic transaction. – Peter Cordes May 25 '20 at 02:19
  • Yeah, my wording there was a bit sloppy. I meant processes that do a `lock`-prefixed instruction across cache lines that makes the **core** take the global bus lock. – Joseph Sible-Reinstate Monica May 25 '20 at 02:26
  • 2
    Manual specification of alignment may be required for `atomic_ref`, as it cannot control its _external_ underlying variable. A decent implementation of `atomic` will even pad, say, an 11-byte underlying type to 16. – Alex Guteniev May 25 '20 at 04:14
  • @AlexGuteniev: The OP didn't ask about atomic_ref in this question, but yes I guess it would be a good idea. (And I did recently answer one of the OP's other questions with `atomic_ref` as the answer.) – Peter Cordes May 25 '20 at 04:38
  • ... _and maybe correctness depending on how careful the atomic_ref implementation is_ -- could you please elaborate on that in my question? Do you think a good implementation is expected to be correct in this case? https://stackoverflow.com/questions/61996108/atomic-ref-when-external-underlying-type-is-not-aligned-as-requested – Alex Guteniev May 25 '20 at 05:19
  • 1
    @AlexGuteniev: Last I looked, I don't think `atomic_ref` included a `required_alignment` member. My earlier phrasing was based on that; updated. Either I missed it or the standards committee only later noticed that atomicity sometimes requires more than `alignof(T)`, and that nobody wants compilers to emit code that checks alignment at runtime to decide whether a specific object is lock-free or not. (i.e. everyone hates `.is_lock_free()`, loves `.is_always_lock_free`) – Peter Cordes May 25 '20 at 05:36
  • N4861 includes `required_alignment`, so probably it will be in the final standard. I'm trying to contribute `atomic_ref` to MSVC ( https://github.com/microsoft/STL/pull/843 ), and found out that it is possible to reuse `atomic` implementation for `atomic_ref`, without breaking `atomic` ABI. But runtime fallback for properly-sized but misaligned types makes it _way_ more complicated. – Alex Guteniev May 25 '20 at 05:48
  • 1
    @AlexGuteniev: I'd certainly hope something like `required_alignment` makes it into the final standard!! I wasn't implying that it might not, just that I'd seen an earlier sample implementation of `atomic_ref` a year or more ago that might not have had anything like that, but still blindly assumed alignment of the underlying object. (Making it unsafe for `int64_t` on 32-bit x86 for example.) Or like I said, maybe I just missed the existence of a `required_alignment` member and only noticed the safety risk. – Peter Cordes May 25 '20 at 05:50
  • Very informative guys. With my current sub-project, I'm happy to just use atomic operations even for the phases where I don't need atomic access, and the extra verbosity isn't bad anyway. I was mostly asking in order to inform my ideas going forward and your extra info over and above my initial question gives me a lot to shape my future direction with. – Swiss Frank May 25 '20 at 07:22