Are 16-byte `atomic<>` variables automatically aligned on 16-byte boundaries, allowing the compiler/runtime libraries to efficiently use the x86 CMPXCHG16B instruction? Or should we, as a matter of style, always manually specify `alignas(16)` for all such variables?

1 Answer
Any decent implementation of `std::atomic<>` will use `alignas` itself to make `lock cmpxchg16b` efficient, if the library uses `lock cmpxchg16b` at all instead of a mutex for 16-byte objects.
Not all implementations do. For example, I think MSVC's standard library makes 16-byte objects fully non-lock-free, using the standard mutex fallback.
You don't need `alignas(16)` on `atomic<T>`.
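To see what a particular implementation actually does, one option is to inspect the alignment and lock-freedom of the atomic wrapper at compile time. The following is only a sketch: `Pair16` and the `kAtomicPair...` constants are hypothetical names made up for illustration, and the values they take are implementation-defined.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 16-byte payload (name and fields are illustrative only).
struct Pair16 {
    std::uint64_t lo;
    std::uint64_t hi;
};
static_assert(sizeof(Pair16) == 16, "this sketch assumes a 16-byte payload");

// Alignment the library picked for the wrapper; implementation-defined.
constexpr std::size_t kAtomicPairAlign = alignof(std::atomic<Pair16>);

// True if std::atomic<Pair16> is at least as aligned as it is large,
// so such an object can never straddle a 16-byte boundary.
constexpr bool kAtomicPairNaturallyAligned =
    alignof(std::atomic<Pair16>) >= sizeof(std::atomic<Pair16>);

#if defined(__cpp_lib_atomic_is_always_lock_free)
// C++17: compile-time answer to "is every object of this type lock-free?"
constexpr bool kAtomicPairAlwaysLockFree =
    std::atomic<Pair16>::is_always_lock_free;
#else
// Before C++17 there is no compile-time query; assume no guarantee.
constexpr bool kAtomicPairAlwaysLockFree = false;
#endif
```

A project that depends on the wrapper being naturally aligned could turn `kAtomicPairNaturallyAligned` into a `static_assert`, so that a library which behaves differently fails the build instead of silently changing behavior.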
You only need manual alignment for atomics if you have a plain `T` object that you want to use `atomic_ref` on. `atomic_ref<>` has no mechanism to align an already existing `T` object. The current version of the design exposes a `required_alignment` member you should use. It's up to you to do that for correctness. (Otherwise you get UB, which could mean tearing, or just extremely slow system-wide performance for split `lock` RMWs.)
    // for atomic_ref<T>
    alignas(std::atomic_ref<T>::required_alignment) T sometimes_atomic_var;

    // often equivalent, and doesn't require checking that atomic_ref<T> is supported
    alignas(std::atomic<T>) T sometimes_atomic_var;   // use the same alignment as atomic<T>
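As a slightly fuller sketch of the same idea, the alignment can be chosen with a feature-test macro, so the code still compiles where `atomic_ref` is unavailable. `Stats`, `kHitsAlign`, and `bump_hits` are hypothetical names invented for this example, and the fallback branch relies on the "often equivalent" assumption stated above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

#if defined(__cpp_lib_atomic_ref)
// C++20: ask atomic_ref directly what alignment it needs.
constexpr std::size_t kHitsAlign =
    std::atomic_ref<std::uint64_t>::required_alignment;
#else
// Fallback: borrow the alignment of std::atomic<T> (often equivalent).
constexpr std::size_t kHitsAlign = alignof(std::atomic<std::uint64_t>);
#endif

// Hypothetical struct with a plain member that is only sometimes
// accessed atomically.
struct Stats {
    alignas(kHitsAlign) std::uint64_t hits = 0;
};

#if defined(__cpp_lib_atomic_ref)
// Atomically increments s.hits and returns the new value.
inline std::uint64_t bump_hits(Stats& s) {
    std::atomic_ref<std::uint64_t> ref(s.hits);
    return ref.fetch_add(1, std::memory_order_relaxed) + 1;
}
#endif
```

During single-threaded phases, `s.hits` can be read and written as an ordinary `std::uint64_t`. While any `atomic_ref` to it exists, all accesses should go through `atomic_ref`.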
Note that a misaligned `lock cmpxchg16b` split across a cache line boundary would still be atomic but very, very slow (same as any `lock`ed instruction: the atomicity guarantee for atomic RMW is not contingent on alignment). It behaves more like an actual bus lock, instead of just a local-to-this-core cache lock delaying MESI responses.
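Whether a given access would be split this way is simple address arithmetic. A minimal sketch follows, assuming a 64-byte cache line (typical for current x86 CPUs, but an assumption here). `straddles_cache_line` and `kAssumedCacheLine` are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical on current x86 CPUs.
constexpr std::size_t kAssumedCacheLine = 64;

// Returns true if the byte range [addr, addr + size) touches more than
// one cache line, i.e. a lock-prefixed access to it would be a split lock.
constexpr bool straddles_cache_line(std::uintptr_t addr, std::size_t size,
                                    std::size_t line = kAssumedCacheLine) {
    return size != 0 && (addr / line) != ((addr + size - 1) / line);
}
```

For a real object one would pass `reinterpret_cast<std::uintptr_t>(&obj)` and `sizeof obj`. A 16-byte object that is 16-byte aligned can never straddle a 64-byte line, which is why natural alignment is enough to avoid split locks.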
Narrower atomics definitely need to be naturally aligned for correctness because pure-load and pure-store can compile to asm pure load or store where HW guarantees require some alignment.
But 16-byte objects are only guaranteed atomic with `lock cmpxchg16b`, so `.load()` and `.store()` have to be implemented with `lock cmpxchg16b`. (Load with CAS(0,0) to get the old value and either replace 0 with itself or do nothing, and store with a CAS retry loop. This sucks but is somewhat better than a mutex. It doesn't have the read-side scalability you'd expect from a lock-free load, which is one reason GCC 7 and later no longer advertise `atomic<16-byte-object>` as lock-free, even though they will still use `lock cmpxchg16b` in the libatomic functions they call instead of inlining `lock cmpxchg16b`.)
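The CAS-based load and store described above can be sketched with the portable `std::atomic` interface. This is not how libatomic is written, only an illustration of the idea. `load_via_cas` and `store_via_cas` are hypothetical helper names, and whether the calls compile to `lock cmpxchg16b` for a 16-byte `T` depends on the compiler, flags, and library (with GCC, 16-byte types may also require linking libatomic).

```cpp
#include <atomic>

// Load implemented as CAS(0, 0): if the object holds the all-zero value,
// the CAS succeeds and writes the same zero back; otherwise it fails and
// copies the current value into `expected`. Either way `expected` ends up
// holding the value observed in the object.
template <class T>
T load_via_cas(std::atomic<T>& obj) {
    T expected{};  // guess: value-initialized (all-zero) T
    T desired{};   // same value, so a successful CAS changes nothing
    obj.compare_exchange_strong(expected, desired);
    return expected;
}

// Store implemented as a CAS retry loop: keep trying to swap the current
// value for `desired` until no other thread has changed it in between.
template <class T>
void store_via_cas(std::atomic<T>& obj, T desired) {
    T expected = load_via_cas(obj);
    while (!obj.compare_exchange_weak(expected, desired)) {
        // On failure, `expected` was refreshed with the current value.
    }
}
```

Because even the load is a read-modify-write, every reader needs exclusive ownership of the cache line, which is the read-side scalability problem mentioned above.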

-
Perhaps suggest `static_assert(alignof(std::atomic<...>) >= sizeof(std::atomic<...>));` – Acorn May 25 '20 at 01:21
-
@Acorn: If you're trying to detect libraries that don't internally use `lock cmpxchg16b` at all, then maybe. e.g. if you're rolling your own atomic accesses to the object sometimes, and need to be sure that compiler-generated accesses are using `lock`ed instructions, not relying on some mutex that you'd ignore. – Peter Cordes May 25 '20 at 01:24
-
Might be worth mentioning in your answer: Linux 5.7 is getting a new option to send `SIGBUS` to processes that try to take a split lock. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6650cdd9a8ccf00555dbbe743d58541ad8feb6a7 – Joseph Sible-Reinstate Monica May 25 '20 at 02:03
-
@JosephSible-ReinstateMonica: There's also a perf counter for it on most hardware: `sq_misc.split_lock`. But yes, as that kernel patch notes, it's very very bad including for other cores, so much so that other HW features were introduced to detect it. Terminology note: the lock isn't an object that you "take", it's a modifier on the access. This is about instructions with a `lock` prefix that do one atomic transaction. – Peter Cordes May 25 '20 at 02:19
-
Yeah, my wording there was a bit sloppy. I meant processes that do a `lock`-prefixed instruction across cache lines that makes the **core** take the global bus lock. – Joseph Sible-Reinstate Monica May 25 '20 at 02:26
-
Manual specification of alignment may be required for `atomic_ref`, as it cannot control its _external_ underlying variable. A decent implementation of `atomic` will even pad, say, an 11-byte underlying type to 16. – Alex Guteniev May 25 '20 at 04:14
-
@AlexGuteniev: The OP didn't ask about atomic_ref in this question, but yes I guess it would be a good idea. (And I did recently answer one of the OP's other questions with `atomic_ref` as the answer.) – Peter Cordes May 25 '20 at 04:38
-
... _and maybe correctness depending on how careful the atomic_ref implementation is_ -- could you please elaborate on that in my question? Do you think a good implementation is expected to be correct in this case? https://stackoverflow.com/questions/61996108/atomic-ref-when-external-underlying-type-is-not-aligned-as-requested – Alex Guteniev May 25 '20 at 05:19
-
@AlexGuteniev: Last I looked, I don't think `atomic_ref` included a `required_alignment` member. My earlier phrasing was based on that; updated. Either I missed it or the standards committee only later noticed that atomicity sometimes requires more than `alignof(T)`, and that nobody wants compilers to emit code that checks alignment at runtime to decide whether a specific object is lock-free or not. (i.e. everyone hates `.is_lock_free()`, loves `.is_always_lock_free`) – Peter Cordes May 25 '20 at 05:36
-
N4861 includes `required_alignment`, so probably it will be in the final standard. I'm trying to contribute `atomic_ref` to MSVC ( https://github.com/microsoft/STL/pull/843 ), and found out that it is possible to reuse `atomic` implementation for `atomic_ref`, without breaking `atomic` ABI. But runtime fallback for properly-sized but misaligned types makes it _way_ more complicated. – Alex Guteniev May 25 '20 at 05:48
-
@AlexGuteniev: I'd certainly hope something like `required_alignment` makes it into the final standard!! I wasn't implying that it might not, just that I'd seen an earlier sample implementation of `atomic_ref` a year or more ago that might not have had anything like that, but still blindly assumed alignment of the underlying object. (Making it unsafe for `int64_t` on 32-bit x86 for example.) Or like I said, maybe I just missed the existence of a `required_alignment` member and only noticed the safety risk. – Peter Cordes May 25 '20 at 05:50
-
Very informative guys. With my current sub-project, I'm happy to just use atomic operations even for the phases where I don't need atomic access, and the extra verbosity isn't bad anyway. I was mostly asking in order to inform my ideas going forward and your extra info over and above my initial question gives me a lot to shape my future direction with. – Swiss Frank May 25 '20 at 07:22