How do the costs of atomic operations compare on common architectures such as x86, ARM, and PowerPC? Please state the specific version or microarchitecture each figure applies to.
Bonus points if you include a sourced estimate of cycle counts and explain the costs in terms of C11 memory orderings, or list the instructions each architecture uses.
Extra bonus points if you can include uncommon or proposed architectures such as RISC-V or the Mill architecture.
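
To make the question concrete, here is a minimal sketch of the kinds of operations I have in mind, written against C11 `<stdatomic.h>`. The function and variable names (`bump_relaxed`, `publish`, `counter`, and so on) are hypothetical and exist only for illustration; an answer could compare what each one compiles to and roughly what it costs on each architecture.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state, zero-initialized at static storage duration. */
static atomic_int counter;
static atomic_int payload;
static atomic_bool ready;

/* Read-modify-write with no ordering constraint beyond atomicity. */
int bump_relaxed(void) {
    return atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

/* The same read-modify-write, but sequentially consistent. */
int bump_seq_cst(void) {
    return atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
}

/* Release store: the earlier write to payload becomes visible to any
   thread that later observes ready == true with an acquire load. */
void publish(int value) {
    atomic_store_explicit(&payload, value, memory_order_relaxed);
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Acquire load paired with the release store in publish(). */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = atomic_load_explicit(&payload, memory_order_relaxed);
    return true;
}

/* Sequentially consistent compare-and-swap on the counter. */
bool cas_seq_cst(int expected, int desired) {
    return atomic_compare_exchange_strong_explicit(
        &counter, &expected, desired,
        memory_order_seq_cst, memory_order_seq_cst);
}
```

Compiling this with optimizations for each target and inspecting the generated assembly (for example with `-S`) would show which instructions each ordering maps to.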