The question is almost a duplicate of Why is integer assignment on a naturally aligned variable atomic on x86?. The answer there does answer everything you ask, but this question is more focused on the ABI / compiler question of whether an int
(or other type?) will be sufficiently-aligned, rather than what happens when it is. There's other stuff in this question that's worth answering specifically, too.
Yes, they almost invariably will be on machines where an int
fits in a single register (e.g. not AVR: an 8-bit RISC), because compilers typically choose not to use multiple store instructions when they could use 1.
Normal x86 ABIs will align an int
to a 4B boundary, even inside structs (unless you use GNU C __attribute__((packed))
or the equivalent for other dialects). But beware that the i386 System V ABI only aligns double
to 4 bytes; it's only outside structs that modern compilers can go beyond that and give it natural alignment, making load/store atomic.
But nothing you can legally do in C++ can ever depend on this fact (because by definition it will involve a data race on a non-atomic
type so it's Undefined Behaviour). Fortunately, there are efficient ways to get the same result (i.e. about the same compiler-generated asm, without mfence
instructions or other slow stuff) that don't cause undefined behaviour.
You should use atomic
instead of volatile
or hoping that the compiler doesn't optimize away stores or loads on a non-volatile int
, because the assumption of async modification is one of the ways that volatile
and atomic
overlap.
I'm dealing with a real-time system, which is sampling some signal and putting the result into one global int, this is of course done in one thread. And in yet another thread I read this value and process it.
std::atomic
with .store(val, std::memory_order_relaxed)
and .load(std::memory_order_relaxed)
will give you exactly what you want here. The HW-access thread runs free and does plain ordinary x86 store instructions into the shared variable, while the reader thread does plain ordinary x86 load instructions.
This is the C++11 way to express that this is what you want, and you should expect it to compile to the same asm as with volatile
. (With maybe a couple instructions difference if you use clang, but nothing important.) If there was any case where volatile int
wouldn't have sufficient alignment, or any other corner cases, atomic<int>
will work (barring compiler bugs). Except maybe in a packed struct; IDK if compilers stop you from breaking atomicity by packing atomic types in structs.
In theory, you might want to use volatile std::atomic<int>
to make sure the compiler doesn't optimize out multiple stores to the same variable. See Why don't compilers merge redundant std::atomic writes?. But for now, compilers don't do that kind of optimization. (volatile std::atomic<int>
should still compile to the same light-weight asm.)
I know a single 4-byte int definitely fits into a typically 128-byte-long cache line, and if that int does store inside one cache line then I believe no issues here...
Cache lines are 64B on all mainstream x86 CPUs since PentiumIII; before that 32B lines were typical. (Well AMD Geode still uses 32B lines...) Pentium4 uses 64B lines, although it prefers to transfer them in pairs or something? Still, I think it's accurate to say that it really does use 64B lines, not 128B. This page lists it as 64B per line.
AFAIK, there are no x86 microarchitectures that used 128B lines in any level of cache.
Also, only Intel CPUs guarantee that cached unaligned stores / loads are atomic if they don't cross a cache-line boundary. The baseline atomicity guarantee for x86 in general (AMD/Intel/other) is don't cross an 8-byte boundary. See Why is integer assignment on a naturally aligned variable atomic on x86? for quotes from Intel/AMD manuals.
Natural alignment works on pretty much any ISA (not just x86) up to the maximum guaranteed-atomic width.
The code in your question wants a non-atomic read-modify write where the load and store are separately atomic, and impose no ordering on surrounding loads/stores.
As everyone has said, the right way to do this is with atomic<int>
, but nobody has pointed out exactly how. If you just n++
on atomic_int n
, you will get (for x86-64) lock add [n], 1
, which will be much slower than what you get with volatile
, because it makes the entire RMW operation atomic. (Perhaps this is why you were avoiding std::atomic<>
?)
#include <atomic>
volatile int vcount;
std::atomic <int> acount;
static_assert(alignof(vcount) == sizeof(vcount), "under-aligned volatile counter");
void inc_volatile() {
while(1) vcount++;
}
void inc_separately_atomic() {
while(1) {
int t = acount.load(std::memory_order_relaxed);
t++;
acount.store(t, std::memory_order_relaxed);
}
}
asm output from the Godbolt compiler explorer with gcc7.2 and clang5.0
Unsurprisingly, they both compile to equivalent asm with gcc/clang for x86-32 and x86-64. gcc makes identical asm for both, except for the address to increment:
# x86-64 gcc -O3
inc_volatile:
.L2:
mov eax, DWORD PTR vcount[rip]
add eax, 1
mov DWORD PTR vcount[rip], eax
jmp .L2
inc_separately_atomic():
.L5:
mov eax, DWORD PTR acount[rip]
add eax, 1
mov DWORD PTR acount[rip], eax
jmp .L5
clang optimizes better, and uses
inc_separately_atomic():
.LBB1_1:
add dword ptr [rip + acount], 1
jmp .LBB1_1
Note the lack of a lock
prefix, so inside the CPU this decodes to separate load, ALU add, and store uops. (See Can num++ be atomic for 'int num'?).
Besides smaller code-size, some of these uops can be micro-fused when they come from the same instruction, reducing front-end bottlenecks. (Totally irrelevant here; the loop bottlenecks on the 5 or 6 cycle latency of a store/reload. But if used as part of a larger loop, it would be relevant.) Unlike with a register operand, add [mem], 1
is better than inc [mem]
on Intel CPUs because it micro-fuses even more: INC instruction vs ADD 1: Does it matter?.
It's interesting that clang uses the less efficient inc dword ptr [rip + vcount]
for inc_volatile()
.
And how does an actual atomic RMW compile?
void inc_atomic_rmw() {
while(1) acount++;
}
# both gcc and clang do this:
.L7:
lock add DWORD PTR acount[rip], 1
jmp .L7
Alignment inside structs:
#include <stdint.h>
struct foo {
int a;
volatile double vdouble;
};
// will fail with -m32, in the SysV ABI.
static_assert(alignof(foo) == sizeof(double), "under-aligned volatile counter");
But atomic<double>
or atomic<unsigned long long>
will guarantee atomicity.
For 64-bit integer load/store on 32-bit machines, gcc uses SSE2 instructions. Some other compilers unfortunately use lock cmpxchg8b
, which is far less efficient for separate stores or loads. volatile long long
wouldn't give you that.
volatile double
usually would normally be atomic to load/store when aligned correctly, because the normal way is already to use single 8B load/store instructions.