4

I've read this; my question is quite similar, yet somewhat different.

Note: I know C++0x does not guarantee this, but I'm asking specifically about a multi-core machine like x86-64.

Let's say we have 2 threads (pinned to 2 physical cores) running the following code:

#include <cstdio>

// I know people may declare volatile useless, but here I do NOT care about memory reordering or synchronization.
// I just want to suppress the compiler optimization of keeping the value in a register.
volatile int n;

void thread1() {
    for (;;) {
        n = 0xABCD1234;
        // NOTE: I know ++n is not atomic,
        // but I do NOT care here.
        // What I care about is whether n can be 0x00001234, i.e. in the middle of the update
        // from core-1's cache lines to main memory, will core-2 see an incomplete value
        // (like the first 2 bytes lost)?
        ++n;
    }
}

void thread2() {
    while (true) {
        printf("%d\n", n);
    }
}

Is it possible for thread 2 to see n as something like 0x00001234, i.e. in the middle of the update from core-1's cache lines to main memory, will core-2 see an incomplete value?

I know a single 4-byte int definitely fits into a typical 128-byte cache line, and if that int is stored inside one cache line then I believe there are no issues here... but what if it crosses the cache-line boundary? I.e., could some char already sit inside that cache line so that the first part of n ends up in one cache line and the other part in the next line? If that is the case, then core-2 may have a chance of seeing an incomplete value, right?

Also, I think that unless every char, short, or other less-than-4-byte type is padded out to 4 bytes, one can never guarantee that a single int does not cross the cache-line boundary, right?

If so, would that suggest that, in general, even setting a single int is not guaranteed to be atomic on an x86-64 multi-core machine?

I got this question because, as I researched the topic, various people in various posts seem to agree that as long as the machine architecture is right (e.g. x86-64), setting an int should be atomic. But as I argued above, that does not hold, right?

UPDATE

I'd like to give some background for my question. I'm dealing with a real-time system which samples some signal and puts the result into one global int; this is of course done in one thread. In yet another thread I read this value and process it. I do not care about the ordering of the set and the get; all I need is a complete value (vs. a corrupted integer value).
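
For reference, here is a stripped-down sketch of what I mean (read_hw_register and process are made-up names for illustration, not my real code):

#include <cstdio>

volatile int g_sample;                 // written by the sampling thread, read by the processing thread

int read_hw_register() { return 42; }  // stand-in for the real hardware read
void process(int v)    { printf("%d\n", v); }

void sampling_thread()   { for (;;) g_sample = read_hw_register(); }  // one plain int store per sample
void processing_thread() { for (;;) process(g_sample); }              // one plain int load per iteration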

Julian
  • Why do you need to know that? Use std::atomic. Compiler writers already took care of this; you are unlikely to outperform them. – n. m. could be an AI Sep 07 '17 at 09:15
  • Note that having an `int` not be on a four-byte boundary is *much* slower on x86, so the compiler writers will add the necessary padding - unless you take specific steps to stop them. – Martin Bonner supports Monica Sep 07 '17 at 09:31
  • Martin, so in theory the cross-the-cache-boundary scenario can indeed happen? As for atomic, would it put some lock in the assembly? In my system, locks and synchronization are unnecessary and I just want to avoid them. – Julian Sep 07 '17 at 09:39
  • @MartinBonner: actually, modern x86 is *very* efficient with unaligned data. No penalty at all in most cases as long as no cache-line boundary is crossed. There will be a minor penalty if it crosses a cache line boundary, or (on Intel pre-Skylake) a major penalty if it crosses a 4k boundary. But on Skylake, load-use latency is still only about 11 cycles (instead of 4 or 5) for a load that crosses a 4k boundary. Still, it is slower, and can cause 2 cache misses, so most small data types do get naturally-aligned in most ABIs. – Peter Cordes Sep 13 '17 at 03:05
  • @PeterCordes : Sigh. You learn about how x86 architecture works ... and then they ***** well go and change it! It's still the case though that most compilers will naturally align things so that a single int never crosses a cache-line boundary. (But I wonder if it would be faster overall to pack things tightly in memory and get more in cache at the cost of occasional penalties when you cross a boundary.) – Martin Bonner supports Monica Sep 13 '17 at 07:08
  • @MartinBonner: Well the ABI rules set struct layout, so generally you should sort from large to small members or otherwise avoid wasting space on padding anyway. Outside of structs, most programs don't have a lot of static data. Compilers can do whatever they want in stack memory for automatic storage, and yeah it makes sense to align locals, since there is still a significant penalty for crossing a cache line boundary and the stack is only 16B aligned. Anyway, with good design it should be rare to waste much space on padding, so there's not much that one would do differently. – Peter Cordes Sep 13 '17 at 07:21

5 Answers

8

x86 guarantees this. C++ doesn't. If you write x86 assembly you will be fine. If you write C++ it is undefined behavior. Since you can't reason about undefined behavior (it is undefined after all) you have to go lower and look at the assembler instructions that were generated. If they do what you want then this is fine. Note, however, that compilers tend to change generated assembly when you change compilers, compiler versions, compiler flags or any code which might change the optimizer's behavior, so you will constantly have to check the assembler code to make sure it is still correct.

The easier way is to use std::atomic<int> which will guarantee that the correct assembler instructions are generated so you don't have to constantly check.
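
For example, you can inspect what a particular compiler/flag combination generates for the store (a minimal sketch; compile with g++ -O2 -S -masm=intel, or paste it into godbolt.org):

volatile int n;

void set_n() {
    n = 0xABCD1234;   // expect a single 32-bit store, e.g. mov DWORD PTR n[rip], -1412623820
}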

nwp
6

The other question talks about variables "properly aligned". If it crosses a cache-line, the variable is not properly aligned. An int will not do that unless you specifically ask the compiler to pack a struct, for example.

You also assume that using volatile int is better than atomic<int>. If volatile int is the perfect way to sync variables on your platform, surely the library implementer would also know that and store a volatile x inside atomic<x>.

There is no requirement that atomic<int> has to be extra slow just because it is standard. :-)
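
For example, the usual way to end up with a misaligned int is to ask for it explicitly. A minimal sketch using the GNU packed extension (the offsets assume a typical x86 ABI where alignof(int) == 4):

#include <cstddef>

struct Normal { char c; int n; };                          // compiler inserts 3 padding bytes before n
struct __attribute__((packed)) Packed { char c; int n; };  // GNU extension: no padding, n may be misaligned

static_assert(offsetof(Normal, n) == 4, "n placed on a 4-byte boundary");
static_assert(offsetof(Packed, n) == 1, "n starts at offset 1 when packed");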

Bo Persson
  • Beware that `atomic<int> n;` and then `n++` will be extra slow, because it makes the entire read-modify-write atomic, instead of just the loads/stores separately atomic. The correct way to write what the OP did above is with `std::atomic`, but it's more verbose. See [my answer](https://stackoverflow.com/questions/46092373/are-reads-and-writes-of-an-int-in-c-atomic-on-x86-64-multi-core-machine/46188132#46188132). – Peter Cordes Sep 13 '17 at 02:59
3

If you're looking for an atomicity guarantee, std::atomic<> is your friend. Don't rely on the volatile qualifier.

n. m. could be an AI
Michał Fita
  • Sorry if I didn't help. – Michał Fita Sep 07 '17 at 09:17
  • @Julian The remark about `volatile` is a valid one since it creates a benign race on some platforms (i.e. it works the way it was intended). If you remove `volatile`, `thread2` may no longer observe updates to `n`. – LWimsey Sep 07 '17 at 09:28
  • I'm not sure I understand the downvote. A short answer is not necessarily a poor one. – Bathsheba Sep 07 '17 at 09:31
  • Michal F and LWimsey, thanks for the reply. My point is that I know volatile does not guarantee synchronization or memory ordering; it only tells the compiler to save the value back to memory instead of keeping it in a register. My question is about whether setting an integer is atomic or not. – Julian Sep 07 '17 at 09:32
  • @Julian you had the answer in the question you linked - proper Intel documentation included. – Michał Fita Sep 07 '17 at 10:10
3

Why worry so much?

Rely on your implementation. std::atomic<int> will reduce to an int if int is atomic on your platform (and on x86-64 it is, if properly aligned).

I'd also be concerned about the possibility of int overflow with your code (which is undefined behaviour), if I were you.

In other words std::atomic<unsigned> is the appropriate type here.
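
If you want the implementation to confirm that at compile time, C++17 provides a check; a minimal sketch:

#include <atomic>

std::atomic<unsigned> sample;

// Holds on x86-64: loads/stores of the atomic compile to plain instructions, not library calls.
static_assert(std::atomic<unsigned>::is_always_lock_free, "atomic<unsigned> is not lock-free here");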

Bathsheba
  • This is not that simple. Atomic increment is much, much slower than a plain increment. And accessing an aligned int is atomic anyway on x86 (an aligned int never crosses cache lines). So if the OP knows what he's doing (i.e. it doesn't matter whether n++ is atomic or not), it is fine not to use atomic in this case. – geza Sep 07 '17 at 10:16
  • @geza: you can do `var.store( 1+var.load(mo_relaxed), mo_relaxed)` to emulate `++` on a `volatile`. But yes, `alignof(int)=4` in all real 32-bit and 64-bit ABIs, and compilers don't normally access it with separate byte stores (although they're allowed to). Still, if you use it with the minimum memory-order for what you need, `atomic` is no less efficient than `volatile`. Hard to guarantee that plain `int` is safe, though. – Peter Cordes Nov 27 '19 at 06:23
2

The question is almost a duplicate of Why is integer assignment on a naturally aligned variable atomic on x86?. The answer there does answer everything you ask, but this question is more focused on the ABI / compiler question of whether an int (or other type?) will be sufficiently-aligned, rather than what happens when it is. There's other stuff in this question that's worth answering specifically, too.


Yes, an int will almost invariably be sufficiently aligned, and accessed with a single instruction, on machines where an int fits in a single register (e.g. not AVR: an 8-bit RISC), because compilers typically choose not to use multiple store instructions when they could use one.

Normal x86 ABIs will align an int to a 4B boundary, even inside structs (unless you use GNU C __attribute__((packed)) or the equivalent for other dialects). But beware that the i386 System V ABI only aligns double to 4 bytes; it's only outside structs that modern compilers can go beyond that and give it natural alignment, making load/store atomic.

But nothing you can legally do in C++ can ever depend on this fact (because by definition it will involve a data race on a non-atomic type so it's Undefined Behaviour). Fortunately, there are efficient ways to get the same result (i.e. about the same compiler-generated asm, without mfence instructions or other slow stuff) that don't cause undefined behaviour.

You should use atomic<int> instead of volatile, or instead of hoping that the compiler doesn't optimize away stores or loads on a non-volatile int; the assumption of asynchronous modification is one of the ways that volatile and atomic overlap.

I'm dealing with a real-time system which samples some signal and puts the result into one global int; this is of course done in one thread. In yet another thread I read this value and process it.

std::atomic<int> with .store(val, std::memory_order_relaxed) and .load(std::memory_order_relaxed) will give you exactly what you want here. The HW-access thread runs free and does plain ordinary x86 store instructions into the shared variable, while the reader thread does plain ordinary x86 load instructions.
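
For example, a sketch of that pattern (sample_hardware and process are placeholders for your real HW read and processing):

#include <atomic>

std::atomic<int> shared_sample;         // naturally aligned, lock-free on x86-64

int sample_hardware();                  // placeholder: the real HW read
void process(int);                      // placeholder: the real processing

void hw_thread() {
    for (;;)
        shared_sample.store(sample_hardware(), std::memory_order_relaxed);  // plain mov store
}

void reader_thread() {
    for (;;)
        process(shared_sample.load(std::memory_order_relaxed));             // plain mov load, never torn
}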

This is the C++11 way to express that this is what you want, and you should expect it to compile to the same asm as with volatile. (With maybe a couple instructions' difference if you use clang, but nothing important.) If there were any case where volatile int wouldn't have sufficient alignment, or any other corner case, atomic<int> would still work (barring compiler bugs). Except maybe in a packed struct; IDK if compilers stop you from breaking atomicity by packing atomic types in structs.

In theory, you might want to use volatile std::atomic<int> to make sure the compiler doesn't optimize out multiple stores to the same variable. See Why don't compilers merge redundant std::atomic writes?. But for now, compilers don't do that kind of optimization. (volatile std::atomic<int> should still compile to the same light-weight asm.)


I know a single 4-byte int definitely fits into a typical 128-byte cache line, and if that int is stored inside one cache line then I believe there are no issues here...

Cache lines are 64B on all mainstream x86 CPUs since the Pentium III; before that, 32B lines were typical. (Well, AMD Geode still uses 32B lines...) The Pentium 4 uses 64B lines, although it prefers to transfer them in pairs or something; still, I think it's accurate to say that it really does use 64B lines, not 128B. This page lists it as 64B per line.

AFAIK, there are no x86 microarchitectures that used 128B lines in any level of cache.

Also, only Intel CPUs guarantee that cached unaligned stores / loads are atomic if they don't cross a cache-line boundary. The baseline atomicity guarantee for x86 in general (AMD/Intel/other) only covers accesses that don't cross an 8-byte boundary. See Why is integer assignment on a naturally aligned variable atomic on x86? for quotes from Intel/AMD manuals.

Natural alignment works on pretty much any ISA (not just x86) up to the maximum guaranteed-atomic width.
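
If you want a run-time sanity check that a given object can't be torn across a line, something like this works (a sketch assuming the 64B line size discussed above):

#include <cassert>
#include <cstddef>
#include <cstdint>

bool crosses_line(const void* p, std::size_t size, std::size_t line = 64) {
    auto a = reinterpret_cast<std::uintptr_t>(p);
    return (a / line) != ((a + size - 1) / line);   // first and last byte on different lines?
}

int n;

int main() {
    assert(reinterpret_cast<std::uintptr_t>(&n) % alignof(int) == 0);  // the ABI gives int natural alignment
    assert(!crosses_line(&n, sizeof n));                               // so it cannot span a cache line
}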


The code in your question wants a non-atomic read-modify-write where the load and store are separately atomic, and which imposes no ordering on surrounding loads/stores.

As everyone has said, the right way to do this is with atomic<int>, but nobody has pointed out exactly how. If you just do n++ on an atomic_int n, you will get (for x86-64) lock add [n], 1, which will be much slower than what you get with volatile, because it makes the entire RMW operation atomic. (Perhaps this is why you were avoiding std::atomic<>?)

#include <atomic>
volatile int vcount;
std::atomic<int> acount;
static_assert(alignof(vcount) == sizeof(vcount), "under-aligned volatile counter");

void inc_volatile() {
    while(1) vcount++;
}
void inc_separately_atomic() {
    while(1) {
        int t = acount.load(std::memory_order_relaxed);
        t++;
        acount.store(t, std::memory_order_relaxed);
    }
}

asm output from the Godbolt compiler explorer with gcc7.2 and clang5.0

Unsurprisingly, they both compile to equivalent asm with gcc/clang for x86-32 and x86-64. gcc makes identical asm for both, except for the address to increment:

# x86-64 gcc -O3
inc_volatile:
.L2:
    mov     eax, DWORD PTR vcount[rip]
    add     eax, 1
    mov     DWORD PTR vcount[rip], eax
    jmp     .L2
inc_separately_atomic():
.L5:
    mov     eax, DWORD PTR acount[rip]
    add     eax, 1
    mov     DWORD PTR acount[rip], eax
    jmp     .L5

clang optimizes better, and uses

inc_separately_atomic():
.LBB1_1:
        add     dword ptr [rip + acount], 1
        jmp     .LBB1_1

Note the lack of a lock prefix, so inside the CPU this decodes to separate load, ALU add, and store uops. (See Can num++ be atomic for 'int num'?).

Besides smaller code-size, some of these uops can be micro-fused when they come from the same instruction, reducing front-end bottlenecks. (Totally irrelevant here; the loop bottlenecks on the 5 or 6 cycle latency of a store/reload. But if used as part of a larger loop, it would be relevant.) Unlike with a register operand, add [mem], 1 is better than inc [mem] on Intel CPUs because it micro-fuses even more: INC instruction vs ADD 1: Does it matter?.

It's interesting that clang uses the less efficient inc dword ptr [rip + vcount] for inc_volatile().


And how does an actual atomic RMW compile?

void inc_atomic_rmw() {
    while(1) acount++;
}

# both gcc and clang do this:
.L7:
    lock add        DWORD PTR acount[rip], 1
    jmp     .L7

Alignment inside structs:

#include <stdint.h>
struct foo {
    int a;
    volatile double vdouble;
};

// will fail with -m32, in the SysV ABI.
static_assert(alignof(foo) == sizeof(double), "under-aligned volatile counter");

But atomic<double> or atomic<unsigned long long> will guarantee atomicity.

For 64-bit integer load/store on 32-bit machines, gcc uses SSE2 instructions. Some other compilers unfortunately use lock cmpxchg8b, which is far less efficient for separate stores or loads. volatile long long wouldn't give you that.

volatile double would normally be atomic to load/store when aligned correctly, because the normal way is already to use single 8B load/store instructions.
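
A quick way to see what your implementation gives you for the 8-byte cases (a sketch; is_lock_free() is a run-time query because it can depend on how the object ends up aligned):

#include <atomic>
#include <cstdio>

std::atomic<double> d;
std::atomic<unsigned long long> u64;

int main() {
    // Expected to print "1 1" on x86-64, and typically also on 32-bit x86 (via 8B SSE2/x87 load/store
    // or cmpxchg8b, depending on the compiler).
    printf("%d %d\n", (int)d.is_lock_free(), (int)u64.is_lock_free());
}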

Peter Cordes
  • Interesting that clang combined the atomic load and store into an RMW. It has been hard to find examples of optimizations of atomic ops so far as most compilers have been conservative, but maybe that's changing (or maybe this case existed all along and I never saw it. ) – BeeOnRope Sep 13 '17 at 20:10
  • @BeeOnRope: That's not at all the same as optimizing away any of the operations in the source (e.g. coalescing multiple stores so an intermediate value is never globally visible). It's merely doing the load+add+store with one instruction. The fact that gcc *doesn't* shows us something about how it handles `volatile` / `atomic` accesses internally vs. how it optimizes during the final code-gen stage or something, since it misses this optimization which preserves behaviour exactly (except on uniprocessor, where it does make it an atomic RMW, but clang isn't thinking about that I'm sure). – Peter Cordes Sep 13 '17 at 22:23
  • Yes, but I never mentioned "optimizing away", did I? I have never seen any optimization at all where a `load()` or `store()` on an `atomic` was turned into anything other than a load or store at the assembly level 1:1 in the source, _except_ on `clang` _if_ the compiler knew the variable was completely local (e.g,. on the stack) at which point clang could fully optimize it. Here's a non-local `atomic` variable being significantly optimized (it eliminated the local entirely, and applied the effective op via RMW to the atomic). – BeeOnRope Sep 13 '17 at 22:53
  • ... I wonder if it was peepholed and if the peepholer knows about ordering restrictions from the source or whether it simply only applies peepholes that can't affect the ordering. – BeeOnRope Sep 13 '17 at 22:58
  • @BeeOnRope: I misunderstood your comment. You said "most compilers have been conservative", implying (I thought) there was some kind of observable behaviour change in this optimization, so I thought you were talking about the other kind of optimization: https://stackoverflow.com/questions/45960387/why-dont-compilers-merge-redundant-stdatomic-writes. Anyway, gcc has always failed to optimize even `volatile`, and it appears clang3.6 was the first clang version to do this optimization. – Peter Cordes Sep 13 '17 at 23:03
  • @Bee: There are other optimizations clang still fails to do with `atomic`, but does with `volatile`. e.g. folding a volatile load into a memory source operand for `add` or `imul`. But it won't with an `atomic` relaxed load. https://godbolt.org/g/4ASe2K see `fetch()`. – Peter Cordes Sep 13 '17 at 23:05
  • Right, well I meant any kind of optimization, including merging redundant stores, coalescing overlapping loads, etc, etc and including replacing the above with a RMW. I don't draw a specific line anywhere. Before I had only seen the "provably local" optimizations. It does seem to be a change in behavior, starting with clang 3.6. – BeeOnRope Sep 13 '17 at 23:20
  • @BeeOnRope: Interesting point, I guess it does change behaviour for a signal handler (or on a UP system). It's obviously a different kind of change, and far less aggressive, than coalescing stores or loads. I wonder if whoever implemented it even considered the behaviour change; because to me it's more like finding one of those missed optimizations in `fetch()`, folding a load into a memory source for `add`, which is definitely purely local. Lots of those optimizations are still missed, so it's not like clang has decided to start optimizing atomics. As you say, maybe a simple peephole. – Peter Cordes Sep 13 '17 at 23:43
  • Well it's more aggressive, IMO, than say eliminating two [totally dead consecutive loads](https://godbolt.org/g/4o7msq) (whose values are never used). Those loads can simply be dropped (or combined into one). What I suspect could be happening is that this is caught in a peephole because turning read-mod-write ops into a single RMW is a real thing that could be useful in a peepholer (and may not happen in earlier forms this is quite arch-specific), while stuff like eliminating dead loads happens earlier. The former still happens in this case, but the latter doesn't because of the phase. – BeeOnRope Sep 13 '17 at 23:46
  • Sorry by "observable behaviour change" I thought you meant that the compiler had changed its optimization strategy over time to be more aggressive (apparently it did, but it may or may not be atomics-related), but it's clear you were talking about changing program behavior. That's tougher to talk about - since the allowable states of the optimized version is a strict subset of the original, I treat this as not an interesting behavior change, even on UP. I.e., "you'll never see state X anymore" is only interesting if before there always some way to see state X. – BeeOnRope Sep 13 '17 at 23:52
  • @BeeOnRope: oh good, you aren't crazy. :P Yeah, I didn't think it was a very relevant change in program behaviour. Not like an optimization that shrinks (to zero?) the window for another thread to get access to something by atomically doing something that used to be separate. – Peter Cordes Sep 14 '17 at 00:03
  • Thanks for the detailed explanation. But I'm not convinced my question is a dup of your link. My question is more about asking (apologies if it doesn't seem so to you) whether a normal program (WITHOUT special tricks making the compiler unalign things) would end up with some unaligned data that is likely to cross the cache boundary (and hence not be atomic). Also, on the volatile bit, I was using it because I don't want the compiler to keep it in a register, as I want multiple cores to compete doing the assignment. As I said, I don't care about the instruction re-ordering issue here. – Julian Sep 14 '17 at 02:31
  • Moreover, I'm even fine with missing some values just because the multi-core system does the assignments too quickly. My only requirement is that the value set or read is not corrupted. Normally people may suggest I use atomic instead of volatile (as they did in this post), but the standard does say there is one special scenario that volatile can fit, and I think my case (HW measuring the data and the program processing it in real time) seems to match that scenario. – Julian Sep 14 '17 at 02:33
  • @Julian: My answer on the other question does point out that typical ABIs align `int` to 4B, but yeah, that's kind of the main focus of this question, but a side-note there. That's why I decided not to close this as a duplicate. – Peter Cordes Sep 14 '17 at 02:44
  • @Julian: Yes, if you memory-map the hardware you're reading into your process's memory, you should access it through a `volatile int*` or `volatile char*`. But once you've read the data out of the hardware, getting it to another thread has *nothing* to do with where it came from. Using `volatile` correctly in one part of your program doesn't somehow make it correct to use it for anything else in the same program. As my answer shows, `std::atomic` `.load()` and `.store()` with `memory_order_relaxed` will give pretty much the same thing as `volatile`. – Peter Cordes Sep 14 '17 at 02:52
  • @Julian: updated my answer with a section to explain that `atomic` should have no extra overhead vs. `volatile`. Why do you want to avoid `atomic`? If your program ever wants to do anything more than spam results in one thread and sample in another, using `atomic` means you're all set to add more synchronization. – Peter Cordes Sep 14 '17 at 03:09