
I want to know if there is any difference between std::atomic<int> and int if we are just doing loads and stores. I am not concerned about the memory ordering. For example, consider the code below:

#include <iostream>
#include <thread>
#include <chrono>
using namespace std;

int x{1};

void f(int myid) {
    while (1) {
        while (x != myid) {}   // spin until it's this thread's turn
        //cout << "thread : " << myid << "\n";
        //this_thread::sleep_for(3s);
        x = (x % 3) + 1;       // hand the turn to the next thread
    }
}

int main() {
    thread x[3];   // note: shadows the global x
    for (int i = 0; i < 3; i++) {
        x[i] = thread(f, i + 1);
    }
    for (int i = 0; i < 3; i++) {
        x[i].join();
    }
}

Now the output (if you uncomment the cout) will be

thread : 1
thread : 2
thread : 3
...

I want to know if there is any benefit in changing the `int x` to `atomic<int> x`?

cigien
Gtrex
    Did you try to Google it? E.g. [`std::atomic` on cppref](https://en.cppreference.com/w/cpp/atomic/atomic). You get defined behavior on concurrent read write – JHBonarius Oct 23 '20 at 19:48
  • "if we are just doing load and store" -- what else would you do with an int (or any variable, for that matter)? – Barmar Oct 23 '20 at 19:51
  • There's all sorts of different output you could get for this code, atomic or not.. – user4581301 Oct 23 '20 at 19:55
  • I am new to multithreading... can someone explain where the synchronization breaks here, provided the underlying hardware has atomic int load and store instructions? – Gtrex Oct 23 '20 at 20:06
  • `std::atomic<int>` is just a template specialization of `std::atomic`. There is a really well-explained answer about what `std::atomic` is here: https://stackoverflow.com/a/31978762/9580873 – Dayrion Oct 23 '20 at 19:50
  • Print your code assembly language. Interrupts can occur after any instruction. So pick critical spot, save the registers, then execute with a new thread. Search the internet for "concurrent execution". – Thomas Matthews Oct 23 '20 at 23:45
  • BTW, having a local `x[]` shadow a global `x` is pretty bad style. And if you want to count loop iterations before a thread exits, increment a counter and store it somewhere, instead of printing (which would force the compiler to spill vars from registers.) – Peter Cordes Oct 24 '20 at 11:37

2 Answers


Consider your code:

void f(int myid) {
    while(1){
        while(x!= myid){}
        //cout<<"thread : "<< myid<<"\n";
        //this_thread::sleep_for(std::chrono::duration(3s));
        x = (x % 3) + 1;
    }
}

If the program didn't have undefined behaviour, then you could expect that when f was called, x would be read from memory at least once. But having done that, the compiler has no reason to think that x will change outside the function, or that any changes to x made within the function need to be visible outside it before the function returns. So it's entitled to read x into a CPU register once and keep comparing that register value to myid - which means the inner loop will either fall through instantly or spin forever.

Then, compilers are allowed to assume every thread makes progress (see Forward Progress in the C++ Standard), so they could conclude that because the loop would never finish while x != myid, x must already be equal to myid, and remove the inner while loop. Similarly, the outer loop - simplified to while (1) x = (x % 3) + 1; where x might live in a register - makes no observable progress and could also be eliminated. Or, the compiler could keep the loop but remove the seemingly pointless operations on x.

Putting your code into the online Godbolt compiler explorer and compiling with GCC trunk at -O3 optimisation, the generated code for f(int) is:

f(int):
.L2:
    jmp     .L2

If you make x atomic, the compiler can no longer keep it in a register and assume there will be a good time to write it back before the function returns. It will actually have to load and store the variable in memory on every access, so changes propagate and other threads can read the updated value.

Tony Delroy
  • I totally agree on the register part. Let's say I provide a memory fence after the write operation - would it help? I know all this can be done with atomics, but I just want to know where the code breaks the synchronization. – Gtrex Oct 23 '20 at 20:37
  • [The Standard (most recent draft I can find) on Forward Progress](http://eel.is/c++draft/intro.progress) – user4581301 Oct 23 '20 at 20:37
  • @Gtrex: if you used inline asm or similar to inject a memory fence CPU instruction, no - because the CPU itself doesn't maintain a record of which memory address a CPU register may be later written back to; if you use `std::atomic_thread_fence` then yes - it can require the compiler to orchestrate writes towards memory around the `std::atomic_thread_fence` call, but do note that the side watching for changes to `x` also has to opt in to using fences (otherwise it could just keep looking at a register). Overall, it's much easier to make `x` itself atomic. – Tony Delroy Oct 23 '20 at 20:50
  • (Well, with GCC inline assembly, I think there's a notation to say that the assembly depends on the value of the global `x` variable, which would cause the compiler to issue machine code to update it before issuing the inline code; GCC's pretty handy at such things) – Tony Delroy Oct 23 '20 at 20:54
  • @TonyDelroy, thanks for the explanation, I think I got the answer I am looking for – Gtrex Oct 23 '20 at 20:57
  • @Gtrex: if you do want more details, just ask a separate question so people can address it properly (and find it when they're interested in the same). Cheers – Tony Delroy Oct 23 '20 at 21:30
  • @TonyDelroy: Yes, in GNU C/C++, `asm("" ::: "memory")` is a compiler barrier, forcing the actual memory contents to be in sync with the abstract machine. (Except for variables it can prove no other thread could have a reference to, like local variables whose address hasn't escaped this function...) – Peter Cordes Oct 24 '20 at 01:48
  • You could use a non-empty asm template, like x86 `asm("mfence" ::: "memory")`, to force this core to wait for e.g. the store buffer to drain before running later loads, giving even more ordering than the hardware ISA's native memory model. (e.g. x86's is program-order + a store buffer, so acquire and release fences are just compiler barriers, no asm instructions required unless you want seq_cst.) Also related, @Gtrex: [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) explains how/why you can in practice roll your own atomics, but don't. – Peter Cordes Oct 24 '20 at 01:49
  • @Gtrex: more directly related: [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/a/58527396) is an example that analyzes exactly how/why a program breaks when you use plain non-atomic vars without `volatile` or anything: optimization assumes no data-race UB. Same reasoning as this answer. Also [MCU programming - C++ O2 optimization breaks while loop](https://electronics.stackexchange.com/a/387478) – Peter Cordes Oct 24 '20 at 01:51
  • @PeterCordes Thanks Peter - appreciate the input and links. Cheers – Tony Delroy Oct 24 '20 at 05:06
  • Cheers. I forgot to link https://preshing.com/20120625/memory-ordering-at-compile-time/, preshing's articles are nicely beginner-friendly, and an interesting mix of C and x86 asm for low-level how-it-works mental models. (Written when C++11 was still very new, so roll-your-own atomics were still a thing.) – Peter Cordes Oct 24 '20 at 05:21
  • BTW, see [Who's afraid of a big bad optimizing compiler?](https://lwn.net/Articles/793253/) - using *just* barriers without `volatile` isn't a safe way to roll your own atomics. Yes you can force the compiler to access memory, but you can't stop it from potentially accessing *more than* once without `volatile`. e.g. it could invent loads and make code that sees 2 different values of a load when you thought you were just looking at the same local temporary. Among other things. I'm not sure my previous comments made that point; happened to see this question again while looking for a duplicate. – Peter Cordes Jun 14 '22 at 08:52

I want to know if there is any benefit in changing the int x to atomic<int> x?

You could say that. Turning int into atomic<int> in your example will turn your program from incorrect to correct (*).

Accessing the same int from multiple threads at the same time (without any form of access synchronization) is Undefined Behavior.


*) Well, the program might still be incorrect, but at least it avoids this particular problem.

bolov
  • It still might not be "correct", it just avoids undefined behavior. – Barmar Oct 23 '20 at 19:52
  • @Barmar, can you please explain why the program is not correct with atomics? – Gtrex Oct 23 '20 at 21:06
  • @Gtrex I said *might* not be correct. It could have other problems besides undefined behavior due to accessing the variable concurrently. – Barmar Oct 23 '20 at 21:08
  • @Gtrex Both threads could read the variable at the same time, and increment it to the same next value. – Barmar Oct 23 '20 at 21:09
  • I guess the thread will increment only if the value of x is equal to myid, which is thread-specific, and all the reads and writes to `x` are atomic with barriers, so the value is propagated to all threads. – Gtrex Oct 23 '20 at 21:11
  • @Gtrex: Reads and writes to `x` would be separately atomic, but the whole operation would *not* be one atomic RMW. Given threads A and B, you could have A:load B:load, B:store A:store, with A's store stepping on B's store, both storing the same value. To implement an atomic RMW that atomically replaces `x` with `(x%3) + 1` of the *same* value that was originally there, you could use a CAS retry loop with `compare_exchange_weak`. Or you could just `fetch_add` and derive the value to use from the atomic counter with a `%3`. Unfortunately 3 is relatively prime to 2^32, so wrapping won't work perfectly, though. – Peter Cordes Oct 24 '20 at 02:01
  • @PeterCordes, sorry for being investigative, but as you have mentioned, A:load B:load followed by B:store A:store can occur. I need to know how this condition will occur where both A:load and B:load happen (which is OK), but then both threads proceed to the write operation. I believe only one thread will get the chance to write, as both threads will see the same value for x (x being atomic here, and again assuming no compiler optimizations). – Gtrex Oct 24 '20 at 04:47
  • @Gtrex: Maybe a better example would be if A sleeps for a while between its load and store, and B does several load/store iterations before A eventually wakes up and stores its value. Then it's overwriting a value that is totally unrelated. That "sleep" could be due to an interrupt handler, or just MESI negotiation for which core owns the cache line, losing ownership of it before the write commits. (And BTW, without seq_cst to force an mfence / full barrier (the default for `x.store()` or `x = ...`), each thread can be reloading its own stores without seeing the other thread's stores.) – Peter Cordes Oct 24 '20 at 05:33
  • @Gtrex: but IDK why you think only one thread would get a chance to write; how would one CPU core know that another thread happened to have loaded the same value? I mean yes in that case the 2nd store wouldn't change the value in memory. The problem in my first example is just "losing counts". If you'd been doing `x = x + 1` (without wraparound at a tiny boundary), 10k times total, the final `x` value would *not* be 10k higher than the initial. Usually that's a problem. If not, yes avoid an RMW and just write it as `x = expression involving x`; Even better use `memory_order_release`. – Peter Cordes Oct 24 '20 at 05:36
  • @PeterCordes, I totally agree that `x = (x%3) + 1` is not an atomic RMW even if x is atomic. Two threads trying to execute this statement can produce an unpredictable result. You said "if A sleeps for a while between its load and store, and B does several load/store iterations before A eventually wakes up and stores its value". My question is why `while(x != myid)` would not allow serial access to `x = (x%3) + 1`. The load in the loop is atomic (seq_cst), so each thread would load the latest value, and the RMW (`x = (x%3) + 1`) would be visible to all threads because it is also seq_cst. – Gtrex Oct 24 '20 at 10:57
  • @Gtrex: Oh. Yes of course the read of `x` in `x != myid` can see a value of `x` stored by this or another thread. So what? Yes of course your threads will exit soon without any UB, instead of getting stuck in an infinite loop because of data-race UB. That would be true even with `memory_order_relaxed`, which still guarantees global visibility. seq_cst just gives ordering wrt. *other* operations. (And a guarantee that a global total order of atomic operations exists across multiple atomic objects. Most hardware already guarantees this; only a few can do IRIW reordering.) – Peter Cordes Oct 24 '20 at 11:40
  • Thanks again @PeterCordes for clearing that..I really appreciate the effort – Gtrex Oct 24 '20 at 12:20