6

I'm trying to demonstrate that it's a very bad idea not to use std::atomic<>s, but I can't manage to create an example that reproduces the failure. I have two threads, and one of them does:

{
    foobar = false;
}

and the other:

{
    if (foobar) {
        // ...
    }
}

The type of foobar is either bool or std::atomic_bool, and it's initialized to true. I'm using OS X Yosemite and even tried to use this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops etc., and in any case there's no observable difference in execution. I ended up inspecting the generated assembly with clang (`clang -std=c++11 -lstdc++ -O3 -S test.cpp`) and I see that the differences on the read side are minor (without atomic on the left, with atomic on the right):

[screenshot: read assembly, non-atomic on the left, atomic on the right]

No mfence or anything that "dramatic". On the write side, something more "dramatic" happens:

[screenshot: write assembly, non-atomic on the left, atomic on the right]

As you can see, the atomic<> version uses xchgb, which has an implicit lock. When I compile with a relatively old version of gcc (v4.5.2) I can see all sorts of mfences being added, which also indicates there's a serious concern.

I kind of understand that "x86 implements a very strong memory model" (ref) and that mfences might not be necessary, but does this mean that, unless I want to write cross-platform code that e.g. supports ARM, I don't really need to put in any atomic<>s unless I care about consistency at the ns level?
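For concreteness, here's a minimal sketch of the codegen difference as I understand it (assuming clang or gcc on x86-64 at -O2 or higher; the exact instructions vary by compiler and version):

#include <atomic>

std::atomic<bool> flag{true};

void write_seq_cst() {
    flag.store(false);                             // default seq_cst: xchg (implicit lock) on x86
}

void write_relaxed() {
    flag.store(false, std::memory_order_relaxed);  // plain mov on x86
}

bool read_any() {
    return flag.load();                            // plain mov on x86; even seq_cst loads are cheap
}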

I've watched "atomic<> Weapons" from Herb Sutter but I'm still impressed with how difficult it is to create a simple example that reproduces those problems.

neverlastn
  • 2,164
  • 16
  • 23
  • Is your bool actually accessible from multiple threads? Best to provide full example program, not just a snippet with relevant details missing! Also, gcc 4.5 is rather old, optimizers have evolved since then. – hyde Sep 10 '16 at 18:14
  • 2
    Generally, the rules of a programming language *never* give you a guaranteed method of producing a problem. It's the other way round: The rules only give you a method to create a program that is guaranteed to *not* have problems. – Kerrek SB Sep 10 '16 at 18:14
  • Are you sure what you _really_ want isn't just demonstrating that "it's a very bad idea for different threads to use the same variables/heap areas with no synchronization"? If that's the case, it does not translate to "not using `std::atomic` is a very bad idea". – einpoklum Sep 10 '16 at 18:21
  • 1
    Related: https://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num – milleniumbug Sep 10 '16 at 18:21
  • @einpoklum I'm just reviewing code very similar to the one Sebastian Redl has in his comment below. So yes - it's not the general problem of race conditions/lack of synchronization, which is better understood and easier to get consensus on. This is the "here's my code, it's ok and if you have any problem give me a counter-example... I think I'm ok and I don't care what the standard says or doesn't say" situation :) – neverlastn Sep 10 '16 at 18:43
  • Be sure to look at the part of Herb's presentation where he shows what a compiler's back-end must do to implement atomics on various processor architectures. It is somewhere in part 2. Enough to make you realize you are looking at the wrong code: reads are easy on x86 and just require treating the variable as volatile. The write requires extra code. – Hans Passant Sep 11 '16 at 13:01
  • @HansPassant, very good points. I've updated the question. – neverlastn Sep 11 '16 at 15:21

3 Answers

5

The big problem with data races is that they're undefined behavior, not guaranteed wrong behavior. And this, in conjunction with the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproducible failures.

A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one code line.

Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.

bool operation_done;  // plain bool, shared between threads without synchronization

void thread1() {
  // Spin until the other thread reports completion.
  while (!operation_done) {
    sleep();
  }
  // do something that depends on operation being done
}

void thread2() {
  // do the operation
  operation_done = true;  // unsynchronized write: a data race
}

This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.

The problem was that the optimizer realized that under its memory model, operation_done could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition was true once, it would be true forever.

Changing the type of operation_done to atomic_bool (or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
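In C++11 terms, the fixed version looks roughly like this (a sketch; sleep() stands in for whatever waiting primitive the real code used):

#include <atomic>

std::atomic<bool> operation_done{false};

void thread1() {
  while (!operation_done) {  // atomic load: must actually be re-read each iteration
    sleep();
  }
  // safe: thread2's store happens-before this point
}

void thread2() {
  // do the operation
  operation_done = true;     // atomic store (seq_cst by default)
}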

Sebastian Redl
  • 69,373
  • 8
  • 123
  • 157
  • 5
    @gnasher729 That's not what volatile means in C++, so you know ... glasshouses. – Sebastian Redl Sep 10 '16 at 18:23
  • @SebastianRedl: I was actually under the wrong impression gnasher729 was right - until I read the docs again. Perhaps you could add a sentence to your answer explaining why volatile won't do here. – einpoklum Sep 10 '16 at 18:26
  • @einpoklum: Since `volatile` wasn't brought up in the OP, it might actually be counterproductive to introduce the idea just to shoot it down. –  Sep 10 '16 at 18:30
  • 1
    @Hurkyl: But the OP did ask about what was _necessary_ for multi-threaded work - it's natural to ask yourself "wait, isn't `volatile` enough?" – einpoklum Sep 10 '16 at 18:32
  • @einpoklum: It's only natural to ask yourself if `volatile` is enough if you have previously come across the idea that `volatile` is how you do these things. For someone who hasn't, it's a tradeoff between the dangers of introducing them to a wrong idea they haven't encountered yet vs. the benefit of preemptively dissuading them in case they encounter it in the future. –  Sep 10 '16 at 18:34
  • @SebastianRedl - was it any well known compiler that was eliminating the condition? Do you remember version? – neverlastn Sep 10 '16 at 18:37
  • 1
    @Hurkyl: People who use Java or C# are likely to have come across that idea, and Java is the most popular programming language these days AFAIK. – einpoklum Sep 10 '16 at 18:37
  • @neverlastn It was the Intel C++ compiler for Windows, I think version 10 but not sure about that. – Sebastian Redl Sep 10 '16 at 19:02
  • @einpoklum Not only that, MSVC sometimes interprets `volatile` as `memory_order_seq_cst`, creating even more confusion. This seems to be the default for non-ARM architectures, see [here](https://msdn.microsoft.com/en-us/library/jj204392.aspx). – sbabbi Sep 10 '16 at 22:09
1

This is my own version of @Sebastian Redl's answer that fits the question more closely. I will still accept his for credit, plus kudos to @HansPassant for his comment, which brought my attention back to writes and made everything clear: as soon as I observed that the compiler was adding synchronization on writes, the problem turned out to be that it wasn't optimizing the bool case as much as one would expect.

I was able to write a trivial program that reproduces the problem:

#include <atomic>
#include <iostream>
#include <unistd.h>

std::atomic_bool foobar(true);
//bool foobar = true;

long long cnt = 0;
long long loops = 400000000ll;

void thread_1() {
    usleep(200000);       // let thread_2 run for a while first
    foobar = false;
}

void thread_2() {
    while (loops--) {
        if (foobar) {     // the read of interest
            ++cnt;
        }
    }
    std::cout << cnt << std::endl;
}
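For reference, a minimal pthread driver along these lines (a sketch, not my exact code; its pthread_join() is the one mentioned further down):

#include <pthread.h>

static void* run_1(void*) { thread_1(); return nullptr; }
static void* run_2(void*) { thread_2(); return nullptr; }

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, nullptr, run_1, nullptr);
    pthread_create(&t2, nullptr, run_2, nullptr);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    return 0;
}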

The main difference from my original code was that I used to have a usleep() inside the while loop; that alone was enough to prevent any optimizations within the loop. The cleaner code above yields the same asm for the write:

[screenshot: write assembly, same as before]

but quite different for read:

[screenshot: read assembly, non-atomic on the left, atomic on the right]

We can see that in the bool case (left) clang hoisted the if (foobar) out of the loop. Thus when I run the bool case I get:

400000000

real    0m1.044s
user    0m1.032s
sys 0m0.005s

while when I run the atomic_bool case I get:

95393578

real    0m0.420s
user    0m0.414s
sys 0m0.003s

It's interesting that the atomic_bool case is faster - I guess because it does just ~95 million increments of the counter, as opposed to 400 million in the bool case.
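In source terms, my reading of that hoisting is roughly this (an approximation of the assembly, not literal compiler output):

// Approximate source-level equivalent of the optimized bool version:
if (foobar) {              // single read, hoisted out of the loop
    while (loops--) {
        ++cnt;             // increment is now unconditional inside the loop
    }
} else {
    while (loops--) { }
}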

What is even more crazy-interesting, though, is this: if I move the std::cout << cnt << std::endl; out of the threaded code, to after pthread_join(), the loop in the non-atomic case becomes just this:

[screenshot: non-atomic read assembly with the print after pthread_join() - no loop at all]

i.e. there's no loop. It's effectively just if (foobar != 0) cnt = loops;. Clever clang. Then the execution yields:

400000000

real    0m0.206s
user    0m0.001s
sys 0m0.002s

while the atomic_bool remains the same.

So, more than enough evidence that we should use atomics. The only thing to remember: don't put any usleep() in your benchmarks, because even if it's small it will prevent quite a few compiler optimizations.

neverlastn
  • 2,164
  • 16
  • 23
-8

In general, it is very rare that the use of atomic types actually does anything useful for you in multithreaded situations. It is more useful to implement things like mutexes, semaphores and so on.

One reason why it's not very useful: As soon as you have two values that both need to be changed in an atomic way, you are absolutely stuck. You can't do it with atomic values. And it's quite rare that I want to change a single value in an atomic way.

On iOS and Mac OS X, the three methods to use are: protecting the change using @synchronized; avoiding multi-threaded access by running the code on a serial queue (which may be the main queue); and using mutexes.

I hope you are aware that atomicity for boolean values is rather pointless. What you have is a race condition: One thread stores a value, another reads it. Atomicity doesn't make a difference here. It makes (or might make) a difference if two threads accessing a variable at exactly the same time causes problems. For example, if a variable is incremented on two threads at exactly the same time, is it guaranteed that the final result is increased by two? That requires atomicity (or one of the methods mentioned earlier).
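As a sketch of that increment scenario (a hypothetical demo; the plain counter typically loses updates while the atomic one never does):

#include <atomic>
#include <iostream>
#include <thread>

int plain_cnt = 0;               // data race: concurrent ++ can lose updates
std::atomic<int> atomic_cnt{0};  // every fetch_add lands exactly once

void work() {
    for (int i = 0; i < 1000000; ++i) {
        ++plain_cnt;                                        // non-atomic load-modify-store
        atomic_cnt.fetch_add(1, std::memory_order_relaxed); // atomic increment
    }
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    // atomic_cnt is always 2000000; plain_cnt usually falls short
    std::cout << plain_cnt << " vs " << atomic_cnt << std::endl;
    return 0;
}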

Sebastian makes the ridiculous claim that atomicity fixes the data race: That's nonsense. In a data race, a reader will read a value either before or after it is changed; whether that value is atomic or not doesn't make any difference whatsoever. The reader will read the old value or the new value, so the behaviour is unpredictable. All that atomicity does is prevent the reader from seeing some in-between state. Which doesn't fix the data race.

gnasher729
  • 51,477
  • 5
  • 75
  • 98
  • Generally atomic data types provide methods to update them atomically, test-and-modify-if-ok kind of operations. – hyde Sep 10 '16 at 18:16
  • 2
    The atomic fixes the data race, which totally makes a difference (even if it might not always show). Whether there is a bigger-picture race condition is not visible from the posted snippet. – Sebastian Redl Sep 10 '16 at 18:18
  • @hyde: I didn't say what they do, I said they rarely do anything that is useful for you. – gnasher729 Sep 10 '16 at 18:18
  • 3
    @gnasher729 1) You continue to use derogatory language. 2) You continue to show lack of knowledge about C++ details - "data race" has a specific technical meaning that using atomic variables fixes. – Sebastian Redl Sep 10 '16 at 18:25
  • 1
  • My understanding is that the write from one thread, to a boolean or not, _might_ not ever be visible in other threads if there isn't some barrier that expresses a happens-before relationship. In the example above there's nothing with happens-before semantics, thus the fact that it works on Intels and AMDs is a coincidental implementation detail. – neverlastn Sep 10 '16 at 18:29
  • 2
    *"In general, it is very rare that the use of atomic types actually does anything useful for you in multithreaded situations"*, no, it is not rare at all to need a thread-safe single number, for example to get unique incrementing id number or to count something. And you definitely want atomic booleans for inter-thread flags, otherwise you don't have synchronization of value between threads (`volatile` is often abused for this purpose, because just forcing value to be accessed from memory is enough on some platforms/CPUs). – hyde Sep 10 '16 at 18:34
  • (and once again I feel the need to mention, that despite these comments of mine, I did not downvote this answer at this time) – hyde Sep 10 '16 at 18:39
  • @hyde: Note the intent of this answer is to advise using higher level constructs (e.g. condition variables rather than flags), not to advise eschewing synchronization completely. It just completely flubs up in its criticism of using low level constructs. –  Sep 10 '16 at 18:40
  • You are wrong. Atomic is exactly what OP needs and `volatile` doesn't do what you seem to think it does. – Jesper Juhl Sep 10 '16 at 19:03
  • @Hurkyl Might be, but that's got nothing to do with what the question is asking. Also, I'd suggest, that condition variables and even mutexes are generally *wrong* level of abstraction for actual application code: If an atomic type isn't enough, then a higher level abstraction like a concurrent container or a blocking queue might be the right way. – hyde Sep 10 '16 at 19:19
  • In multi-threading, atomic is really needed, and volatile doesn't do what you seem to think it does. – Mohsen Zahraee Sep 11 '16 at 04:31
  • Atomic classes are also used in lock-free and wait-free data structures; this is NOT something useless and enables building extremely powerful constructs. The general criticism of atomic types at the beginning of the answer completely misses the mark. – Pyves Sep 11 '16 at 08:30