108

I'm using Cygwin GCC and run this code:

#include <iostream>
#include <thread>
#include <vector>
using namespace std;

unsigned u = 0;

void foo()
{
    u++;
}

int main()
{
    vector<thread> threads;
    for(int i = 0; i < 1000; i++) {
        threads.push_back (thread (foo));
    }
    for (auto& t : threads) t.join();

    cout << u << endl;
    return 0;
}

Compiled with the line: g++ -Wall -fexceptions -g -std=c++14 -c main.cpp -o main.o.

It prints 1000, which is correct. However, I expected a smaller number because of threads overwriting a previously incremented value. Why does this code not suffer from a data race?

My test machine has 4 cores, and I put no restrictions on the program that I know of.

The behavior persists when I replace the body of the shared foo with something more complex, e.g.

if (u % 3 == 0) {
    u += 4;
} else {
    u -= 1;
}
mafu
  • Intel CPUs have some amazing internal "shoot down" logic to preserve compatibility with very early x86 CPUs used in SMP systems (like dual Pentium Pro machines). Lots of failure conditions that we are taught are possible almost never actually happen on x86 machines. So say a core goes to write `u` back to memory. The CPU will actually do amazing things like notice that the memory line for `u` isn't in the CPU's cache and it will restart the increment operation. This is why going from x86 to other architectures can be an eye opening experience! – David Schwartz Jan 23 '17 at 22:05
  • Maybe still too quick. You need to add code that makes the thread yield before it does anything, so that other threads get launched before it completes. – Rob K Jan 23 '17 at 22:10
  • As has been noted elsewhere, the thread code is so short that it may well be executed before the next thread is queued. How about 10 threads that place u++ in a 100-count loop, with a short delay before the start of the loop (or a global "go" flag to start them all at the same time)? – RufusVS Jan 24 '17 at 18:45
  • Actually, spawning the program repeatedly in a loop eventually shows that it breaks: something like `while true; do res=$(./a.out); if [[ $res != 1000 ]]; then echo $res; break; fi; done;` prints 999 or 998 on my system. – Daniel Kamil Kozar Jan 24 '17 at 18:50

5 Answers

265

foo() is so short that each thread probably finishes before the next one even gets spawned. If you add a sleep for a random time in foo() before the u++, you may start seeing what you expect.
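
For illustration, here is a minimal sketch of that change (the 0 to 10 ms delay range and the RNG setup are my own choices, not part of the original answer):

#include <chrono>
#include <random>
#include <thread>

unsigned u = 0;

void foo()
{
    // Sleep a random 0 to 10 ms before incrementing, so the threads'
    // read-modify-write windows actually overlap in time. thread_local
    // gives each thread its own generator.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> delay(0, 10);
    std::this_thread::sleep_for(std::chrono::milliseconds(delay(rng)));
    u++;  // still a data race; the delay only makes it observable
}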

Rob K
  • This indeed changed the output in the expected way. – mafu Jan 23 '17 at 22:33
  • I would note that this is in general a rather good strategy for exhibiting race conditions. You should be able to inject a pause between any two operations; if not, there's a race condition. – Matthieu M. Jan 24 '17 at 08:01
  • We had just this issue with C# recently. The code almost never failed, but the recent addition of an API call in between introduced enough delay to make it fail consistently. – Obsidian Phoenix Jan 25 '17 at 14:01
  • @MatthieuM. Doesn't Microsoft have an automated tool that does exactly that, as a method of both detecting race conditions and making them reliably reproducible? – Mason Wheeler Jan 25 '17 at 18:51
  • @MasonWheeler: I work nigh exclusively on Linux, so... dunno :( – Matthieu M. Jan 25 '17 at 19:02
  • @MasonWheeler Yes, they called it CHESS. It could be implemented on Linux, but they had no fiscal reason to do so. It's actually a brilliant little algorithm. – Cort Ammon Jan 25 '17 at 22:30
  • I don't know if this is a record for SO, but this is one of the most upvoted answers I've seen in proportion to the answer's brevity (I've got no problem with it -- yeah I upvoted :) ) – PaulMcKenzie Jan 27 '17 at 17:48
59

It is important to understand that a race condition does not guarantee the code will run incorrectly, merely that it could do anything, because a data race is undefined behavior. That includes running as expected.

In particular, on x86 and AMD64 machines, race conditions in some cases rarely cause issues, because many of the instructions are atomic and the coherency guarantees are very high. These guarantees are somewhat reduced on multi-processor systems, where the LOCK prefix is needed to make many instructions atomic.

If the increment is an atomic operation on your machine, it will likely run correctly, even though it is undefined behavior according to the language standard.

Specifically, I expect that in this case the code is compiled to an atomic fetch-and-add instruction (ADD or XADD in x86 assembly), which is indeed atomic on single-processor systems. On multi-processor systems, however, it is not guaranteed to be atomic, and a LOCK prefix would be required to make it so. If you are running on a multi-processor system, there is a window in which the threads could interfere and produce incorrect results.

Specifically, I compiled your code to assembly using https://godbolt.org/, and foo() compiles to:

foo():
        add     DWORD PTR u[rip], 1
        ret

This means it performs a single add instruction, which will be atomic on a single-processor system (though, as mentioned above, not on a multi-processor system).
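
For comparison, a minimal sketch (my addition, not part of the original answer) of the well-defined way to write this in C++11 and later, using std::atomic:

#include <atomic>

std::atomic<unsigned> u{0};

void foo()
{
    // Atomic read-modify-write: on x86 this compiles to a LOCK-prefixed add,
    // which is atomic even across multiple processors.
    u.fetch_add(1, std::memory_order_relaxed);
}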

Vality
  • But only when compiled with `-O3`. OP could have told us how he compiled it. – Peter VARGA Jan 23 '17 at 22:28
  • @AlBundy It also produces the same assembly for `-O2` and `-O1`, only `-O0` causes it to produce completely different (and ironically much more confusing) output. – Vality Jan 23 '17 at 22:29
  • It's important to remember that "running as intended" is a permissible outcome of undefined behavior. – Mark Jan 23 '17 at 23:34
  • As you indicated, this instruction is not *atomic* on an SMP machine (which all modern systems are). Even `inc [u]` is not atomic. The `LOCK` prefix is required to make an instruction truly atomic. The OP is simply getting lucky. Recall that even though you're telling the CPU "add 1 to the word at this address", the CPU still has to fetch, increment, store that value and another CPU can do the same thing simultaneously, causing the result to be incorrect. – Jonathon Reinhart Jan 24 '17 at 01:09
  • I down-voted, but then I re-read your question and realized that your atomicity statements were assuming a single CPU. If you edit your question to make this more clear (when you say "atomic", be clear that this is only the case on a single CPU), then I will be able to remove my down-vote. – Jonathon Reinhart Jan 24 '17 at 01:10
  • @JonathonReinhart I have made an attempt to make my intent clearer regarding multi processor systems and the reduced coherency guarantees for add there. Please let me know if that is sufficiently clear, and if not feel free to edit further to make it so. – Vality Jan 24 '17 at 01:20
  • Downvoted, I find this claim a bit meh: *"Particularly on X86 and AMD64 machines race conditions in some cases rarely cause issues as many of the instructions are atomic and the coherency guarantees are very high."* The paragraph should start by making the explicit assumption that you're focusing on a single core. Even so, multi-core architectures are the de facto standard nowadays in consumer devices, so I would consider this a corner case to explain last, rather than first. – Patrick Trentin Jan 24 '17 at 06:49
  • Hmm, no. This is wrong. `add DWORD PTR u[rip], 1` or **any instruction with a memory operand is *not* atomic on x86**. `add rax, 1` (or any instruction with a register operand) would be atomic, but when you have a memory operand, the instruction is decomposed into ~3 micro-ops (a load, an add, and a store in this case), which means it is not atomic. I guess you kind of try and weasel out of this by saying it is only atomic on a single-CPU system, but that doesn't make much sense. It's still *not atomic*, by definition, and with multiple threads, you're still hosed without a `lock` prefix. – Cody Gray - on strike Jan 24 '17 at 07:10
  • @CodyGray I certainly take your point on multi-processor systems being common now, but I nonetheless believe this is indeed atomic on a single processor. Even though this gets broken down into multiple micro-ops, the CPU is not preemptable between their execution, meaning another thread will never take control. – Vality Jan 24 '17 at 07:16
  • I'm not the one who said that multi-processor systems are common now, that was Patrick. Even on a single-processor system, instructions with a memory destination (other than simple stores) are *not* atomic, and you will run into race conditions if you have multiple threads. If you have a single-processor system *and* a single-threaded process, then the entire concept of atomicity is meaningless; every operation is trivially atomic. – Cody Gray - on strike Jan 24 '17 at 07:18
  • @PatrickTrentin I certainly accept your reasoning and you are welcome to -1 if you feel that way. Though mind I did mention explicitly several times in the post that I was considering a single processor system. And stated that the instruction lacks atomicity on a multi processor board. Still you make a fair point. – Vality Jan 24 '17 at 07:19
  • @CodyGray I think you misunderstand me. Even if there are multiple threads on a single-processor system, for one thread to mess up the operation of another, it has to preempt that thread. This is only possible at certain points, and not during the execution of an instruction. (Assuming I haven't badly misunderstood x86.) – Vality Jan 24 '17 at 07:21
  • Oh, right. I see what you're saying. I replied to your last comment without seeing your edit. Yes, I suppose that's true; execution can't be pre-empted in the middle of these micro-ops. Still, I think the answer is misleading. This is a relatively tricky area that many programmers misunderstand. – Cody Gray - on strike Jan 24 '17 at 07:22
  • @CodyGray really though, thank you for the commentary; I take your point that this might be misleading to some programmers. Truthfully, I am actually an embedded developer, mostly working on small single-core 80386, POWER and ARM boards, so that is where my area of detailed analysis and expertise mostly lies. Specifically, my comments regarding x86 having the most coherency were in comparison to RISC architectures, where being preempted during the execution of what would be a single instruction on a CISC CPU is both a real issue and a common occurrence. In my experience, therefore, x86 tends to be much more... – Vality Jan 24 '17 at 07:31
  • ...likely to "hide" these problems by working most or all of the time, even in broken code, while RISC CPUs will fail or act erratically frequently. I will try to make another edit when I have time to go into more detail on all this. – Vality Jan 24 '17 at 07:31
  • Oh, definitely. x86 has tons of backwards-compatibility…stuff to make sure that incorrectly-written code works to the extent possible. It was a really big deal when the Pentium Pro introduced out-of-order execution. Intel wanted to make sure that the installed base of code worked *without* needing to be recompiled specifically for their new chip. x86 started out as a CISC core, but has internally evolved into a RISC core, although it still presents and behaves in many ways as CISC from a programmer's perspective. For more, see Peter Cordes's answer [here](http://stackoverflow.com/q/39393850). – Cody Gray - on strike Jan 24 '17 at 07:35
  • @Vality sure you repeat it several times, but that's only starting from the third paragraph. Such an important detail should be the opening statement of your answer, not hidden half-way in, because it makes the first half confusing for no good reason whatsoever. – Patrick Trentin Jan 24 '17 at 08:20
  • Just a side note that some people may find relevant. Windows (the most common operating system on x86) by default favors sending instructions from each thread to the core on which its parent thread is operating, unless explicitly told not to. So even on a multicore system, if it's running Windows, race conditions are likely to be avoided assuming default configurations. – kingfrito_5005 Jan 24 '17 at 21:58
  • @PatrickTrentin I added another comment to the second paragraph to that effect. I didn't want to edit the first, as that was a very general statement which was not at all architecture-specific. I hope that's OK. – Vality Jan 25 '17 at 00:32
  • @CodyGray: Actually, `inc [mem]` on a single *core* machine is atomic *with respect to other threads*. An interrupt + context-switch can happen either before or after the instruction, but not in the middle. On a UP machine, you only need `lock inc [mem]` to make it atomic for a DMA or MMIO observer. Of course, compilers are free to compile `num++` into more than one instruction, so you need to use C++11 `fetch_add`! Fun fact: [Linux booting on a uniprocessor system will patch the `lock` prefixes to `nop`](https://lwn.net/Articles/164121/), except in MMIO functions. – Peter Cordes Jul 12 '17 at 21:29
  • Agreed that this answer is of little value on this question, though, since the OP is clear that they have a quad-core CPU. Anyway, this atomic-on-single-thread issue has been explained better in other answers, e.g. [supercat's on the atomic `num++` question](https://stackoverflow.com/a/39396781/224132) – Peter Cordes Jul 12 '17 at 21:29
20

I think it matters less whether you put a sleep before or after the u++. Rather, the operation u++ translates to code that is, compared to the overhead of spawning the threads that call foo, performed very quickly, so it is unlikely to be interleaved. However, if you "prolong" the operation u++, the race condition becomes much more likely:

void foo()
{
    unsigned i = u;                   // read the shared counter
    for (int s = 0; s < 10000; s++);  // busy-wait: widen the gap between read and write
    u = i + 1;                        // write back the (possibly stale) value, plus one
}

result: 694


BTW: I also tried

if (u % 2) {
    u += 2;
} else {
    u -= 1;
}

and it gave me 1997 most of the time, but sometimes 1995.

Stephan Lechner
  • I would expect that on any vaguely sane compiler the whole function would be optimized to the same thing. I am surprised it wasn't. Thank you for the interesting result. – Vality Jan 23 '17 at 22:31
  • This is exactly correct. Many thousands of instructions need to run before the next thread starts executing the tiny function in question. When you make the execution time in the function closer to the thread creation overhead, you see the impact of the race condition. – Jonathon Reinhart Jan 24 '17 at 01:05
  • @Vality: I also expected it to delete the spurious for-loop under O3 optimization. It doesn't? – user21820 Jan 24 '17 at 08:23
  • How could `else u -= 1` ever be executed? Even in a parallel environment, the value should never fail the `%2` test, should it? – mafu Jan 24 '17 at 14:51
  • From the output, it looks like `else u -= 1` is executed once, the first time foo() is called, when u == 0. The remaining 999 times u is odd and `u += 2` is executed, resulting in u = -1 + 999 * 2 = 1997, i.e. the correct output. A race condition sometimes causes one of the +=2 to be overwritten by a parallel thread, and you get 1995. – Luke Jan 24 '17 at 18:08
7

It does suffer from a race condition. Put usleep(1000); before u++; in foo and I see different output (< 1000) each time.
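
A minimal sketch of that change (my illustration; usleep is POSIX and is available under Cygwin):

#include <unistd.h>  // for usleep

unsigned u = 0;

void foo()
{
    usleep(1000);  // sleep about 1 ms so the threads' increments overlap in time
    u++;           // the unsynchronized read-modify-write: the data race
}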

juf
6
  1. The likely answer to why the race condition didn't manifest for you, though it does exist, is that foo() is so fast, compared to the time it takes to start a thread, that each thread finishes before the next can even start. But...

  2. Even with your original version, the result varies by system: I tried it your way on a (quad-core) MacBook, and in ten runs, I got 1000 three times, 999 six times, and 998 once. So the race is somewhat rare, but clearly present.

  3. You compiled with '-g', which has a way of making bugs disappear. I recompiled your code, still unchanged but without the '-g', and the race became much more pronounced: I got 1000 once, 999 three times, 998 twice, 997 twice, 996 once, and 992 once.

  4. Re. the suggestion of adding a sleep: that helps, but (a) a fixed sleep time leaves the threads still skewed by start time (subject to timer resolution), and (b) a random sleep spreads them out when what we want is to pull them closer together. Instead, I'd code them to wait for a start signal, so I can create them all before letting them get to work. With this version (with or without '-g'), I get results all over the place, as low as 974, and no higher than 998:

    #include <iostream>
    #include <thread>
    #include <vector>
    using namespace std;
    
    unsigned u = 0;
    bool start = false;   // note: this flag is itself shared without synchronization; std::atomic<bool> would avoid that data race
    
    void foo()
    {
        while (!start) {
            std::this_thread::yield();
        }
        u++;
    }
    
    int main()
    {
        vector<thread> threads;
        for(int i = 0; i < 1000; i++) {
            threads.push_back (thread (foo));
        }
        start = true;
        for (auto& t : threads) t.join();
    
        cout << u << endl;
        return 0;
    }
    
dgould
  • Just a note. The `-g` flag does not in any way "make bugs disappear." The `-g` flag on both GNU and Clang compilers simply adds debug symbols to the compiled binary. This allows you to run diagnostic tools like GDB and Memcheck on your programs with some human readable output. For example when Memcheck is run over a program with a memory leak it will not tell you the line number unless the program was built using the `-g` flag. – MS-DDOS Jan 26 '17 at 00:50
  • Granted, bugs hiding from the debugger is usually more a matter of compiler optimization; I should have tried, and said, "using `-O2` _instead_ of `-g`". But that said, if you've never had the joy of hunting a bug that would manifest _only_ when compiled _without_ `-g`, consider yourself fortunate. It _can_ happen, with some of the very nastiest of subtle aliasing bugs. I _have_ seen it, though not recently, and I could believe maybe it was a quirk of an old proprietary compiler, so I'll believe you, provisionally, about modern versions of GNU and Clang. – dgould Jan 27 '17 at 19:35
  • `-g` doesn't stop you from using optimizations. e.g. `gcc -O3 -g` makes the same asm as `gcc -O3`, but with debug metadata. gdb will say "optimized out" if you try to print some variables, though. `-g` could maybe change the relative locations of some things in memory, if any of the stuff it adds is part of the `.text` section. It definitely takes space in the object file, but I think after linking it all ends up at one end of the text segment (not section), or not part of a segment at all. Maybe could affect where things are mapped for dynamic libraries. – Peter Cordes Jul 12 '17 at 21:41