Race Condition with writing same value in C++?

Question

Is there any issue with having a race condition in your code when the operation is writing a single constant value? For example, suppose a parallel loop populates a seen array for every value that appears in another array arr (assuming no out-of-bounds indices). The critical section could be the code below:

//parallel body with index i
int val = arr[i];
seen[val] = true;

Since the only value being written is true, does that make a mutex unnecessary, and possibly detrimental to performance? Even if threads stomp on each other, they would just be filling in the address with the same value, correct?

Yes. If you don't introduce a memory barrier, then one "thread" may not see that the other "thread" has altered the value (as the value is cached in memory local to the processor, which is not visible to the other processor). So proc1 may have updated the value, but proc2 may never see the update and continue processing what it's doing without stopping. So the problem is not a race condition but a synchronization issue. — Martin York, Sep 17 '18 at 16:08
If `seen` is a `vector<bool>`, there are problems as per https://stackoverflow.com/questions/33617421/write-concurrently-vectorbool. In terms of guarantees, you could argue a very silly compiler would separate the write into store 0 then increment 1... which would also cause a race condition. — Lawrence, Sep 17 '18 at 17:22

score 7 · Accepted Answer · answered Sep 17 '18 at 18:00

The C++ memory model does not give you a free pass for writing the same value.

If two threads write to the same non-atomic object without synchronization, that is simply a race condition (a data race, in the standard's terms). And a race condition means your program executes undefined behavior. And undefined behavior occurring anywhere in your program's execution means that the behavior of your program, both before and after the point of undefined behavior, is not restricted by the C++ standard in any way.

A given compiler is free to provide a more permissive memory model. I'm unaware of any that do.

One thing you must understand is that C++ is not an assembler macro language. It doesn't have to produce the naive assembler you imagine in your head. C++ instead tries to make it easy for your compiler to produce assembler, which is a very different thing.

Compilers can and do determine "if X happens, we get undefined behavior; so I'll optimize around the fact that X does not happen" when generating code. In this case, the compiler can prove that no program with defined behavior could ever have the same val in two different unsynchronized threads.

All of this can happen long before any assembly is generated.

And at the assembly level, some hardware might do funny things with unaligned assignment to multi-byte values. Some hardware could (in theory; I'm unaware of any in practice) raise traps when instructions that claim to be single-thread writes occur in two different cores on the same bytes.

So this is UB in C++. And once you have UB, you have to audit the assembly your compiler produces everywhere that the compiler touching this code can see. If you do LTO, that means your entire program, or at least everything that calls or interacts with the code that does UB, to an unclear distance.

Just write defined behavior. And only if this turns out to be a mission-critical performance bottleneck should you spend more effort on optimizing it (first try faster defined behavior, and only if that fails do you even consider UB).
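One way to write the defined-behavior version of the loop from the question is to make each element a `std::atomic<bool>` and store with relaxed ordering. This is a minimal sketch, not a tuned implementation: the function name `mark_seen`, the strided work split, and the use of `std::thread` are assumptions made for illustration, and every value in `arr` is assumed to be a valid index.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: sets seen[v] to true for every v in arr, using
// num_threads worker threads. Assumes 0 <= v < seen_size for every v.
std::vector<std::atomic<bool>> mark_seen(const std::vector<int>& arr,
                                         std::size_t seen_size,
                                         unsigned num_threads) {
    std::vector<std::atomic<bool>> seen(seen_size);
    // Set every flag to false explicitly before any worker starts.
    for (auto& flag : seen) flag.store(false, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&arr, &seen, t, num_threads] {
            // Thread t handles indices t, t + num_threads, ...
            for (std::size_t i = t; i < arr.size(); i += num_threads) {
                // Concurrent atomic stores to the same element are not a
                // data race, even when several threads write the same value.
                seen[static_cast<std::size_t>(arr[i])].store(
                    true, std::memory_order_relaxed);
            }
        });
    }
    // join() synchronizes with each worker, so the stores are visible
    // to the caller after this loop.
    for (auto& w : workers) w.join();
    return seen;
}
```

Relaxed ordering is enough here only because nothing reads `seen` until the workers have been joined. If another thread polled the flags while the loop was running, stronger ordering might be needed.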

score -1 · Answer 2 · answered Sep 17 '18 at 16:18

There may be an architecture-dependent constraint requiring your seen array elements to be separated by a certain amount to prevent competing threads from destroying values that collided in the same machine word (or cache row, even).

That is, if seen is defined as bool seen[N]; then seen is N bytes long and each element is directly adjacent to its neighbor. If one thread changes element 0 and another thread changes element 2, both of these changes occur in the same 64-bit machine word. If these two changes are made concurrently by different cores (or even on different CPUs of a multi-CPU system), they will attempt to resolve the collision as an entire 64-bit machine word (or larger in some cases). The result of this will be that one of the true values that was written will be turned back to its previous state (probably false) by the winning thread's update to a neighboring element.

If instead you define seen as an array of structs, each of which is as large as a cache row, then you may have competing threads mash a bool value within that struct... but this is risky because not all CPUs will share the same cache collision validation strategies, row sizes, and the like... and inevitably, there will be a CPU that it will fail on.
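A minimal sketch of the padded layout described above, assuming a 64-byte cache line. The names `kAssumedCacheLine` and `PaddedFlag` are hypothetical, and the real line size depends on the target CPU. The flag is a `std::atomic<bool>` here because padding by itself does not remove the data race under the C++ memory model. On cache-coherent hardware the padding mainly affects performance (false sharing) rather than correctness.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64 bytes is a common cache-line size, not a guarantee.
constexpr std::size_t kAssumedCacheLine = 64;

// Each flag is aligned to, and therefore padded out to, one assumed
// cache line, so neighboring array elements never share a line.
struct alignas(kAssumedCacheLine) PaddedFlag {
    std::atomic<bool> value{false};
};

static_assert(sizeof(PaddedFlag) == kAssumedCacheLine,
              "each flag should occupy exactly one assumed cache line");
static_assert(alignof(PaddedFlag) == kAssumedCacheLine,
              "each flag should start on an assumed cache-line boundary");
```

The cost is memory: with this layout an array of N flags takes N * 64 bytes instead of N bytes.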

I don't think this is correct. Are you saying that if I declare `bool a,b;` and then modify `a` in one thread and `b` in another (without any synchronization) then the compiler has to allocate them in different cache lines? — Martin Bonner supports Monica, Sep 17 '18 at 16:39
Sorry... you are correct for most general-purpose processors... non-coherent-cache multi-core processor designs are becoming more common in an effort to increase the number of cores... but for coherent caches, my entire answer is incorrect. — Brian Ellis, Sep 18 '18 at 00:54
