Results of doing += on a double from multiple threads

Question

Consider the following code:

void add(double& a, double b) {
    a += b;
}

which according to godbolt compiles on a Skylake to:

add(double&, double):
  vaddsd xmm0, xmm0, QWORD PTR [rdi]
  vmovsd QWORD PTR [rdi], xmm0
  ret

If I call add(a, 1.23) and add(a, 2.34) from different threads (for the same variable a), will a definitely end up as either a+1.23, a+2.34, or a+1.23+2.34?

That is, will one of these results definitely happen given this assembly, and a will not end up in some other state?

writing to a non atomic variable by multiple threads is undefined behavior unless you use synchronization. Any result you get will be "valid" — NathanOliver, Apr 24 '20 at 17:31
Usually, most processors can get interrupted after an instruction. So pretend that one thread is running the function and gets interrupted after the first instruction by another thread. What will happen? Search the internet for "C++ critical section" and "c++ thread safe code". — Thomas Matthews, Apr 24 '20 at 17:33
@NathanOliver But there is a resulting assembly. So C++ spec doesn't really matter here, isn't it so? — SomeWittyUsername, Apr 24 '20 at 17:36
Yeah, I specifically included the assembly because that's the behavior - it may be undefined from a C++ perspective, but this is the definition on this architecture with this GCC version, flags, etc — Thomas Johnson, Apr 24 '20 at 17:39
@ThomasJohnson That's actually not enough. If something is undefined per spec, it doesn't matter what **you think** how it should behave based on architecture, version, flags, etc. E.g., the C++ compiler theoretically might not recognize the case as valid and just spill some random output or crash. However, in your case, there is already a post-compilation assembly output. So I think now the behavior should be analyzed based on this assembly only — SomeWittyUsername, Apr 24 '20 at 17:43
@ThomasJohnson AFAIK, synchronization needs to go all the way down to the CPU. I've heard that CPU's can reorder instructions as long as they know they don't depend on each other. Without generating the thread safe code, you allow the CPU to clobber the variable and garbage is a valid result of that. — NathanOliver, Apr 24 '20 at 17:45
@NathanOliver That depends on the CPU architecture. And also I'm not sure that the result can be called garbage as long as it's deterministic — SomeWittyUsername, Apr 24 '20 at 17:47
If you don't care what the "C++" spec says, then you should remove the [C++] tag from your question. — Solomon Slow, Apr 24 '20 at 17:49
*will one of these results definitely happen given this assembly* -- You might as well stop right there. Anytime you're using assembly to justify whether behavior is defined or not, you're on the wrong path. — PaulMcKenzie, Apr 24 '20 at 17:59
@PaulMcKenzie I totally don't care whether the behavior is defined or not. I only care what will happen given this assembly — Thomas Johnson, Apr 24 '20 at 18:04
Are you doing something to stop this function from inlining? Otherwise the stand-alone asm definition isn't going to determine what happens if you actually put this in a C++ program. Other symptoms of UB can manifest like hoisting loads and sinking stores out of loops under the assumption of no data race UB. [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/q/58516052). See also [Atomic double floating point or SSE/AVX vector load/store on x86\_64](https://stackoverflow.com/q/45055402) — Peter Cordes, Apr 24 '20 at 21:39
You really need to quality all shared objects as volatile for "the mapping to asm" to even mean anything in your Q. W/o volatile/ 1) it's just UB 2) you can't reason about single C/C++ instr. W/ volatile 1) you can reason on individual instr 2) you get CPU semantics. — curiousguy, May 06 '20 at 00:46
@SolomonSlow The C++ spec doesn't define anything re: threads. You claim that he should leave out C++ is meaningless. — curiousguy, May 06 '20 at 00:47
@SomeWittyUsername "_If something is undefined per spec_" then ppl rely on common sense. Always. Period. — curiousguy, May 06 '20 at 00:48

higlu · Accepted Answer · 2020-04-25T10:25:04.930

3

Here is a relevant questions to me:

Does the CPU fetch the word you are dealing with in a single operation?

Some processor might allow memory access to a variable that happens to be not aligned in memory by doing two fetches one after the other - non atomically of course.

In that case, problems would arise if another thread interjects writing on that area of memory while the first thread had fetched already the first part of the word and then fetches the second part when the other thread has already modified the word.

thread 1 fetches first part of a XXXX
thread 1 fetches second part of a YYYY
thread 2 fetches first part of a XXXX
thread 1 increments double represented as XXXXYYYY that becomes ZZZZWWWW by adding b
thread 1 writes back in memory ZZZZ
thread 1 writes back in memory WWWW
thread 2 fetches second part of a that is now WWWW
thread 2 increments double represented as XXXXWWWW that becomes VVVVPPPP by adding b
thread 2 writes back in memory VVVV
thread 2 writes back in memory PPPP

For keeping it compact I used one character to represent 8 bits.

Now XXXXWWWW and VVVVPPPP are going to be representation of total different floating point values than the one you would have expected. That is because you ended up mixing two parts of two different binary representation (IEEE-754) of double variables.

Said that, I know that in certain ARM based architectures data access are not allowed (that would cause a trap to be generated), but I suspect that Intel processors do allow that instead.

Therefore, if your variable a is aligned, your result can be any of

a+1.23, a+2.34, a+1.23+2.34

if your variable might be mis-aligned (i.e. has got an address that is not a multiple of 8) your result can be any of

a+1.23, a+2.34, a+1.23+2.34 or a rubbish value

As a further note, please bear in mind that even if your environment alignof(double) == 8 that is not necessarily enough to conclude you are not going to have misalignment issues. All depends from where your particular variable comes from. Consider the following (or run it here):

#pragma push()
#pragma pack(1)
struct Packet
{
    unsigned char val1;
    unsigned char val2;
    double val3;
    unsigned char val4;
    unsigned char val5;
};
#pragma pop()


int main()
{
    static_assert(alignof(double) == 8);

    double d;
    add(d,1.23);       // your a parameter is aligned

    Packet p;
    add(p.val3,1.23);  // your a parameter is now NOT aligned

    return 0;
}

Therefore asserting alignof() doesn't necessarily guarantee your variable is aligned. If your variable is not involved in any packing then you should be OK.

Please allow me just a disclaimer for whoever else is reading this answer: using std::atomic<double> in these situations is the best compromise in term of implementation effort and performance to achieve thread safety. There are CPUs architectures that have special efficient instructions for dealing with atomic variables without injecting heavy fences. That might end up satisfying your performance requirements already.

edited Apr 25 '20 at 10:25

answered Apr 24 '20 at 19:28

higlu

433
3
9

Both x86-64 ABIs (x86-64 SystemV and Windows) have `alignof(double) = 8` so we can assume that `double& a` is aligned. Otherwise it would be undefined behaviour. In asm we can assume that it's an atomic 8-byte load, and separate 8-byte atomic store, unless the caller is hand-written asm that violates the ABI. But if it's 32-bit code then you could have a misaligned `double`; the 32-bit ABI is older and only has `alignof(double)=4`. And of course if this function inlines into its caller, the stand-alone asm definition is irrelevant. – Peter Cordes Apr 24 '20 at 21:41
the inlining perspective, although in general might be a good consideration to have in mind for the sake of relative (re)ordering of instructions, it does not make any difference in this particular case. – higlu Apr 25 '20 at 03:45
I was talking about inlining into *loops*, or other cases where it does matter. e.g. https://godbolt.org/z/oA4kd8 shows a case where optimization hoists a load out of a loop so it becomes either infinite or nothing. Or sinking a store out of a loop so it doesn't update the shared variable along the way. These are the usual consequences of data-race UB, more often being what breaks for newbies that fail to use `atomic` than actual lack of atomicity - [Multithreading program stuck in optimized mode but runs normally in -O0](https://stackoverflow.com/q/58516052) – Peter Cordes Apr 25 '20 at 04:08
Yes, I do agree with you that inlining, caching and pipeline re-ordering play a fundamental role when dealing with multithreading. However the question here was specific. In that scenario, even if the call was inlined, that wouldn't have played any role in the range of possible values assumed by the memory location at any given moment. They might have assumed different values at different moments in respect of the non-inlined alternative, but the values assumed would have still been any of the three above and only those three (in case of aligned mem location). – higlu Apr 25 '20 at 04:39
If a store gets sunk out of a loop, it's maybe possible that you'd still have the original `a` value in memory for a lot longer than you expected. Other than that, yeah I see your point. I was broadening the question to what else could go wrong to reinforce that it's much broader than just lack of atomicity. (Partly because of recent Q&As [like this one](//stackoverflow.com/questions/61324676/) where someone thought testing for lack of tearing (in debug mode) meant they didn't need `atomic`. SO has made me careful to be clear when something is still unsafe overall, for cases like this) – Peter Cordes Apr 25 '20 at 04:55
And BTW, caching is one of the few things that's *not* a problem: all real-world C++ implementations run on systems with coherent caches. Compile-time optimization of values into registers is a problem, of course, but to me "caching" means the effect of hardware CPU caches, not that. – Peter Cordes Apr 25 '20 at 04:57
So if I'm on a Linux system, I will have `alignof(double) == 8`, and so I will actually get only a+[1.23, 2.34, 1.23+2.34] – Thomas Johnson Apr 25 '20 at 06:48
Nice update! Interesting point about `struct Packet __attribute__((packed))` ; that will get the compiler to generate safe code for direct access to that `double` regardless of the ISA, I think. i.e. the compiler knows about the under-alignment of the double. But if you took a pointer to that double and passed it to another function, that would potentially be UB. ISO C says misaligned objects are UB, but ISO C doesn't have `packed` structs. So it becomes a question of how far an implementation goes to define the behaviour. – Peter Cordes Apr 25 '20 at 10:52
It's pretty definitely possible to create problems by packing a struct. e.g. I think that overrides `atomic`'s `alignas` so you could get tearing. Taking a pointer to an array of `double` inside a packed struct and passing it to another function could easily lead to a segfault *in practice* on x86, e.g. like in [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](//stackoverflow.com/q/47510783). And this blog: [GCC always assumes aligned pointer accesses](//trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/) – Peter Cordes Apr 25 '20 at 10:55

Results of doing += on a double from multiple threads

1 Answers1