
Why does this kind of code crash (very occasionally!) on x64?

class Object
{
   public:
   void manipulate(); // reads and writes object
   bool m_finished = false; // note this is a regular bool and NOT atomic
};

Thread 1

object->manipulate();
object->m_finished = true;
// no further access of object

Thread 2

if (object->m_finished)
   delete object;

In two separate applications I saw crashes due to this pattern (this is not the exact code, just an example of a common pattern).

The crash takes the form of heap corruption where it looks very much like the memory occupied by 'object' was modified after it was freed. It's as if Thread 1 did:

object->m_finished = true;
object->manipulate();

and the object got deleted between the setting of m_finished and manipulate().

In the first app, the fix was to put a mutex around both the setting and the testing of m_finished (actually it already had a mutex around the setting part). It involved an OS bug in aio_error and is described in this question I asked: https://stackoverflow.com/questions/75177068/can-aio-error-be-used-to-poll-for-completion-of-aio-write

The second time I saw this, it was fixed by changing m_finished to a std::atomic&lt;bool&gt;.
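In outline the fix was along these lines (again, a sketch of the pattern rather than the exact code; I'm showing the default seq_cst ordering here just for illustration):

#include <atomic>

class Object
{
   public:
   void manipulate();                    // reads and writes object
   std::atomic<bool> m_finished{false};  // now atomic
};

Thread 1

object->manipulate();
object->m_finished = true;   // atomic store (seq_cst by default)
// no further access of object

Thread 2

if (object->m_finished)      // atomic load (seq_cst by default)
   delete object;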

In both cases it was an extremely rare heisencrash that would disappear with any attempt to use valgrind, ASAN etc. My assumption is that it was due to the CPU re-ordering the writes in object->manipulate() with respect to the delete object in the other thread.

Note that I am 95% sure there was no compiler-level re-ordering going on, as the various methods involved were virtual functions etc.

I get that it's good to use atomics when doing inter-thread signalling (and I promise I will forever more). As far as I understand, x86 (Intel and AMD; we run AMD) does not allow store-store re-ordering. So if the deleting thread witnesses m_finished == true, all the preceding stores performed in object->manipulate() should already be committed to memory, right?

I understand that in general the behaviour of this code is undefined. But specifically, why, on x64, does it appear that a mutex or atomic is required for m_finished when the CPU specs guarantee ordering of stores and loads?

  • I don't think there is any guarantee that just because `m_finished` is observed to be true, that other/previous non-atomic/non-synchronized updates will also be observed. Without proper synchronization, it's the wild west in there as far as cross-thread behavior goes :) – Jeremy Friesner Jun 08 '23 at 22:24
  • I assume that m_finished is properly initialized in code that is not shown here? – Mikel F Jun 08 '23 at 22:31
  • It's not valid C++ code, `delete object` w/o `;` here and at other lines, it expects `object` to be a pointer, but here at `object.m_finished` it is not a pointer. Be kind to post a [mcve]. – 273K Jun 08 '23 at 22:39
  • https://stackoverflow.com/questions/50307693/does-an-x86-cpu-reorder-instructions – Paul Sanders Jun 08 '23 at 22:48
  • So _note this is a regular bool and NOT atomic_ answers your own question – Paul Sanders Jun 08 '23 at 22:49
  • I assume that thread 2 exits after `delete object`? Because if it loops around and reads `object->m_finished` again, that's undefined behavior and gives the compiler permission to delete big chunks of code. Which it could also do if it can prove there's a race condition. – Ben Voigt Jun 08 '23 at 22:57
  • The shown code is unsafe because it is undefined behavior, due to a complete lack of thread synchronization. – Sam Varshavchik Jun 08 '23 at 23:36
  • Maybe access to `object.manipulate` involves a load operation? and then object.manipulate also does a load? those loads can be re-ordered and can cause crash if object is deleted. – A. K. Jun 09 '23 at 00:07
  • Why would you not use `std::atomic` with `.store(true, std::memory_order_release)` and `.load(acquire)`? If you're right that the compiler didn't reorder anything at compile time, that will compile to the same x86 asm you're already getting, but with well-defined behaviour according to ISO C. (x86's memory model does release / acquire for free, so indeed, if there was no compile-time reordering, the code you've shown would happen to work on x86, but not elsewhere.) – Peter Cordes Jun 09 '23 at 00:09
  • See https://lwn.net/Articles/793253/ for more about the pitfalls of using non-atomic and crossing your fingers. The Linux kernel uses `volatile` for its READ_ONCE and WRITE_ONCE macros, but other code should use `std::atomic` – Peter Cordes Jun 09 '23 at 00:09
  • The accesses and modifications of `object.m_finished` are unsynchronised, since it is not an atomic variable and there is no other synchronisation (e.g. appropriate use of a mutex) performed by the threads. So your code has undefined behaviour. This allows the compiler complete freedom as to how it orders operations such as accessing `object.m_finished`, modifying it, calling actions by `object.manipulate()`, or [if that function is inlined] any operations performed by `object.manipulate()`. The solution is to make variables atomic or use appropriate synchronisation. – Peter Jun 09 '23 at 00:10
  • There's Thread Sanitizer that allegedly catches data races. Not sure how practical it is. – HolyBlackCat Jun 09 '23 at 11:22
  • See the case in https://stackoverflow.com/questions/75177068/can-aio-error-be-used-to-poll-for-completion-of-aio-write. In this case "m_finished" was being set inside the aio thread under mutex lock, and checking "m_finished" was done in aio_error in libc, which couldn't possibly be inlined. So there is no possibility for the compiler to reorder either the setting or the checking of the flag. Yet adding a mutex lock around the check (a bugfix in newer libc) fixed it. I spent ages looking and the aio code and still don't understand why that extra mutex lock was required. – Dave Poston Jun 10 '23 at 18:07

3 Answers

2

On modern processors there are multiple cores, and each core has its own L1 cache. When memory is written and then reread, this can happen well before the data makes its way down to main memory; that is the main point of caches.

When memory is shared between cores, a write by one core may not be seen by another core until later. This can effectively make reads and writes appear reordered between cores, which is exactly the problem you describe.

Therefore, when sharing memory between cores (threads), one needs to add memory barriers to ensure that the state of the two caches is the same. This is what std::atomic does.

ARM has a very relaxed approach, meaning that memory barriers are needed in all scenarios. x86/amd64 makes stronger guarantees, but still has the LOCK instruction prefix and the MFENCE instruction to deal with some pathological cases like the one you observed.
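As a sketch of how that maps onto the pattern in the question (my example, assuming m_finished has been made a std::atomic<bool>), release/acquire ordering on the flag is enough:

#include <atomic>

// Thread 1 (writer)
object->manipulate();                                       // plain writes to the object
object->m_finished.store(true, std::memory_order_release);  // publishes those writes

// Thread 2 (deleter)
if (object->m_finished.load(std::memory_order_acquire))     // sees the flag...
    delete object;                                          // ...and therefore everything written before it

On x86/amd64 both of these compile to ordinary mov instructions, so the ordering mainly constrains the compiler; on ARM the same source code also emits the barriers the hardware needs.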

EDIT: If you take a look at the spec linked below, it seems amd64 does a good job of enforcing proper ordering except in one case: loads may be reordered with older stores to a different location.

https://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf

doron
  • All true, but x64 says that stores are never re-ordered (at least from the point of view of reads from other cores) with respect to each other. So how does this break on x64? – Dave Poston Jun 09 '23 at 15:59
  • @DavePoston: This answer seems to be based on a misconception about cache coherence ([caches use MESI to maintain coherency](https://stackoverflow.com/a/58535118/224132)); it's stuff inside CPU cores like the store buffer and hit-under-miss loads that introduces memory reordering. Barriers aren't needed for eventual visibility. Write then read on the same core would be StoreLoad (re)reordering, which isn't a problem for acq/rel. – Peter Cordes Jun 09 '23 at 19:51
  • @DavePoston: x86's memory model does give acquire / release for free, just requiring compile-time ordering (which your code fails to ask for; use `atomic` with `memory_order_acquire`/`release` like I commented under the question. If that works but non-atomic didn't, then the compiler *was* reordering something.) – Peter Cordes Jun 09 '23 at 19:52
  • `lock` prefixes are needed for atomicity of RMW instructions, regardless of ordering requirements. As well as for seq_cst stores (like `xchg [mem], eax` to store with a full barrier), but not for release stores. This code only needs acquire/release synchronization so it doesn't need a full barrier for anything, and I'd hardly call C++'s default memory order "pathological". ("unnecessarily strong for most use cases" I'd agree with, though.) – Peter Cordes Jun 09 '23 at 19:56
  • Make some corrections. The one case where intel allows reordering is: loads may be reordered with older stores to a different location – doron Jun 12 '23 at 07:32
  • That's correct, x86's memory model is program order + a store-buffer with store forwarding. So only StoreLoad reordering is possible (plus store-forwarding effects if you reload your own recent stores). acq/rel is free. To recover sequential consistency, yes, you can use `lock`ed instructions as a full barrier, or the usually-slower `mfence`. Compilers use `xchg` for SC stores rather than `mov`+`mfence`. ([Why does a std::atomic store with sequential consistency use XCHG?](https://stackoverflow.com/q/49107683)). thread_fence(sc) normally compiles to `lock add byte [rsp], 0` – Peter Cordes Jun 12 '23 at 07:42
  • Anyway, acq/rel is strong enough to make this safe in C++ (unless I'm missing something), and x86 can do that without any special instructions, just compile-time ordering. So it seems the @doron has disproved their assumption that their compiler won't reorder operations at compile time in a way that's problematic. `atomic` with `release` and `acquire` will compile to the asm they already think they're getting. They say `atomic` makes the problem go away, but haven't said if that was with the default `seq_cst` ordering, or if it was with acq/rel. – Peter Cordes Jun 12 '23 at 07:45
  • But anyway, this answer doesn't answer the question (or is an incorrect answer for this specific case). Runtime memory reordering on x86 doesn't explain the problem. It totally would on ARM or any other weakly-ordered ISA, of course, where acq/rel isn't free. – Peter Cordes Jun 12 '23 at 07:48
  • Take a look at: https://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/ for a good write up on this – doron Jun 12 '23 at 08:32
2

I think the most probable reason for the crash in your code is compiler optimization. Does your code crash in builds without optimization (e.g. with -O0)?

For example, let's look at the code below on godbolt.

#include <atomic>

class Object
{
public:
   std::atomic<bool> m_finished_atomic = false;
   bool m_finished_non_atomic = false;
};

void delete_without_sync(Object* ob){
    while (true){
        if (ob->m_finished_non_atomic){
            delete ob;
            break;
        }
    }
}

void delete_with_sync(Object* ob){
    while (true){
        if (ob->m_finished_atomic.load(std::memory_order::relaxed)){
            delete ob;
            break;
        }
    }
}

As you can see, both functions do the same thing, and std::memory_order::relaxed requires nothing extra from an x86_64 CPU. But to the compiler these functions are VERY different. Let's look at the assembly:

delete_without_sync(Object*):        # @delete_without_sync(Object*)
        jmp     operator delete(void*)@PLT                      # TAILCALL
delete_with_sync(Object*):           # @delete_with_sync(Object*)
.LBB1_1:                                # =>This Inner Loop Header: Depth=1
        movzx   eax, byte ptr [rdi]
        test    al, 1
        je      .LBB1_1
        jmp     operator delete(void*)@PLT                      # TAILCALL

As you can see, delete_without_sync was optimized by the compiler into an immediate delete of our struct without any waiting, while delete_with_sync repeatedly tests the flag and deletes the object only when it is true. So why?

From the compiler's point of view, in delete_without_sync the field m_finished_non_atomic cannot change: since there is no synchronization at all, the compiler assumes it cannot legally be modified by another thread. So the compiler converts the code to something like this:

void delete_without_sync(Object* ob){
    if (ob->m_finished_non_atomic) {
       delete ob;
       return;
    }
    while (true) {}
}

Then it sees that there is an infinite loop without any side effects. Such loops are undefined behaviour, so the compiler may assume they are never reached. Therefore the field m_finished_non_atomic must have been true, and the compiler removes even the read of it.

And now we have a use-after-free because of the data race.

In delete_with_sync the compiler cannot assume that the variable never changes, so it generates the "honest" loop we asked for.

In a large project there can be a lot happening around this code that is hard to keep track of. Maybe there was some loop waiting for the boolean flag to change, or some other undefined behaviour occurred because the compiler assumed that the variable cannot change. A person finds this much harder to track than a program like the compiler does.

Another thing that could have happened is that the compiler simply moved object->m_finished = true before the manipulate() call, because it noticed that manipulate() never reads or writes that field.
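As an illustration (my sketch, not verified compiler output), this is the difference that forbids such a move:

// With a plain bool, the compiler may hoist the store above the call
// if it can prove manipulate() never touches m_finished:
object->manipulate();
object->m_finished = true;   // may legally be moved before manipulate()

// With a release store on an atomic flag, all writes made inside
// manipulate() must become visible before the flag does, so the
// store cannot be hoisted above the call:
object->manipulate();
object->m_finished.store(true, std::memory_order_release);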

In conclusion: I think it is compiler optimization that triggers this behaviour here, not the CPU itself. This is a reason why you should always use synchronization at some point: it prevents such miscompilations. The C++ memory model is far more relaxed than the CPU's exactly to keep such optimizations available.

  • Whilst I am sure this is not the cause of the issues I've seen - because in my code, the 'while' loop is external to the function that does the 'delete' - this is a very interesting result. upvoted. – Dave Poston Jul 24 '23 at 10:38
  • @DavePoston Are you sure that the compiler cannot inline that loop? All C++ compilers aggressively inline functions because it allows other optimizations. – Angelicos Phosphoros Jul 24 '23 at 21:42
  • Yes, see my first case in https://stackoverflow.com/questions/75177068/can-aio-error-be-used-to-poll-for-completion-of-aio-write. The methods that manipulated the object and checked for completion were in the libc DLL so definitely couldn't be inlined! Also I just used a while loop to simplify the example code - in both my cases it's actually a polling timer that releases the object, where external code calls a callback, that then checks if the object can be freed. But in my second case there could still have been some inlining and I need to check what it was doing, it's a good point – Dave Poston Jul 26 '23 at 11:38
0

It seems like you tried using a mutex but maybe put it in the wrong spots.

#include <mutex>

class Object {
public:
    void manipulate(); // reads and writes object
    bool m_finished = false; // note this is a regular bool and NOT atomic
    std::mutex m_mutex; // add a mutex for synchronization
};

// Thread 1
object->manipulate();
{
    std::lock_guard<std::mutex> lock(object->m_mutex);
    object->m_finished = true;
}
// no further access of object

// Thread 2
unknownFunction()
{
    bool finished;
    {
        std::lock_guard<std::mutex> lock(object->m_mutex);
        finished = object->m_finished;
    }
    if (finished)
        delete object; // delete outside the lock so m_mutex is not destroyed while still held
}

By taking the lock both where m_finished is set and where it is checked, you prevent the object from being deleted before manipulate() completes.

The C++ standard does not guarantee the order of execution of operations in different threads unless explicit synchronization mechanisms, such as mutexes or atomic operations, are used. Without proper synchronization, the behavior of the code is undefined.

M Y
  • As I mentioned, use of a mutex or atomic fixed the problem. But my question is why they're required, given x64 architecture apparently guarantees memory ordering & load-store causality etc – Dave Poston Jun 09 '23 at 10:39