7

I'm using a lock free stack (via tagged pointers) to manage a pool of small blocks of memory. The list nodes are created and destroyed in-place when the blocks are inserted into, and removed from, the pool.

This is a very simplified test program, which only pops from the stack. So, no ABA problem and no tagged pointers. It is sufficient to demonstrate the race I'm running into:

#include <atomic>
#include <list>
#include <thread>
#include <type_traits>

struct Node {
  Node() = default;
  Node(Node *n) { next.store(n); }
  std::atomic<Node *> next;
};

using Memory = std::aligned_storage_t<sizeof(Node)>;

struct Stack {
  bool pop_and_use() {
    for (Node *current_head = head.load(); current_head;) {
      Node *next = current_head->next.load(); // READ RACE
      if (head.compare_exchange_weak(current_head, next, std::memory_order_seq_cst)) {
        current_head->~Node();
        Memory *mem = reinterpret_cast<Memory *>(current_head);
        new (mem) int{0}; // use memory with non-atomic write (WRITE RACE)
        return true;
      }
    }
    return false;
  }
  void populate(Memory *mem, int count) {
    for (int i = 0; i < count; ++i) {
      head = new (mem + i) Node(head.load());
    }
  }
  std::atomic<Node *> head{};
};

int main() {
  Memory storage[10000];
  Stack test_list;
  test_list.populate(storage, 10000);
  std::thread worker([&test_list]() {
    while (test_list.pop_and_use()) {
    };
  });
  while (test_list.pop_and_use()) {};
  worker.join();
  return 0;
}

Thread sanitizer reports the following error:

clang++-10 -fsanitize=thread tsan_test_2.cpp -o tsan_test_2 -O2 -g2 -Wall -Wextra && ./tsan_test_2
LLVMSymbolizer: error reading file: No such file or directory
==================
WARNING: ThreadSanitizer: data race (pid=35998)
  Atomic read of size 8 at 0x7fff48bd57b0 by thread T1:
    #0 __tsan_atomic64_load <null> (tsan_test_2+0x46d88e)
    #1 std::__atomic_base<Node*>::load(std::memory_order) const /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/bits/atomic_base.h:713:9 (tsan_test_2+0x4b3e6c)
    #2 std::atomic<Node*>::load(std::memory_order) const /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/atomic:452:21 (tsan_test_2+0x4b3e6c)
    #3 Stack::pop_and_use() /home/BOSDYN/akhripin/tmp/tsan_test_2.cpp:17:39 (tsan_test_2+0x4b3e6c)
    #4 main::$_0::operator()() const /home/BOSDYN/akhripin/tmp/tsan_test_2.cpp:40:22 (tsan_test_2+0x4b3e6c)
    #5 void std::__invoke_impl<void, main::$_0>(std::__invoke_other, main::$_0&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/bits/invoke.h:60:14 (tsan_test_2+0x4b3e6c)
    #6 std::__invoke_result<main::$_0>::type std::__invoke<main::$_0>(main::$_0&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/bits/invoke.h:95:14 (tsan_test_2+0x4b3e6c)
    #7 decltype(std::__invoke(_S_declval<0ul>())) std::thread::_Invoker<std::tuple<main::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/thread:244:13 (tsan_test_2+0x4b3e6c)
    #8 std::thread::_Invoker<std::tuple<main::$_0> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/thread:253:11 (tsan_test_2+0x4b3e6c)
    #9 std::thread::_State_impl<std::thread::_Invoker<std::tuple<main::$_0> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/8/../../../../include/c++/8/thread:196:13 (tsan_test_2+0x4b3e6c)
    #10 <null> <null> (libstdc++.so.6+0xbd6de)

  Previous write of size 4 at 0x7fff48bd57b0 by main thread:
    #0 Stack::pop_and_use() /home/BOSDYN/akhripin/tmp/tsan_test_2.cpp:21:9 (tsan_test_2+0x4b3d5d)
    #1 main /home/BOSDYN/akhripin/tmp/tsan_test_2.cpp:43:20 (tsan_test_2+0x4b3d5d)

  Location is stack of main thread.

  Location is global '??' at 0x7fff48bad000 ([stack]+0x0000000287b0)

  Thread T1 (tid=36000, running) created by main thread at:
    #0 pthread_create <null> (tsan_test_2+0x4246bb)
    #1 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) <null> (libstdc++.so.6+0xbd994)
    #2 __libc_start_main /build/glibc-OTsEL5/glibc-2.27/csu/../csu/libc-start.c:310 (libc.so.6+0x21b96)

SUMMARY: ThreadSanitizer: data race (/home/BOSDYN/akhripin/tmp/tsan_test_2+0x46d88e) in __tsan_atomic64_load
==================
ThreadSanitizer: reported 1 warnings

The problem arises when the two threads read the same value of current_head, but one of them completes the pop and overwrites the node before the other has a chance to read current_head->next.

This is similar to the problem discussed here: Why would 'deleting' nodes in this lock-free stack class would cause race condition? except the memory is not actually being deallocated.

I know that from the machine's perspective, this race is benign -- if the read race occurs, the compare-and-swap will not succeed -- but I think this is still getting into undefined behavior territory in C++.

  1. Is there any way to write this code without getting a race condition?
  2. Is there any way to annotate the code to make thread sanitizer ignore it? I experimented with __tsan_acquire and __tsan_release but could not find something that consistently worked.

Update I'm pretty convinced that there is no way to perform the atomic read safely in standard C++ -- the object just doesn't exist any more. But -- can I go from relying on undefined behavior to relying on implementation-defined behavior? What's the best I could do, given typical architectures and toolchains (x86/ARM, gcc/clang)?

Update 2 One implementation-specific approach that seems to work is to replace the load with inline assembly:

inline Node *load_next_wrapper(Node *h) {
  Node *ret;
  asm volatile("movq (%1), %0" : "=r"(ret) : "r"(&h->next));
  return ret;
}

This is both architecture and compiler specific -- but I think this does replace "undefined" behavior with "implementation-defined" behavior.

Alex K
  • 171
  • 3
  • 1
    I think the crux of the problem is not the write, but that one thread is reading from an atomic after another thread has _destroyed it_. It's fundamentally the same problem as the deallocation. – Mooing Duck May 20 '20 at 18:02
  • The only solutions I can think of are (fixing undefined behavior) using a lock, or (keeping undefined behavior) inserting a `std::this_thread::yield()` after the CAS. – Mooing Duck May 20 '20 at 18:08
  • 3
    "How do I avoid or suppress the race in this lock free stack?" Is a rather worrying question. I mean, the "avoid" bit is fine, but the "suppress" bit is scarry. Why would you *ever* want to suppress a warning or error about a race condition? That would just make you blind to the problem while still leaving it in your code, ready to explode at any time. Suppression is *very rarely* the way you want to go. – Jesper Juhl May 20 '20 at 18:10
  • 4
    [Further discussion here](https://nullprogram.com/blog/2014/09/02/) under "Hazard Pointers and Garbage Collection". It's hard. – Raymond Chen May 20 '20 at 18:14
  • 3
    @MooingDuck "keeping undefined behaviour" is not really an option, ever. Once your program has UB *anywhere*, the compiler is free to generate garbage for *any* part of your program. The entire thing is undefined at that point, not just the statements containing the UB (and yes, compilers *do* exploit this freedom). – Jesper Juhl May 20 '20 at 18:15
  • @MooingDuck -- that's a fair point, as far as the well-definedness of the code is concerned. I don't think that can be avoided, using `std::atomic`... So is having well defined C++ out of the question here? (Side note: thread sanitizer still complains if I switch to bare pointers and __atomic builtins -- so the question of how to quiet it seems to be a separate one) – Alex K May 20 '20 at 18:18
  • 3
    @AlexK "from the machine's perspective, this race is benign" - But, you are not programming the machine. You are programming the abstract machine specified by the C++ standard. And once you do something that is undefined according to the standard when working with that abstract machine, you just gave the compilers optimizer a nuclear missile and a license to kill - and most modern compilers will take full advantage of that and optimize accordingly - and you probably won't like the result. What the *physical* machine guarantees is not that relevant once the compiler has mangeled your program. – Jesper Juhl May 20 '20 at 18:26
  • @JesperJuhl right -- I'm just saying that idea isn't fundamentally wrong; I just don't know how to write it correctly in C++. If that's impossible generally -- are there, at least, some compiler-specific ways to do so? – Alex K May 20 '20 at 18:46
  • 1
    @AlexK Personally, I'd just use a lock. It gives the guarantee you need/want and (at least in the domain I work in) won't actually matter much performance wise. Test/benchmark everything for *your* use-case though! – Jesper Juhl May 20 '20 at 19:03
  • @JesperJuhl as a general rule, I'd agree, but in this case, it does matter. I think if I don't come up with an answer here, I will just not destroy or overwrite the Nodes, by reserving extra space for them. – Alex K May 20 '20 at 19:11
  • @MooingDuck it looks like the undefined behavior might kick in not at destruction but at the construction of the new object: "For all other objects (class objects initialized by a trivial default constructor, non-class objects, arrays of those, etc.), lifetime begins when the properly-aligned storage for the object is allocated and ends when the storage is deallocated or **reused by another object**" Doesn't actually help, though. – Alex K May 20 '20 at 19:26
  • 1
    @AlexK: `std::atomic` does not have a default constructor, so you're referring to the wrong section. Lifetime ends at destruction. – Mooing Duck May 20 '20 at 19:50
  • @AlexK you simplified your code too much. Why are you destroying the object? What are you doing with it afterwards if you don't release the memory? Please provide a minimal reproducible example. – mpoeter May 20 '20 at 21:43
  • @mpoeter Another object of some type will be constructed in the same location. Eventually, it will be destroyed and the memory would return to the pool. Adding those elements doesn't add anything but extra lines to the example. – Alex K May 20 '20 at 21:55
  • As clearly shown in the error message, you write (`new (mem) int { 0 };` is 4 bytes and of the wrong type. As far as I know, this is undefined behavior if the same data is read as a `Node *`. The message seems to indicate that you have read that data as if it was a pointer (8 bytes). My opinion is that for real application, you should consider either using a free-lock structure made by an expert or use regular lock. **C++ Concurrency In Action** is a must read book if you are serious with multithreading. – Phil1970 May 21 '20 at 01:28
  • @JesperJuhl In fact, in the vast majority of cases, locks give better performance under high load because they cause contending threads to be less likely to be scheduled concurrently. – David Schwartz May 21 '20 at 17:21

1 Answers1

0

Tagged pointers are fine if you simply want to reuse the same nodes in the data structure, i.e., you don't destroy it, but simply put it on a free-list so it can be reused when you need a new node in the next push operation. In this case tagged pointers are sufficient to prevent the ABA problem, but they are no solution to the _ memory reclamation problem_ that you face here.

Another object of some type will be constructed in the same location. Eventually, it will be destroyed and the memory would return to the pool.

This is the real issue - you are destroying the object and reusing the memory for something else. As many others have already explained in the comments this causes undefined behavior. I am not sure what you mean by "return to the pool" - return to the memory manager? Ignoring the UB for a moment - you are right that this race is usually benign (from the hardware perspective), but if you do release the memory at some point, you could actually run into a segmentation fault (e.g. in case the memory manager decides to return the memory to the OS).

How to avoid undefined behavior in this scenario

If you want to reuse the memory for something else, you have to use a memory reclamation scheme like lock-free reference counting, hazard pointers, epoch based reclamation or DEBRA. These can ensure that an object is only destroyed once it is guaranteed that all references to it have been dropped, so it can no longer be accessed by any thread.

My xenium library provides C++ implementations of various reclamation schemes (including all those previously mentioned) that you could use in this situation.

mpoeter
  • 2,574
  • 1
  • 5
  • 12