
I am reading the second edition of C++ Concurrency in Action. The code below is from Listing 7.6. It implements pop() for the stack using hazard pointers.

std::shared_ptr<T> pop() {
  std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
  node* old_head = head.load();  // #1
  do {
    node* temp;
    do {
      temp = old_head;
      hp.store(old_head);        // #2
      old_head = head.load();    // #3
    } while (old_head != temp);  // #4
  } while (old_head &&
           !head.compare_exchange_strong(old_head, old_head->next));
  hp.store(nullptr);
  // ...
}

The book explains the role of the inner loop:

You have to do this in a while loop to ensure that the node hasn't been deleted between the reading of the old head pointer #1 and the setting of the hazard pointer #2. During this window no other thread knows you're accessing this particular node. Fortunately, if the old head node is going to be deleted, head itself must have changed, so you can check this and keep looping until you know that the head pointer still has the same value you set your hazard pointer to #4.

According to the implementation of pop, if another thread pops and deletes the head node between #1 and #2, then head will have been changed to point to the new head node.

What confuses me is: can the modification of head by another thread be seen by the current thread in time? For example, if the new value of head has not yet propagated to the current thread, then #1 and #3 will both read the same (old) value, causing the inner while loop to exit; the outer while loop then dereferences old_head->next, resulting in undefined behavior.

I've searched stackoverflow for some answers. For example, this answer says

The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.

This comment says

Every atomic variable has its own modification order that all threads can agree on, but that only serializes modifications, not reads. The coherency requirements involving reads just guarantee that if you've seen one value in the modification order, you can't see an earlier one.

But cppreference says

Each memory_order_seq_cst operation B that loads from atomic variable M, observes one of the following:

  • the result of the last operation A that modified M, which appears before B in the single total order

...

So what exactly should the answer to my question be?

Also, if a weaker ordering (like release-acquire or relaxed) is used here, would the above problem arise?


Here is the code of pop:

std::shared_ptr<T> pop() {
  std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
  node* old_head = head.load();  // #1
  do {
    node* temp;
    do {
      temp = old_head;
      hp.store(old_head);        // #2
      old_head = head.load();    // #3
    } while (old_head != temp);  // #4
  } while (old_head &&
           !head.compare_exchange_strong(old_head, old_head->next));
  hp.store(nullptr);  // Clear hazard pointer once you're finished
  std::shared_ptr<T> res;
  if (old_head) {
    res.swap(old_head->data);
    if (outstanding_hazard_pointers_for(old_head)) // Check for hazard pointers referencing a node before you delete it.
      reclaim_later(old_head);
    else
      delete old_head;
    delete_nodes_with_no_hazards();
  }
  return res;
}

pop() pops the node pointed to by head and frees it once no hazard pointer refers to it. head is modified via compare_exchange_strong.

Peter Cordes
Pluto
  • How is the hazard pointer `hp` used elsewhere in the code? It can have no effect on this function itself, because it is only written and never read. – Nate Eldredge Oct 10 '22 at 13:56
  • When thinking about lock-free atomic programming, I find it helpful to forget about the word "time" and everything related to it. The whole point of the memory model is that it is defined in terms of orders which in principle have nothing to do with the clock on the wall (though in real life they probably do). I presume the "stale" in the first quoted answer means "you may fail to load a value which was stored earlier with respect to real time", but that should be irrelevant, as unless you are doing real-time programming, your program should not care how real time passes. – Nate Eldredge Oct 10 '22 at 14:04
  • @NateEldredge I only selected part of the code because I don't think the rest is relevant to the problem. `hp` is used in the rest of `pop`. – Pluto Oct 10 '22 at 14:07
  • @NateEldredge But if the previously stored value is not read, it can cause the problem I describe. Is this possible? – Pluto Oct 10 '22 at 14:09
  • I think its use *is* relevant. The code here seems to be trying to preserve some invariant relationship between `head` and the hazard pointer. It's not clear to me from this code alone, or your excerpt of the explanation, what that invariant is supposed to be. But we could perhaps infer it by looking at whatever code elsewhere actually relies on that invariant. It can't be the code here because its behavior is totally unaffected by the value of the hazard pointer. Once we know what the invariant is, we can see whether this code correctly preserves it. – Nate Eldredge Oct 10 '22 at 14:12
  • Thread safety is a relational property; two pieces (or more) of code are thread safe relative to each other. Asking if a single thread is thread safe is nonsense: at best an implied family of other threads is assumed. You have shared one piece of code. You need the code where head is modified, and possibly where hp is used, to determine relative thread safety. This inability for separate analysis is why concurrency is fundamentally hard. – Yakk - Adam Nevraumont Oct 10 '22 at 14:14
  • @NateEldredge OK, I'll post the full code. But I think I can briefly describe the relationship between `hp` and `old_head`. The code requires that `hp` and `old_head` must have the same value, because only if there is a `hp` pointing to a node, the node will not be deleted, and thus the pointer will be valid. So, to safely dereference `old_head`, `hp` must be guaranteed to have the same value as it. – Pluto Oct 10 '22 at 14:17
  • "`hp` must be guaranteed to have the same value as it." But *when*? All we ever can actually observe is whether some load of `hp` and some load of `head` return the same value. Those two loads can't be simultaneous, so it is crucial to know how they are ordered with respect to each other. – Nate Eldredge Oct 10 '22 at 14:20
  • @NateEldredge Yes. From what I understand, when we try to dereference `old_head`, we must ensure that the value of `hp` is the same as it, not `head` but `old_head`, because `old_head` is the node to be deleted by the current thread. But IMHO this seems to have strayed from my question. – Pluto Oct 10 '22 at 14:31
  • Well, your question was "is the code flawed", which can't be answered without seeing both sides of it. But I think I see what you are asking. It is possible that another thread stored to `head` (call the store W) "between" #1 and #3 with respect to real time, yet #1 and #3 both return the same value. But if so, then W must come *after* #3 in the sequentially consistent total ordering, and as such, also comes after #2. This would allow us to deduce that the other thread will not free the pointer which we loaded at #1 and #3. – Nate Eldredge Oct 10 '22 at 15:23
  • As you see, where W occurred in real time is simply irrelevant. The values observed by loads #1 and #3 tell us something about where W appears in the total order, and that's all that matters for the purposes of describing the program's observable behavior. – Nate Eldredge Oct 10 '22 at 15:25
  • And as a quick comment, sequential consistency is essential here. Acquire/release, much less relaxed, cannot stop StoreLoad reordering, and so could not ensure that #2 and #3 are observed in that order by other threads, which is critical for the algorithm. I'll try to post a more complete answer later. – Nate Eldredge Oct 10 '22 at 15:28
  • @NateEldredge thanks. I look forward to your answer. But I found [a project](https://github.com/tokio-rs/tokio/issues/1768#issuecomment-554823949) that once caused a bug by using load operations with seq-cst order. Thread A modifies an atomic variable and then sleeps. Thread B reads the atomic variable and determines whether A should be woken up based on the value. However, because the Load operation did not read the updated value in time, thread B did not wake up A. – Pluto Oct 11 '22 at 01:50
  • @NateEldredge The project then replaced the Load operation with fetch_add(0) to [fix the bug](https://github.com/tokio-rs/tokio/commit/bf56b50a855f5d1f08f15dce0f4138e17cfa6f6e). I think the problem that arises here is the same as the one I describe. How to explain this? – Pluto Oct 11 '22 at 01:51
  • @Pluto: Well, the stack code in your question doesn't have a problem at all AFAICS, so it can't be the same problem. I haven't yet worked out what the tokio bug actually was, and on first impression, I'm skeptical that `fetch_add(0)` actually fixed it. It may very well have *hidden* the bug so it no longer appears in practice, at least on whatever compilers and CPU architectures they tested, because real machines often provide stronger ordering guarantees than the abstract memory model requires. But my sneaking suspicion is that in theory, their unwanted race can still occur. – Nate Eldredge Oct 11 '22 at 05:10
  • On x86 for instance, every atomic read-modify-write is a full barrier, so a `fetch_add(0)`, even if it were relaxed, would have similar effect to a seq_cst fence. But the memory model doesn't guarantee that, and on other architectures it doesn't happen. – Nate Eldredge Oct 11 '22 at 05:12

1 Answer


No, I don't think it's flawed.

To verify that the algorithm is correct, let me annotate a few more lines of code here. I've rewritten the hazard-pointer check (#8) to make it clear that it does a load from the other thread's hazard pointer.

std::shared_ptr<T> pop() {
  std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
  node* old_head = head.load();  // #1
  do {
    node* temp;
    do {
      temp = old_head;
      hp.store(old_head);        // #2
      old_head = head.load();    // #3
    } while (old_head != temp);  // #4
  } while (old_head &&
           !head.compare_exchange_strong(old_head, old_head->next));
                                 // #5 deref of old_head
                                 // #6 the compare_exchange
  hp.store(nullptr);             // #7 
  std::shared_ptr<T> res;
  if (old_head) {
    res.swap(old_head->data);
    if (other_thread_hp.load() == old_head)
                                 // #8
      reclaim_later(old_head);
    else
      delete old_head;           // #9
    delete_nodes_with_no_hazards();
  }
  return res;
}

Informal justification

Suppose thread A is trying to safely dereference old_head while thread B wants to delete it.

It is entirely possible that, for instance, the compare_exchange_strong in thread B, operation #6 (B6 for short), occurs between the loads A1 and A3 with respect to real time, but does not become visible to thread A until later.

However, we have sequential consistency. Every thread must observe the operations B6 and B8 in that order. So if B6 didn't become visible to A until after A3, then B8 cannot become visible to A until later still, and by that time, store A2 will have taken effect. In other words, load B8 must observe all stores that preceded A3, including in particular A2. So either B8 will return the non-null hazard pointer from A, in which case the delete is not done; or else it returns nullptr, which can only happen if store A7 has become visible, and by that time the dereference A5 has already completed.

Thus, if there can be a delay between when B6 is executed and when it becomes globally visible, then the implementation must arrange that all subsequent operations in thread B, including in particular B8, are stalled until after B6 does become visible. On present-day hardware, the typical reason for such a delay is that the store at B6 went into a store buffer. So to ensure sequential consistency, the compiler has to insert a barrier instruction between B6 and B8, which will wait for the store buffer to drain and commit to coherent L1 cache before continuing on.

Edit: To your question in comments about delaying non-atomic operations: Things are complicated a bit because B6 is a read-modify-write, but for memory ordering purposes, you can think of it as consisting of a load (B6L) and a store (B6S), both of which are seq_cst. For purposes of ordering with respect to non-seq_cst operations, including all non-atomic operations, a seq_cst store is simply a release store, and a seq_cst load is merely an acquire load. So indeed, non-atomic operations that follow B6S do not have to "stall" and, unless otherwise restricted, may become visible before B6S does. (They cannot become visible before B6L however, because it is acquire.)

But by the same token, B8 is acquire. This does require later operations, including non-atomic operations such as B9, to stall until after B8 has become visible. (Here B9 is on one side of a conditional branch which depends on the value loaded by B8, but without the acquire barrier, it could still start doing loads speculatively.) So indeed, the net result is that B9 must become visible only after B6S, because B9 must wait for B8 which must wait for B6S.


Formal proof of correctness

Remember that the C++ memory model has no concept of "in time" or "stale"; everything is defined in terms of abstract orderings. Here all the atomic operations are seq_cst by default, so there is a total order on all of them, with coherency rules ensuring that they observe each other in all the usual ways.

We'll write A1, B1, etc, for operation #1 as executed by Threads A or B respectively. Also, let hpA, hpB be the hazard pointers belonging to the corresponding threads. Let H0 be the value of head going into the code here, which both threads initially load as their old_head, and H1, H2 the addresses of the following nodes.

We want to make sure that A5 happens before B9. If A5 is going to dereference pointer H0, it must be that loads A1, A3 both returned H0, which means that A2 stored H0 to hpA. (Or if A1 didn't, then both the second-to-last and last iterations of A3 must have loaded H0; either way the conclusion is that A2 stored H0.)

Likewise, if B6 is going to delete H0, it must be that B6 loaded H0 from head and stored H1. So if both these conditions hold, then A3 precedes B6 in the total order (or A3 < B6 for short); otherwise A3 would have loaded H1 instead. By transitivity, and the fact that the seq_cst total order is consistent with sequencing (program order), we have A2 < A3 < B6 < B8.

Now since it is a total order, either A7 < B8 or A7 > B8.

  • If A7 < B8, then B8 observes nullptr stored by A7. Since A7 was a release store (implied by seq_cst), and B8 was an acquire load that observed the value stored by A7, we have that A7 happens before B8. By sequencing A5 happens before A7, and B8 happens before B9, so A5 happens before B9. Thus B9 will delete H0, but A5 has already finished dereferencing it and there is no data race or use-after-free.

  • If A7 > B8, then we have A3 < B8 < A7. So B8 must observe the store A3 (which stored H0 to hpA), and it must not observe store A7 which stored nullptr. So in this case B8 loads the value H0, which equals thread B's old_head, and the deletion of H0 is deferred by thread B. Thus the dereference at A5 was safe, because H0 is not being deleted at all.


Acquire-release ordering would not be good enough for this algorithm. Informally, acquire-release forbids LoadLoad, StoreStore, and LoadStore reordering, but still allows for StoreLoad reordering. As such, A2 could become visible after A3, which would be disastrous. Edit: And to your comment below, yes, B6S could also become visible after B8 and B9 (with B7 becoming visible later still).

Relaxed ordering would be even worse; for instance, A7 could become visible before A5 has completed.

Nate Eldredge
  • *the implementation must arrange that all subsequent operations in thread B, including in particular B8, are stalled until after B6 does become visible.* Does the delayed operation include other non-atomic operations? For example, the (possible) `delete` operation at `#9` here. – Pluto Oct 11 '22 at 08:48
  • Also, if a weaker ordering is used, will the operations after B6 still be delayed if B6's modifications to the head are not seen by A1 and A3? That is, is it possible for A to see the deleted node first, and then see the modification result of B6? – Pluto Oct 11 '22 at 09:21
  • @Pluto: Regarding both comments, see edits in the question. – Nate Eldredge Oct 11 '22 at 13:28