I am facing the following problem:

My program uses the concept of "tasks", where a task consists of code (functions) that together make up a critical area. Each task (object) knows what its "key" is, but that key might be shared with other tasks. The "key" (a pointer to yet another resource, called B below) does not change for the lifetime of the task. Yet it has to exist before I can create the second resource.

To clarify, from the point of view of a task:

  1. Create resource A, if that does not already exist.
  2. Once we have A, create resource B(A), if that does not already exist.

where B is a function of A: only one B should be created per A. Multiple such tasks exist and some might create different A's, or see that their A already exists and use the same A as other tasks.

The creation of B hence requires its own mutex: all tasks run in parallel, but only one B (per A) should be created.

Once B is created, however, I do not want to have to lock that mutex on every access just in case it is still being created. The creation happens only once, while reading will happen very often during the rest of the execution of the program.

The program flow is therefore as follows:

class Application {
  std::map<A*, B> all_B;
  std::mutex all_B_mutex;

  B* B_broker(A*);
};

struct A {
  std::atomic<B*> cached_B_ptr;
};
  • Each task, as soon as it has A, calls B_broker(A) - once. This function will lock a mutex and create B, if that doesn't exist yet, and return a pointer to that new B.

  • The thread that gets a non-NULL pointer from B_broker(A) stores this B* in a member variable of A: A::cached_B_ptr.

  • If B(A) already exists then B_broker(A) will return nullptr. The calling thread then waits until it sees that A::cached_B_ptr is non-null.

  • All accesses to A::cached_B_ptr use memory_order_relaxed, so that there is no difference from just reading a normal variable without a lock (hence the "lock free" in the title). During normal operation of the program this variable is only read, so there is no contention.

Although a given task runs inside its own critical area, different tasks might share the same A, so access to A::cached_B_ptr is not protected by a mutex (or, the same mutex anyway).
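The protocol above might be sketched as follows. This is only an illustration of the intended flow, not the actual program: the names B's contents, task_init, and the use of try_emplace are assumptions filled in for the example.

```cpp
#include <atomic>
#include <map>
#include <mutex>

struct B { int dummy; };  // stand-in for the real resource B (assumption)

struct A {
  std::atomic<B*> cached_B_ptr{nullptr};
};

class Application {
  std::map<A*, B> all_B;
  std::mutex all_B_mutex;

 public:
  // Returns a pointer to the newly created B(A), or nullptr if it already exists.
  B* B_broker(A* a) {
    std::lock_guard<std::mutex> lock(all_B_mutex);
    auto [iter, inserted] = all_B.try_emplace(a);  // one B per A
    return inserted ? &iter->second : nullptr;
  }
};

// Once-per-task initialization, as described in the bullet points above
// (hypothetical helper name).
B* task_init(Application& app, A* a) {
  if (B* b = app.B_broker(a))
    a->cached_B_ptr.store(b, std::memory_order_relaxed);  // we created B(A)
  B* b;
  while (!(b = a->cached_B_ptr.load(std::memory_order_relaxed)))
    ;  // another thread created B(A); wait until it publishes the pointer
  return b;
}
```

Note that pointers into a std::map stay valid across later insertions, so handing out &iter->second is safe here.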

The question now is: can this be done safely? Where,

  • thread A writes to A::cached_B_ptr (relaxed), without any related mutexes.
  • thread B runs a task unrelated to thread A and wants to use A::cached_B_ptr. B does not write to A::cached_B_ptr (its call to B_broker(A) returned nullptr), but since A::cached_B_ptr is not protected by a mutex it also can't rely on it already being set; so B waits until it reads a non-null value.
  • thread C continues to run the task of B and also wants to use A::cached_B_ptr, but this time without any checks or waiting. What B did is a once-per-task operation that is only done when a task is started, and only shortly after the program has started (when there is still a possible race with A writing to A::cached_B_ptr).

PS The following two questions might be related:

Carlo Wood
  • Do you *need* lazy init? There are many of these and some never get initialized? I initially thought of RCU for read-mostly, but in this case the old value isn't valid, so it's not useful to let other threads keep reading it while cooking up a new data structure. Instead you maybe use a separate guard variable like how compilers handle `static` local vars with non-constant initializers, so the fast path is just an acquire load. https://godbolt.org/z/Tjvbzo5xM (Or relaxed if you membarrier all threads?) – Peter Cordes Mar 04 '22 at 02:52

1 Answer

I tested this using genmc using the following program:

#include <stdlib.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <assert.h>

void __VERIFIER_assume(int);

atomic_int x;
atomic_int order;
atomic_int c;

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void* thread_A(void* arg)
{
  atomic_store_explicit(&x, 12345, memory_order_relaxed);
  return NULL;
}

void* thread_B(void* arg)
{
  pthread_mutex_lock(&mutex);

  // Read x and write it to b.
  int read_value = atomic_load_explicit(&x, memory_order_relaxed);
  __VERIFIER_assume(read_value == 12345);

  // Assume B runs before C.
  int o = atomic_load_explicit(&order, memory_order_relaxed);
  __VERIFIER_assume(o == 0);

  pthread_mutex_unlock(&mutex);
  return NULL;
}

void* thread_C(void* arg)
{
  pthread_mutex_lock(&mutex);

  // Read x and write it to c.
  int read_value = atomic_load_explicit(&x, memory_order_relaxed);
  atomic_store_explicit(&c, read_value, memory_order_relaxed);

  // Keep track of the order in which the threads B and C run.
  atomic_store_explicit(&order, 1, memory_order_relaxed);

  pthread_mutex_unlock(&mutex);
  return NULL;
}

int main()
{
  pthread_t tA;
  pthread_t tB;
  pthread_t tC;

  if (pthread_create(&tA, NULL, thread_A, NULL))
    abort();
  if (pthread_create(&tB, NULL, thread_B, NULL))
    abort();
  if (pthread_create(&tC, NULL, thread_C, NULL))
    abort();

  if (pthread_join(tA, NULL))
    abort();
  if (pthread_join(tB, NULL))
    abort();
  if (pthread_join(tC, NULL))
    abort();

  printf("c = %d\n",
      atomic_load_explicit(&c, memory_order_relaxed));

  return 0;
}

Note that genmc is for C code only, but the memory model it uses is the same as that of C++. It verifies the possible executions of the virtual machine upon which the memory model is based, at the level of LLVM Intermediate Representation, so it should not matter that the syntax uses pthreads.

The snippet under test runs a thread A that writes some value to x, which corresponds to our std::atomic<B*> A::cached_B_ptr.

void* thread_A(void* arg)
{
  atomic_store_explicit(&x, 12345, memory_order_relaxed);
  return NULL;
}

This is a relaxed write without any locking.

Then an independent thread B reads x. In the real program it would loop until it reads a non-NULL value; here we instead read it once and tell genmc to assume that it read the value that A wrote:

int read_value = atomic_load_explicit(&x, memory_order_relaxed);
__VERIFIER_assume(read_value == 12345);

A thread C plays the role of continuing the task that B started, after B finished this initialization (calling B_broker(A), seeing that it returned nullptr, and then reading A::cached_B_ptr until that returned a non-NULL value). Hence we are only interested in executions of this snippet where C runs after B. This is achieved by having an atomic order that is set to 1 in thread C, and assuming that it is still 0 in thread B:

// Assume B runs before C.
int o = atomic_load_explicit(&order, memory_order_relaxed);
__VERIFIER_assume(o == 0);

Finally, in thread C we read x to verify that it is always in sync with what A wrote, even though there is absolutely no synchronization with thread A and all accesses to x are relaxed. This is done by reading x in C and writing it to a global variable (c) for later inspection. After that, order is set to 1.

// Read x and write it to c.
int read_value = atomic_load_explicit(&x, memory_order_relaxed);
atomic_store_explicit(&c, read_value, memory_order_relaxed);

// Keep track of the order in which the threads B and C run.
atomic_store_explicit(&order, 1, memory_order_relaxed);

The output of the program is:

c = 12345
No errors were detected.
Number of complete executions explored: 1
Number of blocked executions seen: 2
Total wall-clock time: 0.01s

In other words, under the imposed restrictions (B ran before C, and B saw the value written by A), C also always sees that same value. If that were not the case, c = 0 would also have been printed and the number of complete executions explored would have been larger than 1.
