
In the paper "Non-blocking synchronization between real-time and non-real-time applications" the following pseudo code is given as Listing 2. It suggests a non-blocking implementation for a single real-time writer and one or more non-real-time readers.

volatile counter : pointer to shared memory area of an integer
data_size : unsigned
pt_base : pointer to shared memory area of data_size bytes

initialization (d_size: unsigned)
    counter := 0
    data_size := d_size
    pt_base := allocate data_size bytes in the shared memory

//Real-time operation
write (new_value: pointer)
    local_counter := *counter
    *counter := local_counter + 1
    copy data_size bytes from *new_value to *pt_base
    *counter := local_counter + 2

//Non-real-time operation
read (pt_data: pointer)
    loop
        counter_begin := *counter
        copy data_size bytes from *pt_base to *pt_data
        counter_end := *counter
        if (counter_end == counter_begin and counter_begin is even)
            break;
        endif
    endloop

How can this correctly be implemented in C with either C11 _Atomic, gcc/clang __atomic or just using volatile and memory barriers (if that is even possible)?

In his answer to another question, Peter Cordes hinted that volatile and memory barriers could be enough, or at least that is how I understand it.
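For reference, a direct C11 `_Atomic` translation of the paper's Listing 2 might look like the sketch below. The function names (`seq_init`, `seq_write`, `seq_read`) are mine, not from the paper, and note that the non-atomic `memcpy` of the payload is formally a data race under the C11 memory model, which is exactly the caveat discussed in the linked seqlock question:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static _Atomic uint32_t counter;   // sequence counter (even = payload stable)
static size_t data_size;
static unsigned char *pt_base;     // shared payload buffer

void seq_init(size_t d_size)
{
    atomic_init(&counter, 0);
    data_size = d_size;
    pt_base = malloc(data_size);
}

// Real-time writer: never blocks, never retries.
void seq_write(const void *new_value)
{
    uint32_t c = atomic_load_explicit(&counter, memory_order_relaxed);
    atomic_store_explicit(&counter, c + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   // order counter++ before payload stores
    memcpy(pt_base, new_value, data_size);
    atomic_store_explicit(&counter, c + 2, memory_order_release);
}

// Non-real-time reader: retries until it sees a consistent snapshot.
void seq_read(void *pt_data)
{
    uint32_t begin, end;
    do {
        begin = atomic_load_explicit(&counter, memory_order_acquire);
        memcpy(pt_data, pt_base, data_size);
        atomic_thread_fence(memory_order_acquire); // order payload loads before re-check
        end = atomic_load_explicit(&counter, memory_order_relaxed);
    } while (begin != end || (begin & 1));
}
```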

EDIT: So this is what I have now.

#include <stdatomic.h>
#include <stdint.h>

struct latency
{
    uint64_t counter;
    uint64_t active;
    uint64_t minimal;
    uint64_t maximal;
    uint64_t sum;
};

struct timer
{
     //timer related stuff
     struct latency latency;
     _Atomic uint32_t latency_seqcount;
};

static struct timer *timer;

//writer invoked by real-time thread
static void writer(void)
{
    uint32_t seqcount;
    struct latency *latency;

    //setup timer / next period
    //timer wait
    //calculate latency

    seqcount = atomic_load_explicit(&timer->latency_seqcount, memory_order_relaxed);
    atomic_store_explicit(&timer->latency_seqcount, seqcount + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    latency = &timer->latency;
    latency->active = latency_domain; //latency_domain computed above (not shown)

    if (latency->counter == 0)
    {
        latency->maximal = latency->active;
        latency->minimal = latency->active;
    }
    else if (latency->active > latency->maximal)
    {
        latency->maximal = latency->active;
    }
    else if (latency->active < latency->minimal)
    {
        latency->minimal = latency->active;
    }

    latency->counter++;
    latency->sum += latency->active;

    atomic_store_explicit(&timer->latency_seqcount, seqcount + 2, memory_order_release);
}

//reader invoked by non-real-time thread
static void reader(void)
{
    struct latency latency;
    uint32_t seqcount_begin;
    uint32_t seqcount_end;

    do
    {
        seqcount_begin = atomic_load_explicit(&timer->latency_seqcount, memory_order_acquire);

        latency.counter = timer->latency.counter;
        latency.active = timer->latency.active;
        latency.minimal = timer->latency.minimal;
        latency.maximal = timer->latency.maximal;
        latency.sum = timer->latency.sum;

        atomic_thread_fence(memory_order_acquire);
        seqcount_end = atomic_load_explicit(&timer->latency_seqcount, memory_order_relaxed);
    }
    while (seqcount_begin != seqcount_end || seqcount_end & 0x01);
    
    //print
}
Markus Fuchs
  • Sounds like you are looking for seqlock. See https://stackoverflow.com/questions/56419723/which-of-these-implementations-of-seqlock-are-correct – Support Ukraine Mar 16 '23 at 07:32
  • See also [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) for a C++11 `std::atomic` version of a SeqLock, which is trivially portable to C11 `_Atomic`. And some discussion about implementing it in ISO C++, almost all of which applies to ISO C. But yes, I'd also recommend a SeqLock; it's never blocking for the writer, only the readers sometimes have to retry. In fact the pseudo-code you've shown is for a SeqLock, but missing the necessary memory ordering and barriers to make sure the stores happen in program order despite non-atomic "payload". – Peter Cordes Mar 16 '23 at 08:03
  • @PeterCordes It's exactly my problem that these memory orderings and barriers are missing. And also that the reference and many other references show how it's done in C++ but I want to see correct use in C, and also not abstracted by seqlock_t / READ_ONCE kernel abstractions, but rather in plain C11 _Atomic / gcc __atomic implementing exactly the algorithm described above. – Markus Fuchs Mar 16 '23 at 08:11
  • @PeterCordes my problem is basically that I have a real-time thread with a timer and after I wake up I determine current latency by calculating latency = currentTime - expirationTime and save min/max latency. In another thread (non-real-time FUSE interface) I want to print the latencies. So I need to share this data between threads but don't want to use a mutex as this could block my real-time thread if reading happens at the wrong time. – Markus Fuchs Mar 16 '23 at 08:24
  • My code in the linked answer is in C++11; like I said, easy to translate 1:1 to C11. e.g. `seqcount.load(std::memory_order_acquire);` becomes `atomic_load_explicit(&seqcount, memory_order_acquire)`. And `std::atomic_thread_fence(std::memory_order_acquire)` is unchanged except for dropping the `std::`. The code is templated on the type of the payload; you can just manually substitute any type for `T`, such as `struct payload`. – Peter Cordes Mar 16 '23 at 08:42
  • `std::atomic seqcount{0};` is `_Atomic unsigned seqcount = 0;`, or `atomic_uint seqcount = 0;` – Peter Cordes Mar 16 '23 at 08:46
  • @PeterCordes Thank you very much for your help. I updated my answer with what I basically have now. – Markus Fuchs Mar 16 '23 at 12:10
  • This looks like "many non-real-time threads that can't make useful progress opportunistically steal both CPU time and memory bandwidth away from a real-time thread, to cripple scalability and ruin any hope of real-time properties" to me. – Brendan Mar 16 '23 at 12:14
  • @Brendan I am open for suggestions / literature on how to do it properly. – Markus Fuchs Mar 16 '23 at 12:25
  • @MarkusFuchs: In user-space; there's only 2 ways to avoid a "lock-free wastes resources on work discarded by retries" problem: find a block-free algorithm; or involve the scheduler in a "don't give this thread CPU time until...." way (e.g. mutex/futex, pthread_cond_vars, etc). The alternative is to reduce the magnitude of the problem - busy-waiting loops with explicit pauses and that do less work - e.g. maybe like `loop { if (counter_end == counter_begin and counter_begin is even) { copy data_size bytes from *pt_base to *pt_data *data_size; break; } else { pause(); } }`. – Brendan Mar 16 '23 at 12:39
  • That looks reasonable in terms of barriers and stuff, although it won't compile because you have `struct latency *latency;` (an uninitialized pointer to a struct) instead of the `struct latency latency;` that matches how you access it with `.member`. You also don't use the `void *arg` function args. – Peter Cordes Mar 16 '23 at 16:20
  • `fence(release);` `store(relaxed)` can be optimized to `store(release)`, which would help on some non x86-64 ISAs like AArch64 that have release-store instructions. A separate `fence()` is needed only at the top of the writer, and bottom of the reader. The reader can also early-out on an odd seqcount instead of reading all members (and like Brendan said, perhaps sleep in that case, although the amount of work you do in the writer while the lock is "held" is pretty minimal, and the writer is RT, so maybe just `_mm_pause()` to avoid stealing the cache line again before the writer finishes.) – Peter Cordes Mar 16 '23 at 16:24
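The early-out-plus-pause idea from the comment above could be sketched like this (names are mine; `_mm_pause()` is x86-specific, so I wrap it in a hypothetical `cpu_relax` macro with a no-op fallback):

```c
#include <stdatomic.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)   /* fallback: plain spin */
#endif

struct latency { uint64_t counter, active, minimal, maximal, sum; };

static struct latency shared_latency;   // payload written by the RT thread
static _Atomic uint32_t seqcount;

void read_latency(struct latency *out)
{
    uint32_t begin, end;
    for (;;) {
        begin = atomic_load_explicit(&seqcount, memory_order_acquire);
        if (begin & 1) {            // writer mid-update: skip the copy entirely
            cpu_relax();
            continue;
        }
        *out = shared_latency;      // non-atomic snapshot of the payload
        atomic_thread_fence(memory_order_acquire);
        end = atomic_load_explicit(&seqcount, memory_order_relaxed);
        if (begin == end)
            return;                 // consistent snapshot
        cpu_relax();                // writer intervened: retry
    }
}
```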
  • Also, since your writer already does `++` on one of the member, you could just use that as the sequence number. Have readers `>>1` it to derive the actual count from variable that the writer increments twice. Fewer operations in the "critical section" means fewer retries for the readers. (You might want to pull the branching out of the seqlock itself and just `memcpy` the payload to further mitigate what Brendan pointed out about readers potentially retrying more and stealing more CPU time / cache contention than necessary, depending on how infrequently they actually read.) – Peter Cordes Mar 16 '23 at 17:25
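Peter's suggestion of reusing the sequence word itself as the write counter might look like the following sketch (names and the simplified one-field payload are mine; the point is that the payload loses its separate `counter` member and readers derive the count with `>> 1`):

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t seqcount;   // incremented twice per completed write

// Writer side (RT thread): the two increments double as the write count.
static void publish(uint64_t sample, uint64_t *sum /* payload, simplified */)
{
    uint32_t seq = atomic_load_explicit(&seqcount, memory_order_relaxed);
    atomic_store_explicit(&seqcount, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    *sum += sample;                 // payload update; no separate counter++
    atomic_store_explicit(&seqcount, seq + 2, memory_order_release);
}

// Reader side: the number of completed writes is just seqcount >> 1.
static uint64_t write_count(void)
{
    return atomic_load_explicit(&seqcount, memory_order_relaxed) >> 1;
}
```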
  • You can reduce contention by using multiple buffers. Same algorithm, except that you write buffer[(local_counter>>1)&3]; – Matt Timmermans Mar 17 '23 at 02:04
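Matt Timmermans' multi-buffer variant could be sketched like this (function names are mine; four rotating payload slots, with the writer choosing its slot from the sequence number as suggested, so readers still copying the previous slot are not disturbed by the next write):

```c
#include <stdatomic.h>
#include <stdint.h>

struct latency { uint64_t counter, active, minimal, maximal, sum; };

static struct latency slots[4];          // rotating payload buffers
static _Atomic uint32_t seqcount;

// Writer: write n lands in slot n & 3, derived as (seq >> 1) & 3.
static void write_slot(const struct latency *src)
{
    uint32_t seq = atomic_load_explicit(&seqcount, memory_order_relaxed);
    struct latency *slot = &slots[(seq >> 1) & 3];
    atomic_store_explicit(&seqcount, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    *slot = *src;
    atomic_store_explicit(&seqcount, seq + 2, memory_order_release);
}

// Reader: the most recently completed write is in slot ((seq >> 1) - 1) & 3.
static void read_slot(struct latency *dst)
{
    uint32_t begin, end;
    do {
        begin = atomic_load_explicit(&seqcount, memory_order_acquire);
        *dst = slots[((begin >> 1) - 1) & 3];
        atomic_thread_fence(memory_order_acquire);
        end = atomic_load_explicit(&seqcount, memory_order_relaxed);
    } while (begin != end || (begin & 1));
}
```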

0 Answers