
I have 2 buffers of size N. I want to write to the buffers from different threads without using locks.

I maintain a buffer index (0 or 1) and an offset where the next write into the buffer starts. If I can read the current offset and set it to offset + len_of_the_msg in an atomic manner, that guarantees the different threads will not overwrite each other. I also have to take care of buffer overflow: once a buffer is full, switch buffers and set the offset to 0.

Tasks to do, in order:

  • set a = offset

  • increment offset by msg_len

  • if offset > N: switch buffer, set a to 0, set offset to msg_len

I am implementing this in C. Compiler is gcc.
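Roughly, the non-atomic version of what I want looks like this (just a sketch to show the intended logic; sizes and names are placeholders):

// Non-atomic sketch of the intended logic (placeholder names).
// These are the steps I want to make atomic / lock-free.
#include <stddef.h>

#define N 4096
static char   buffers[2][N];
static int    buf_idx = 0;   // current buffer: 0 or 1
static size_t offset  = 0;   // next free byte in buffers[buf_idx]

static char *reserve(size_t msg_len)
{
    size_t a = offset;            // set a = offset
    offset += msg_len;            // increment offset by msg_len
    if (offset > N) {             // overflow:
        buf_idx ^= 1;             //   switch buffer
        a = 0;                    //   set a to 0
        offset = msg_len;         //   set offset to msg_len
    }
    return &buffers[buf_idx][a];  // caller copies msg_len bytes here
}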

How can I do these operations in an atomic manner without using locks? Is it possible to do so?

EDIT:

I don't have to use 2 buffers. What I want to do is "Collect log messages from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached"

  • You'll start by using add-and-fetch using GCC atomics. A good place to start: https://stackoverflow.com/questions/2353371/how-to-do-an-atomic-increment-and-fetch-in-c – Steve Friedl Jul 05 '20 at 16:57
  • One thing to keep in mind is that lock-free does not mean 0-cost. There are many cases where using mutexes can be far more efficient. – Thomas Jager Jul 05 '20 at 16:59
  • Updated with part of an idea for using a single circular buffer. – Peter Cordes Jul 06 '20 at 06:33

2 Answers


re: your edit:

I don't have to use 2 buffers. What I want to do is: Collect log messages from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached

A lock-free circular buffer could maybe work, with the reader collecting all data up to the last written entry. Extending an existing MPSC or MPMC queue based on using an array as a circular buffer is probably possible; see below for hints.

Verifying that all entries have been fully written is still a problem, though, as are variable-width entries. Doing that in-band with a length + sequence number would mean you couldn't just send the byte-range to the server, and the reader would have to walk through the "linked list" (of length "pointers") to check the sequence numbers, which is slow when they inevitably cache miss. (And can possibly false-positive if stale binary data from a previous time through the buffer happens to look like the right sequence number, because variable-length messages might not line up the same way.)

Perhaps a secondary array of fixed-width start/end-position pairs could be used to track "done" status by sequence number. (Writers store a sequence number with a release-store after writing the message data. Readers seeing the right sequence number know that data was written this time through the circular buffer, not last time. Sequence numbers provide ABA protection vs. a "done" flag that the reader would have to unset as it reads. The reader can indicate its read position with an atomic integer.)
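A rough sketch of the bookkeeping that idea implies, just to make it concrete (all names and sizes here are invented):

// Invented names/sizes; writers fill start/len, then release-store seq.
// A reader that acquire-loads seq and sees the value it expects for this
// trip around the ring knows the message data has been fully written.
#include <stdatomic.h>
#include <stdint.h>

#define LOG_BYTES   (1u << 16)    // circular byte buffer for message data
#define LOG_ENTRIES 1024          // secondary array: one record per message

struct entry_record {
    uint32_t start;               // byte offset of the message in the data ring
    uint32_t len;                 // message length
    _Atomic uint64_t seq;         // release-stored last; gives ABA protection
};

struct log_ring {
    char data[LOG_BYTES];
    struct entry_record done[LOG_ENTRIES];
    _Atomic uint64_t claim_pos;   // bytes claimed by writers (monotonic)
    _Atomic uint64_t read_pos;    // bytes consumed by the reader
};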

I'm just brainstorming ATM, I might get back to this and write up more details or code, but I probably won't. If anyone else wants to build on this idea and write up an answer, feel free.


It might still be more efficient to do some kind of non-lock-free synchronization that makes sure all writers have passed a certain point. Or if each writer stores the position it has claimed, the reader can scan that array (if there are only a few writer threads) and find the lowest not-fully-written position.

I'm picturing that a writer should wake the reader (or even perform the task itself) after detecting that its increment has pushed the used space of the queue up past some threshold. Make the threshold a little higher than you normally want to actually send with, to account for partially-written entries from previous writers not actually letting you read this far.


If you are set on switching buffers:

I think you probably need some kind of locking when switching buffers. (Or at least stronger synchronization to make sure all claimed space in a buffer has actually been written.)

But within one buffer, I think lockless is possible. Whether that helps a lot or a little depends on how you're using it. Bouncing cache lines around is always expensive, whether that's just the index, or whether that's also a lock plus some write-index. And also false sharing at the boundaries between two messages, if they aren't all 64-byte aligned (to cache line boundaries.)


The biggest problem is that the buffer-number can change while you're atomically updating the offset.

It might be possible with a separate offset for each buffer, and some extra synchronization when you change buffers.

Or you can pack the buffer-number and offset into a single 64-bit struct that you can attempt to CAS with atomic_compare_exchange_weak. That can let a writer thread claim that amount of space in a known buffer. You do want CAS, not fetch_add because you can't build an upper limit into fetch_add; it would race with any separate check.

So you read the current offset, check there's enough room, then try to CAS with offset+msg_len. On success, you've claimed that region of that buffer. On fail, some other thread got it first. This is basically the same as what a multi-producer queue does with a circular buffer, but we're generalizing to reserving a byte-range instead of just a single entry with CAS(&write_idx, old, old+1).
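A minimal sketch of that CAS loop, assuming two fixed buffers of size N and packing {bufnum, offset} into one 64-bit atomic struct (names and sizes are made up, and the buffer-switch branch still hand-waves the "has everything in the old buffer been written?" problem discussed below):

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define N 4096
static char buffers[2][N];

struct pos { uint32_t bufnum; uint32_t offset; };   // packs into 64 bits
static _Atomic struct pos cur;                      // starts as {0, 0}

// Claim msg_len bytes; returns a pointer to the start of the claimed region,
// or NULL if the message can never fit in one buffer.
static char *claim(size_t msg_len)
{
    if (msg_len > N)
        return NULL;
    struct pos old = atomic_load_explicit(&cur, memory_order_relaxed);
    for (;;) {
        struct pos new = old;
        if (old.offset + msg_len > N) {
            // Buffer full: switch buffers and start at offset 0.
            // (Real code needs extra synchronization here to know when all
            // claimed space in the old buffer has actually been written.)
            new.bufnum = old.bufnum ^ 1;
            new.offset = (uint32_t)msg_len;
        } else {
            new.offset = old.offset + (uint32_t)msg_len;
        }
        if (atomic_compare_exchange_weak_explicit(&cur, &old, new,
                memory_order_acq_rel, memory_order_relaxed))
            return &buffers[new.bufnum][new.offset - msg_len];
        // CAS failure reloads `old`; retry with the updated value.
    }
}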

(Maybe possible to use fetch_add and abort if the final offset+len you got goes past the end of the buffer. If you can avoid doing any fetch_sub to undo it, that could be good, but it would be worse if you had multiple threads trying to undo their mistakes with more modifications. That would still leave the possible problem of a large message stopping other small messages from packing into the end of a buffer, given some orderings. CAS avoids that because only actually-usable offsets get swapped in.)

But then you also need a mechanism to know when that writer has finished storing to that claimed region of the buffer. So again, maybe extra synchronization around a buffer-change is needed for that reason, to make sure all pending writes have actually happened before we let readers touch it.

An MPMC queue using a circular buffer (e.g. Lock-free Progress Guarantees) avoids this by only having one buffer, and giving writers a place to mark each write as done with a release-store, after they've claimed a slot and stored into it. Having fixed-size slots makes this much easier; variable-length messages would make it non-trivial or maybe not viable at all.
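For comparison, the fixed-size-slot version of "claim a slot, store into it, then mark it done with a release-store" can be sketched like this (not the code from the linked Q&A; names are invented, and it omits the full/empty handling a real queue needs to stop writers from lapping the reader):

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS     256
#define SLOT_BYTES 128

struct slot {
    _Atomic uint64_t seq;         // becomes ticket+1 once the payload is written
    char payload[SLOT_BYTES];
};

static struct slot slots[NSLOTS];
static _Atomic uint64_t next_ticket;

static void publish(const char *msg, size_t len)
{
    if (len > SLOT_BYTES)
        len = SLOT_BYTES;                           // fixed-size slots: truncate
    uint64_t ticket = atomic_fetch_add_explicit(&next_ticket, 1,
                                                memory_order_relaxed);
    struct slot *s = &slots[ticket % NSLOTS];
    memcpy(s->payload, msg, len);
    // Release-store marks the slot "done"; a reader doing an acquire load that
    // sees seq == ticket + 1 knows the memcpy above has completed.
    atomic_store_explicit(&s->seq, ticket + 1, memory_order_release);
}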

The "claim a byte-range" mechanism I'm proposing is very much what lock-free array-based queues, to, though. A writer tries to CAS a write-index, then uses that claimed space.


Obviously all of this would be done with C11 #include <stdatomic.h> for _Atomic size_t offsets[2], or with GNU C builtin __atomic_...

Peter Cordes

I believe this is not solvable in a lock-free manner, unless you're only ruling out OS-level locking primitives and can live with brief spin locks in application code (which would be a bad idea).

For discussion, let's assume your buffers are organized this way:

#define MAXBUF 100 

struct mybuffer {
    char data[MAXBUF];
    int  index;
};

struct mybuffer Buffers[2];
int    currentBuffer = 0;  // switches between 0 and 1

Though parts can be done with atomic-level primitives, in this case the entire operation has to be done atomically, so it's really one big critical section. I cannot imagine any compiler offering a unicorn primitive for this.

Looking at the GCC __atomic_add_fetch() primitive: it adds a given value (the message size) to a variable (the current buffer's write index) and returns the new value; this way you could test for overflow.

Looking at some rough code that is not correct:

   // THIS IS ALL WRONG!

   int oldIndex = Buffers[currentBuffer].index;

   if (__atomic_add_fetch(&Buffers[currentBuffer].index, mysize, __ATOMIC_SEQ_CST) > MAXBUF)
   {
      // overflow, must switch buffers
      // do same thing with new buffer
      // recompute oldIndex
   }
   // copy your message into Buffers[currentBuffer] at oldIndex

This is wrong in every way, because at almost every point some other thread could sneak in and change things out from under you, causing havoc.

What if your code grabs the oldIndex that happens to be from buffer 0, but then some other thread sneaks in and changes the current buffer before your if test even gets to run?

The __atomic_add_fetch() would then be allocating data in the new buffer but you'd copy your data to the old one.

This is the NASCAR of race conditions; I do not see how you can accomplish this without treating the whole thing as a critical section, making the other threads wait their turn.

void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF);  // avoid danger

    // ENTER CRITICAL SECTION

    struct mybuffer *buf = &Buffers[currentBuffer];

    // is there room in this buffer for the entire message?
    // if not, switch to the other buffer.
    //
    // QUESTION: do messages have to fit entirely into a buffer
    // (as this code assumes), or can they be split across buffers?

    if ((buf->index + n) > MAXBUF)
    {
        // QUESTION: there is unused data at the end of this buffer,
        // do we have to fill it with NUL bytes or something?

        currentBuffer = (currentBuffer + 1) % 2; // switch buffers
        buf = &Buffers[currentBuffer];
    }

    int myindex = buf->index;
    buf->index += n;

    // copy your data into the buffer at myindex;

    // LEAVE CRITICAL SECTION
}

We don't know anything about the consumer of this data, so we can't tell how it gets notified of new messages, or if you can move the data-copy outside the critical section.

But everything inside the critical section MUST be done atomically, and since you're using threads anyway, you may as well use the primitives that come with thread support. Mutexes probably.
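For example, the ENTER/LEAVE markers above could be filled in with a pthread mutex along these lines (a sketch that mirrors the definitions from the snippet above; buf_lock and the memcpy are my additions, not the OP's requirements):

// Hedged sketch: the same critical section realized with a pthread mutex.
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAXBUF 100

struct mybuffer {
    char data[MAXBUF];
    int  index;
};

static struct mybuffer Buffers[2];
static int currentBuffer = 0;
static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF);  // avoid danger

    pthread_mutex_lock(&buf_lock);               // ENTER CRITICAL SECTION

    struct mybuffer *buf = &Buffers[currentBuffer];
    if ((buf->index + n) > MAXBUF)               // no room: switch buffers
    {
        currentBuffer = (currentBuffer + 1) % 2;
        buf = &Buffers[currentBuffer];
    }
    int myindex = buf->index;
    buf->index += n;
    memcpy(buf->data + myindex, msg, n);         // copy the message in

    pthread_mutex_unlock(&buf_lock);             // LEAVE CRITICAL SECTION
}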

One benefit of doing it this way, in addition to avoiding race conditions, is that the code inside the critical section doesn't have to use any of the atomic primitives and can just be ordinary (but careful) C code.

An additional note: it's possible to roll your own critical section code with some interlocked exchange shenanigans, but this is a terrible idea because it's easy to get wrong, makes the code harder to understand, and avoids tried-and-true thread primitives designed for exactly this purpose.

Steve Friedl
  • There's maybe something you can do by capturing the return value of `__atomic_add_fetch`, so you know which window of the buffer you ended up getting. `int oldIndex = Buffers[current]->index;` separate from the fetch_add is clearly a bug. You know what you're adding, so given a return value from `__atomic_add_fetch` or `int oldindex = __atomic_fetch_add`, you can work out the correct start and end positions that you've atomically claimed. This has to be on a per-buffer basis, not one global index, else you need to CAS a struct of `{bufnum, index}`. – Peter Cordes Jul 06 '20 at 03:26
  • TL:DR: I think you're rejecting it too quickly based on problems with your attempt which are actually fixable. I think lockless within each buffer is possible, with locking only needed when swapping buffers. – Peter Cordes Jul 06 '20 at 03:26
  • @PeterCordes - I'm rejecting a fully lock-free approach because I don't think it's possible, which is what the OP asked about, so I didn't really even try for a less-locking approach. It may well be possible to reduce the locking to the case you noted, but I don't think we know enough to read the OP's mind about the wider environment. Notifying a consumer of these buffers is going to be a thing as well. I'd love to see sample code that exhibited the less-locking behavior you're thinking of. – Steve Friedl Jul 06 '20 at 03:53
  • I'm picturing that the writers (while filling a buffer) operate pretty much like a standard lock-free queue, claiming a byte-range in the buffer just like a normal lock-free queue writer would claim one write slot. (e.g. like the MPMC queue in [this Q&A](//stackoverflow.com/q/45907210/lock-free-progress-guarantees). Of course it uses a one buffer as a circular buffer.) I haven't tried to think through how to actually handle the transitions between buffers :P, rather than the usual full/empty detection for a MPMC queue (which also gives readers a way to wait for writes to claimed entries). – Peter Cordes Jul 06 '20 at 04:26
  • I don't have to use 2 buffers. What I want to do is "Collect messages from different threads into a buffer. Send the buffer to a server once the buffer is full" – Prabhakar Tayenjam Jul 06 '20 at 04:41