single-reader single-writer fix-sized ringbuf, without lock and atomic variable, is it always safe for any CPU arch?

Question

Was asked to wrap a blocking call to non-blocking during an interview today. So we (the interviewer and me) decided to achieve that by adding a background thread inside the nonblocking API. Here is the code I wrote:

 30 #define ARRAY_SIZE(x)   (sizeof(x)/sizeof(x[0]))
 31
 32 struct SensorReading records[60*10] = {{0}};
 33 size_t head = 0;
 34 
 35
 36 void * worker_thread(void *arg) {
 37     while (1) {
 38         size_t idx = (head + 1) % ARRAY_SIZE(records);
 39         records[idx] = read_next_sample();
 40         head = idx;
 41     }
 42 }
 43
 44 float get_most_recent_lux() {
 45     static pthread_t worker = -1;
 46     if (-1 == worker) {
 47         struct SensorReading r = read_next_sample(); // This is the blocking call
 48         records[0] = r;
 49         if (-1 == pthread_create(&worker, NULL, worker_thread, NULL)) {
 50             // error handling
 51         }
 52         return r.lux;
 53     }
 54     return records[head].lux;
 55 }

Let me explain a little bit here:

read_next_sample() is the blocking call provided;
line 44 get_most_recent_lux() is the wrapped non-blocking API I need to provide.
Internally, it starts a thread which executes the worker_thread() function defined in line 36.
worker_thread() keeps calling the blocking call and writes data to a ringbuf.
So the reader can read most recent record data from the ringbuf.

Also please be noted:

This programming language used here is C, not C++.
This is a single reader single writer case.
This is different than producer-consumer problem, since the wrapped API get_most_recent_lux() should always return most recent data.

Since this is a single reader single writer case, I believe:

No lock required here.
No atomic values required here. (so the head in line 33 is not declared as atomic values, and I use a normal evaluation operation (head = idx) in line 40).

Question: Is my above statement correct?

The interviewer keeps telling me that my statement is not correct for all CPU arch, so he believes mutex or atomic variable is required here. But I don't think so. I believe, indeed, the single line evaluation C code (head = idx) can be translated to multiple assembly instructions, but only the last assembly instruction is for storing the updated value to the memory. So,

before the last assembly instruction executed, the updated value has not been updated to the memory yet, so the reader will always read the old head value.
after the last assembly instruction executed, the reader will always read the updated head value.
in both case, it is safe, no corruption will happen.
There is no other possibilities. Within a specified time period where only 1 write can happen (let's say change from 1 to 2), the reader can only read either 1 or 2, the reader will never read any value other than 1 or 2, like 0, 3, or 1.5.

Agree? I really can not believe there is any CPU arch that the code won't work. If there is, please educate me. Thanks very much.

Peter Cordes · Answer 1 · 2021-01-22T16:01:07.480

3

You don't need any atomic RMWs or seq_cst, but you do need _Atomic to do release-store and acquire-load to/from head.

That only happens for free on x86 (and SPARC), not other ISAs, and is still unsafe vs. compile-time reordering even when targeting x86. head = idx; could become visible to the other core before the updates to records[idx], letting it read the stale values.

(Well, the records[head].lux load part will actually work on most ISAs like mo_consume, since ISAs other than DEC Alpha guarantee dependency ordering for loads.)

I think there are some other similar Q&As on SO about trying to use non-atomic variables for inter-thread communication. There's little point, just use atomic_store_explicit with memory_order_release - it will compile the same on x86 as a non-atomic store, but with compile time ordering guarantees. So you're not gaining efficiency by avoiding stdatomic.h, if you only use acquire and release. (Except for loads - if you want actual dependency-ordering without a barrier on weakly-ordered ISAs, you have to use relaxed under controlled conditions on weakly-ordered ISAs, because consume is currently semi-deprecated and promotes to acquire in current compilers.) See When to use volatile with multi threading? for more about why hand-rolled atomics work and why they have no advantage.

Also, note that you have no protection against the queue becoming full and overwriting values that haven't been read yet. SPSC queues like this typically have the consumer side update a read index that the writer can check.

edited Jan 22 '21 at 16:01

answered Jan 22 '21 at 15:21

Peter Cordes

328,167
45
605
847

Thanks, Peter. Let me comment on `protection against the queue becoming full and overwriting values that haven't been read yet` first before studying more about the other part of your comment. The protection is not required here, because the wrapped API just needs to provide most recent record. It is fine that the record generated by writer is not consumed in this case. As I said this is different than producer/consumer problem. – Yun Wu Jan 22 '21 at 19:11
1

@YunWu: Oh I see, so kind of like a seqlock except an old value is always available while a new value is being written. Not really a queue, just a latest-value snapshot, with a big enough buffer to make it unlikely (but not impossible) to the writer to wrap around and overwrite this entry while the reader is reading. (Everything I said about needing `_Atomic` head with release and acquire still applies, even if you aren't trying to guard against the writer wrapping around.) – Peter Cordes Jan 22 '21 at 19:51
2

@YunWu: Still, you have a problem, since in case of wrap around, `records[head].lux` may be read at the same time as it is being written. And even if it is of a type that would normally be atomic, or even if it is declared `_Atomic`, the fact that you're writing it by an assignment of the whole `struct SensorReading` means that it need not be written atomically. (Imagine a compiler that uses `memcpy` for struct assignment, which could be a loop of one-byte loads and stores.) So there's still a bug there - you could get corrupted data returned. – Nate Eldredge Jan 22 '21 at 20:12
1

In this particular case, since you only use one member of `struct SensorReading`, you could make the `lux` member `_Atomic` and do the copy in `worker_thread` member by member, so that `lux` is stored atomically. (Though this raises the question of why you have any other members at all. Or for that matter, why even have a ring buffer, just have a single `_Atomic` object containing the most recent `lux` value.) If you really needed several members, then it seems like doing it safely and lock-free would become very difficult. – Nate Eldredge Jan 22 '21 at 20:45
@NateEldredge: Good catch! You are right, struct copy is not atomic! Thanks! – Yun Wu Jan 22 '21 at 21:32
@NateEldredge: Extending the idea of a seqlock ([Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003)) would let you check for possible tearing of the struct you're reading, e.g. maybe by storing a sequence number *in* each bucket next to / as part of the struct, which you read last. Or reading `head` before and after reading the struct, to check that `2nd - 1st < size`. It wouldn't avoid ISO C UB, but there's no good way to avoid that for a seqlock without making everything atomic (and thus defeating wide loads/stores.) – Peter Cordes Jan 22 '21 at 21:33
Maybe the data structure should be designed like this: `#define TABLE_SIZE (60*10)` `int indices[TABLE_SIZE] = {[0 ... TABLE_SIZE-1] = '-1'}; // all -1` `struct SensorReading records[TABLE_SIZE] = {{0}};` Then writer does `size_t idx = (head + 1) % TABLE_SIZE;` `records[idx] = read_next_sample();` `indices[idx] = idx;` `head = idx;` and reader reads `records[indices[head]].lux`. But from @PeterCordes, seems `_Atomic` is still required here. – Yun Wu Jan 22 '21 at 21:35
1

The advantage of this vs. a regular seqlock is that even if the writer is spamming writes, you're likely to avoid needing a retry. So this could be useful for something too large to be efficiently `_Atomic`. You're 100% correct that if `.lux` is small enough (like 64-bit on most ISAs, e.g. `is_always_lock_free`), it should just be a single atomic global var. – Peter Cordes Jan 22 '21 at 21:36
Thanks, @PeterCordes. Good to learn [Seqlock](https://en.wikipedia.org/wiki/Seqlock) . – Yun Wu Jan 22 '21 at 21:50

single-reader single-writer fix-sized ringbuf, without lock and atomic variable, is it always safe for any CPU arch?

1 Answers1