
I am studying the implementation of seqlocks. However, every source I found implements them differently.

Linux Kernel

The Linux kernel implements it like this:

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
    unsigned ret;

repeat:
    ret = READ_ONCE(s->sequence);
    if (unlikely(ret & 1)) {
        cpu_relax();
        goto repeat;
    }
    return ret;
}

static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
{
    unsigned ret = __read_seqcount_begin(s);
    smp_rmb();
    return ret;
}

Basically, the reader side uses a volatile read (READ_ONCE) plus a read barrier (smp_rmb) providing acquire semantics.

When used, the subsequent data reads are plain, unprotected loads:

struct Data {
    u64 a, b;
};

// ...
read_seqcount_begin(&seq);
u64 v1 = d.a, v2 = d.b;
// ...

rigtorp/Seqlock

RIGTORP_SEQLOCK_NOINLINE T load() const noexcept {
  T copy;
  std::size_t seq0, seq1;
  do {
    seq0 = seq_.load(std::memory_order_acquire);
    std::atomic_signal_fence(std::memory_order_acq_rel);
    copy = value_;
    std::atomic_signal_fence(std::memory_order_acq_rel);
    seq1 = seq_.load(std::memory_order_acquire);
  } while (seq0 != seq1 || seq0 & 1);
  return copy;
}

The load of the data is still performed without atomic operations or other protection. However, atomic_signal_fences with acquire-release semantics are placed before and after the read, in contrast to the smp_rmb with acquire semantics in the kernel.

Amanieu/seqlock (Rust)

pub fn read(&self) -> T {
    loop {
        // Load the first sequence number. The acquire ordering ensures that
        // this is done before reading the data.
        let seq1 = self.seq.load(Ordering::Acquire);

        // If the sequence number is odd then it means a writer is currently
        // modifying the value.
        if seq1 & 1 != 0 {
            // Yield to give the writer a chance to finish. Writing is
            // expected to be relatively rare anyways so this isn't too
            // performance critical.
            thread::yield_now();
            continue;
        }

        // We need to use a volatile read here because the data may be
        // concurrently modified by a writer.
        let result = unsafe { ptr::read_volatile(self.data.get()) };

        // Make sure the seq2 read occurs after reading the data. What we
        // ideally want is a load(Release), but the Release ordering is not
        // available on loads.
        fence(Ordering::Acquire);

        // If the sequence number is the same then the data wasn't modified
        // while we were reading it, and can be returned.
        let seq2 = self.seq.load(Ordering::Relaxed);
        if seq1 == seq2 {
            return result;
        }
    }
}

Here there is no separate fence between loading seq and the data; instead, the data is accessed with a volatile read, and the acquire fence comes only before the second load of seq.

Can Seqlocks Get Along with Programming Language Memory Models? (Variant 3)

T reader() {
  int r1, r2;
  unsigned seq0, seq1;
  do {
    seq0 = seq.load(m_o_acquire);
    r1 = data1.load(m_o_relaxed);
    r2 = data2.load(m_o_relaxed);
    atomic_thread_fence(m_o_acquire);
    seq1 = seq.load(m_o_relaxed);
  } while (seq0 != seq1 || seq0 & 1);
  // do something with r1 and r2;
}

Similar to the Rust implementation, but relaxed atomic operations are used on the data instead of ptr::read_volatile.

Arguments in P1478R1: Byte-wise atomic memcpy

This paper claims that:

In the general case, there are good semantic reasons to require that all data accesses inside such a seqlock "critical section" must be atomic. If we read a pointer p as part of reading the data, and then read *p as well, the code inside the critical section may read from a bad address if the read of p happened to see a half-updated pointer value. In such cases, there is probably no way to avoid reading the pointer with a conventional atomic load, and that's exactly what's desired.

However, in many cases, particularly in the multiple process case, seqlock data consists of a single trivially copyable object, and the seqlock "critical section" consists of a simple copy operation. Under normal circumstances, this could have been written using memcpy. But that's unacceptable here, since memcpy does not generate atomic accesses, and is (according to our specification anyway) susceptible to data races.

Currently to write such code correctly, we need to basically decompose such data into many small lock-free atomic subobjects, and copy them a piece at a time. Treating the data as a single large atomic object would defeat the purpose of the seqlock, since the atomic copy operation would acquire a conventional lock. Our proposal essentially adds a convenient library facility to automate this decomposition into small objects.

My question

  1. Which of the above implementations are correct? Which are correct but inefficient?
  2. Can the volatile read of the data be reordered before the acquire load of seq?
Peter Cordes
lz96
  • Correct according to which written specification or unwritten implicit set of assumptions? Technically a data race causes UB. With a volatile access you get the native CPU behavior for a data race (which is well defined). But I don't think mixing volatile, atomics, and fences is well defined; in particular, fences are defined as ordering atomics, not volatile accesses. Stuff might happen to "work", that is, be compiled to code that happens to provide guaranteed correct behavior (provable at the generated asm level), by accident, and might break with a newer, better compiler. – curiousguy Jun 03 '19 at 01:06
  • I think your question requires a book, so I vote too broad, even though it's an interesting question – Stargateur Jun 03 '19 at 11:13
  • @Stargateur Yeah... but I guess I'm going to publish my understanding before it gets closed – lz96 Jun 03 '19 at 17:48
  • @Stargateur Requires a book? – curiousguy Jun 03 '19 at 21:17
  • @curiousguy when a question need an very long answer like the size of a book – Stargateur Jun 03 '19 at 22:25
  • @Stargateur Then reference that book and cite its conclusion. If the book doesn't exist, write the book and get some publicity. :D – curiousguy Jun 03 '19 at 23:17
  • The way I understand the memory model, **all** implementations are wrong (but Rigtorp's is the closest to being correct). An acquiring thread must see all writes that happened before a release [on the same variable] on another thread. Sequentially-consistent ordering imposes a total order. So... `seq_cst` for the first, and `release` for the second increment in `store()` and `acquire` in `load()` --- the rest needs not be atomic stuff, and no fences needed. Might be wrong, but this is how I understand it. – Damon Jul 03 '19 at 18:33
  • @curiousguy "newer, better compiler" is an exaggeration, as compilers have no completely formalized and sound memory model to define what and how to apply their optimizations. You might even need an additional contract (abstracting/simplifying variable assumptions) to get optimal performance and soundness in all cases, because the corner-case detection could be technically too tricky. – Jay-Pi Jan 07 '21 at 17:01
  • @Jay-Pi: ISO C++11 *did* introduce a formal thread-aware memory model. I think it's possible to write a seqlock that's valid within that model, e.g. by using `atomic` sized chunks with mo_relaxed, and a well-placed `atomic_thread_fence(mo_acquire)`. But that would probably defeat some optimizations, especially if the data is big enough to be worth using SIMD vectors to copy, so people instead abuse de-facto normal behaviour. – Peter Cordes May 05 '21 at 17:21
  • Related: [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) (my attempt at a SeqLock in C++, with comments on safety) / [GCC reordering up across load with \`memory\_order\_seq\_cst\`. Is this allowed?](https://stackoverflow.com/q/36958372) (gcc bug affecting seqlocks) / [Is the transformation of fetch\_add(0, memory\_order\_relaxed/release) to mfence + mov legal?](https://stackoverflow.com/q/64976395) ("N4455 no sane compiler would optimize atomics" SeqLock example) – Peter Cordes May 05 '21 at 17:37

1 Answer


Your quotes from Linux seem wrong.

According to https://www.kernel.org/doc/html/latest/locking/seqlock.html the read process is:

Read path:

do {
        seq = read_seqcount_begin(&foo_seqcount);

        /* ... [[read-side critical section]] ... */

} while (read_seqcount_retry(&foo_seqcount, seq));

If you look at the GitHub link posted in the question, you'll find a comment describing nearly the same process.

It seems that you are only looking at one part of the read process. The linked file implements the primitives you need to build readers and writers, but not the readers/writers themselves.

Also notice this comment from the top of the file:

* The seqlock seqcount_t interface does not prescribe a precise sequence of
* read begin/retry/end. For readers, typically there is a call to
* read_seqcount_begin() and read_seqcount_retry(), however, there are more
* esoteric cases which do not follow this pattern.
Support Ukraine