
The use case is this:

  • One thread writes data, using SetData()
  • Multiple threads may read data, using GetData()
  • I plan to only use basic data types or structs as T

With these rules in mind, is this thread safe?

    public struct ThreadSafeData<T>
    {
        private T[] dataArray;

        private int setterIndex;        
        private int lastSetIndex;

        public void Init()
        {
            dataArray = new T[2];
        }

        public void SetData(T data)
        {
            dataArray[setterIndex] = data;            
            lastSetIndex = setterIndex;

            // Convert 0 to 1, and 1 to 0
        setterIndex = lastSetIndex * -1 + 1;
        }

        public T GetData()
        {            
            return dataArray[lastSetIndex];
        }
    }

UPDATE (requested in comments)

What I would like to achieve is the following: I want to avoid tearing, and I want the reader to always read the last value written by the writer. I tried doing that with a single T field, but then I encountered (what I think is) tearing. For example, in the tests (see below) I always write a Vector2Int with 0,0 or 1,1, but the reader would sometimes read 1,0 when using a single T field. This is why I added the array (and added the "data integrity" check to my tests).

I am using X64 architecture. And this is the Vector2Int I use in my tests: https://docs.unity3d.com/ScriptReference/Vector2Int.html
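For context, a minimal sketch of the single-field version described above, which can exhibit this tearing (the type name is made up for illustration):

```csharp
// Naive single-field version (NOT thread safe for multi-word T).
// A struct larger than the native word size (e.g. two ints plus a bool)
// is not written atomically, so a reader can observe a half-written value.
public struct NaiveData<T>
{
    private T _data;

    public void SetData(T data) => _data = data; // may tear for large T
    public T GetData() => _data;                 // may observe a torn value
}
```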

Questions

  1. How do I know if this is thread safe (if it is)? I have run tests for quite a while. But how do I know for sure?

  2. Do you know a better solution for this use case? Please let me know!

Tests

I am making a game in Unity and have run tests where the "writing thread" runs at 30, 60 or 90 fps, doing up to 300 writes per frame, and a "reading thread" runs from 30 to 300 fps, doing 1 read per frame.

The test data (T) I used was a struct with a Vector2Int and a bool. To check the data integrity, the "reader" checked that the x and y of the Vector2Int are 1 when the bool is true, and 0 when it is false (it throws an error when this is wrong).
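The actual Unity test code is not shown; a hypothetical reconstruction of the payload and the integrity check described above might look like this:

```csharp
// Hypothetical reconstruction of the test payload and integrity check
// (the real test uses Unity's Vector2Int; plain ints stand in for it here).
public struct TestPayload
{
    public int X;     // stand-in for Vector2Int.x
    public int Y;     // stand-in for Vector2Int.y
    public bool Flag; // writer sets (1, 1, true) or (0, 0, false)
}

public static class IntegrityCheck
{
    // A mismatch such as (1, 0, true) indicates a torn read.
    public static void Check(TestPayload p)
    {
        int expected = p.Flag ? 1 : 0;
        if (p.X != expected || p.Y != expected)
            throw new System.InvalidOperationException(
                $"Torn value: ({p.X}, {p.Y}, {p.Flag})");
    }
}
```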

I ran these tests for about an hour and never got any errors, but I am not sure whether that means this always works correctly.

(P.S. I don't really care whether this type ends up a struct or a class; I am not sure yet which will work best for me.)

joram
  • Unity seems to be irrelevant to the question, it's all about C# memory semantics. – Charlieface Mar 06 '23 at 21:06
  • No this is absolutely not thread-safe, for multiple reasons: cache lines are not necessarily coherent across cores, except in x86. Even then there will usually be data cached in registers. Without acquire/release semantics (memory barriers) there is nothing to stop a thread reordering instructions, which can cause incorrect results from the perspective of other threads. Etc – Charlieface Mar 06 '23 at 21:10
  • What is the intended behavior of this class? It seems that the readers are always reading the latest `T` value written by the writer. If this is the intended behavior, then why the writer writes to an array, alternating between the two slots of the array, instead of writing directly into single `T` field? Is this a trick intended as a mechanism that ensures atomicity, in other words you are hoping to prevent [tearing](https://stackoverflow.com/questions/9008454/simulate-tearing-a-double-in-c-sharp) of the `T` value that the readers are reading? – Theodor Zoulias Mar 06 '23 at 22:41
  • Agree with @TheodorZoulias If the intent is to always read the latest value, what is the point? Threads will *always* read the latest value. Thread safety is all about when you want to *not* read the latest value. – marsze Mar 06 '23 at 23:18
  • @Charlieface: All multi-core systems that a single OS will run threads across have coherent cache; that's not unique to x86. See https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ That's why the Linux kernel can roll its own atomics with inline asm and C `volatile` (which unlike C# has no ordering, just guarantees that a load or store will happen in the asm.) Not that I recommend that for new C projects, but see [this](https://stackoverflow.com/questions/4557979/when-to-use-volatile-with-multi-threading/58535118#58535118) if you're curious how HW works. – Peter Cordes Mar 07 '23 at 02:21
  • @marsze: Threads will always read the current value that's visible to them. But calling that the "latest" is asking for trouble; defining that concept at all can be problematic, and requires understanding of the fact that multiple threads can be executing store instructions on a location, which will eventually commit to coherent cache in some order (the modification order of a variable), and that stores executed but not yet drained from the store buffer aren't visible. (And that non-`volatile` assignments can get optimized out of loops, using registers in the loop with one store at the end.) – Peter Cordes Mar 07 '23 at 02:28
  • @marsze And of course there's the problem that without any ordering guarantees between memory accesses one thread makes, "current" for one variable might be earlier than the time some earlier statement in the C# code read a different shared variable. Anyway, "latest value" is [something people often get hung up on](https://stackoverflow.com/questions/53032354/does-atomic-read-guarantees-reading-of-the-latest/73170405#73170405); if you use `volatile` or `Volatile.Write`, and `Volatile.Read`, values will be visible quickly between threads, as fast as the HW can do it, just worry about order – Peter Cordes Mar 07 '23 at 02:32
  • So I guess this is a bad idea. Thanks for all the responses :) @TheodorZoulias, yes, I want to avoid tearing and want the reader to always read the last value written by the writer. I tried doing that with a single T field, but then I encountered (what I think is) tearing. For example, in the tests I always write a Vector2Int with 0,0 or 1,1. But the reader would sometimes read 1,0 when using a single T field. This is why I added the "data integrity" check to my tests. Could it work with a single T field? – joram Mar 07 '23 at 07:11
  • @joram can you please show `Vector2Int`? Also what architecture are you using, x32, x64? – Guru Stron Mar 07 '23 at 07:14
  • @GuruStron, sure, it's Unity's Vector2Int, so this is all I know: https://docs.unity3d.com/ScriptReference/Vector2Int.html. The architecture is x64 – joram Mar 07 '23 at 07:16
  • @TheodorZoulias question updated! Thanks for the advice. – joram Mar 07 '23 at 07:35

2 Answers


I managed to reproduce a torn value with your ThreadSafeData implementation, using as T the type ValueTuple<int, int, int, int, int, int, int, int, int>, and mutating it using the same Random.Next() value for all nine fields of the tuple. Demo. Hundreds of torn T values per second are observed. Like this value:

(373331022, 373331022, 373331022, 373331022, 373331022, 373331022, 373331022, 1480972221, 1480972221)

Here are some alternative implementations of the ThreadSafeData<T> struct that are truly safe. Using the lock statement is definitely thread-safe, and also very simple:

public struct ThreadSafeData<T>
{
    private readonly object _locker = new();
    private T _data = default;
    public ThreadSafeData() { }
    public void SetData(T data) { lock (_locker) _data = data; }
    public T GetData() {  lock (_locker) return _data; }
}

The cost of an uncontended lock is on the order of ~20 nanoseconds, so it's quite cheap, but if your readers call GetData very frequently you might want a faster solution. That faster solution is not the ReaderWriterLockSlim, though:

public struct ThreadSafeData<T>
{
    private readonly ReaderWriterLockSlim _lock = new();
    private T _data = default;
    public ThreadSafeData() { }
    public void SetData(T data)
    {
        _lock.EnterWriteLock();
        try { _data = data; }
        finally { _lock.ExitWriteLock(); }
    }
    public T GetData()
    {
        _lock.EnterReadLock();
        try { return _data; }
        finally { _lock.ExitReadLock(); }
    }
}

This is actually a little slower than the lock, because the work that is performed by the writer is too lightweight. The ReaderWriterLockSlim is advantageous when the writer does chunky work.

A better alternative regarding performance, but worse regarding memory allocations, is to store the T value in a volatile object field:

public struct ThreadSafeData<T>
{
    private volatile object _data = default(T);
    public ThreadSafeData() { }
    public void SetData(T data) => _data = data;
    public T GetData() => (T)_data;
}

This will cause boxing in case T is a value type, and a new box will be allocated on each SetData operation. According to my experiments the size of the box is 16 bytes + the size of the T. The performance boost compared to the lock (in scenarios with high contention) is around 10x.
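A usage sketch of the volatile-object approach (the struct is repeated in its essentials so the snippet compiles on its own). Because each SetData boxes the value into a fresh object, a reader sees either the complete old box or the complete new one, never a mixture:

```csharp
using System;
using System.Threading;

// Repeated from above so this snippet is self-contained.
public struct ThreadSafeData<T>
{
    private volatile object _data = default(T);
    public ThreadSafeData() { }
    public void SetData(T data) => _data = data; // boxes value types
    public T GetData() => (T)_data;              // unboxes a complete value
}

public static class Demo
{
    public static void Main()
    {
        var shared = new ThreadSafeData<(int A, int B)>();

        var writer = new Thread(() =>
        {
            for (int i = 1; i <= 100_000; i++)
                shared.SetData((i, i)); // one box allocation per call
        });

        var reader = new Thread(() =>
        {
            for (int i = 0; i < 100_000; i++)
            {
                var (a, b) = shared.GetData();
                if (a != b) throw new Exception($"Torn value: ({a}, {b})");
            }
        });

        writer.Start(); reader.Start();
        writer.Join(); reader.Join();
        Console.WriteLine("No torn values observed.");
    }
}
```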


Seqlock: Peter Cordes mentioned in their answer the interesting idea of the Seqlock. Below is an implementation of this idea. My tests don't reveal any tearing. Most likely the implementation below is safe for a single writer - multiple readers scenario, but I wouldn't bet my life on it. The advantage of this approach over the above volatile object, is that it doesn't allocate memory.

public struct ThreadSafeData<T> // Safe with a single writer only
{
    private volatile int _seq;
    private T _data;

    public void SetData(T data)
    {
        _seq++;
        Interlocked.MemoryBarrier();
        _data = data;
        _seq++;
    }

    public T GetData()
    {
        SpinWait spinner = default;
        while (true)
        {
            int seq1 = _seq;
            if ((seq1 & 1) != 0) goto spin;
            T data = _data;
            Interlocked.MemoryBarrier();
            int seq2 = _seq;
            if (seq1 != seq2) goto spin;
            return data;
        spin:
            spinner.SpinOnce();
        }
    }
}

According to my experiments the performance of GetData() is about half that of the corresponding volatile object-based implementation, but still many times faster than the lock-based one, under the condition that SetData is called infrequently. Otherwise, if the writer calls SetData in a tight loop, the readers will barely be able to read any value at all: the _seq will almost always differ before and after reading the _data, resulting in endless spinning.

Theodor Zoulias
  • Okay. This is really impressive. You actually _proved_ what I did isn't thread safe _and_ showed me a working example. Thank you very much! I also would like to thank all the other people that responded. I had my doubts posting this, but I am happy I did. It taught me a lot about the complexity of multi threading (and tearing). Thanks again! – joram Mar 07 '23 at 18:26
  • Nice, your final version using `volatile object _data` looks like what I was talking about in the section of my answer talking about RCU. Garbage-collected languages like Java and C# solve the deallocation problem, making that pattern vastly easier to use for real than in C++. It's good especially when writes are infrequent but reads are common. – Peter Cordes Mar 07 '23 at 20:27
  • @joram's original idea of a small array and an index to the currently valid one could also work if writes are infrequent enough that even slow readers will almost certainly complete before the writer wraps around to use that slot again. Hmm, and maybe doing tear detection by using the upper bits of the index as a sequence number; so readers do `int idx = foo.idx;` / copy `foo.data[idx % 8]` / retry if `idx != foo.idx` (where `foo.idx` is `volatile` so those two reads aren't optimized out; but probably also needs a LoadLoad barrier between the copy and the 2nd read, like a seqlock) – Peter Cordes Mar 07 '23 at 20:30
  • Cool, and good plan to test with infrequent writes since that's the use-case where SeqLock is appropriate. You only need one `Interlocked.MemoryBarrier();` per read attempt; a `volatile` read has sufficient ordering to keep the first read of `_seq` ahead of the read of `_data`. It's only the LoadLoad ordering of `_data` before the second `_seq` load that needs an extra barrier, sort of like an acquire load of `_data`. Same deal for the writer only needing one barrier, before the non-volatile write of `_data`. – Peter Cordes Mar 13 '23 at 16:49
  • @PeterCordes so is it safe to remove the first of the two `Interlocked.MemoryBarrier` calls in the `GetData` method? – Theodor Zoulias Mar 13 '23 at 16:51
  • Yes, that was my point. And the second barrier in `SetData`. Also, if `_seq++` is an atomic RMW on a C# `volatile`, that's unnecessary because you only have a single writer. Read it once, and store `tmp+1` then `tmp+2`. Probably best to do that anyway, since you don't need a volatile re-read of `_seq` after storing `_data`. (If you were using `_seq` as a spinlock between multiple writers, you'd do it with `lock bts` or `lock cmpxchg` (Interlocked.Or or a CAS?) to set the low bit to claim the lock as a writer, then store the `tmp+2` value to unlock, or use a separate mutex.) – Peter Cordes Mar 13 '23 at 16:55
  • `(seq2 & 1) != 0` is redundant; it can't be equal to `seq1` if that's true because we already bailed if `seq1` was odd. – Peter Cordes Mar 13 '23 at 16:57
  • @PeterCordes answer updated. TBH for my level of knowledge this approach is like walking on an invisible tightrope! – Theodor Zoulias Mar 13 '23 at 17:00
  • It can help to look at the asm, perhaps? And/or my answer on [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) has detailed comments and explanation about why `atomic_thread_fence(std::memory_order_acquire);` is needed at once place, but `seqcount.load(std::memory_order_acquire);` (aka `volatile` read) can be used at the other place. – Peter Cordes Mar 13 '23 at 17:03
  • @PeterCordes reading might help, but having the approval of an expert is more reassuring. :-) – Theodor Zoulias Mar 13 '23 at 17:05
  • Fair enough, I don't mind taking a look at stuff like this, especially since I'd been hoping in my own answer that someone would take a stab at implementing a SeqLock. Just wanted to point you at additional things you could have done if I hadn't been online to answer right away. – Peter Cordes Mar 13 '23 at 17:32
  • @PeterCordes btw I would like your opinion about this code: `while (Interlocked.CompareExchange(ref _lock, 1, 0) != 0) Thread.Yield(); try { /* Do stuff */ } finally { _lock = 0; }`. Is the `/* Do stuff */` protected from concurrent access? The `_lock` is a non-volatile `int` field. I found [code like this](https://github.com/alastairtree/LazyCache/blob/633b5702ddbc3e013aa9a3893d7b4dc0c807943f/LazyCache/CachingService.cs#L120) in the quite popular LazyCache library, and I am highly skeptical about its correctness. What's the worst that could happen by zeroing the `_lock` without fences? – Theodor Zoulias Mar 14 '23 at 01:08
  • @TheodorZoulias: Looks buggy to me; the `_lock = 0` is not a release store so part of `Do stuff` can reorder out of the critical section, potentially becoming visible only after another thread acquires the lock. It will very likely compile to correct asm on x86, where all asm stores have "release" semantics so only compile-time reordering could break things, but not on ARM / AArch64 or other mainstream ISAs. A basic spinlock needs an acquire RMW to get exclusive ownership, and a release store to unlock, hence the names. That's sufficient to keep `Do stuff` contained inside the critical section – Peter Cordes Mar 14 '23 at 02:33
  • @PeterCordes I just posted a question about this [here](https://stackoverflow.com/questions/75728657/is-it-safe-to-use-the-interlocked-compareexchange-as-a-lock-only-for-enter-not). :-) – Theodor Zoulias Mar 14 '23 at 02:41
  • BTW, your answer doesn't highlight the fact that both the RCU (`volatile object _data`) and SeqLock strategies have perfect scaling for more parallel readers. The readers are truly read-only so there's no contention between them. This is one of the biggest qualitative differences from any kind of locking; even a readers-writers lock still needs every reader to do an atomic RMW on some shared memory location (the lock itself). vs. truly read only they can all hit in their private L1d caches if a recent write hasn't invalidated them. – Peter Cordes Mar 15 '23 at 16:34
  • This is why it's "many times" faster than `lock` for a multi-readers tests when you have multiple read threads reading in a tight loop, the best case to highlight the advantages of SeqLock or RCU. – Peter Cordes Mar 15 '23 at 16:36

2 int elements is only 64 bits; mainstream x86 and ARM CPUs can do 64-bit atomic stores and loads (even when running in 32-bit mode), so the optimal approach would be convincing your compiler to do that. (Like C++ std::atomic<my_struct> with memory_order_relaxed, or in C# perhaps just packing both members into a uint64_t, if you can get the necessary visibility guarantees for your code as well as atomicity.)
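Sketched in C#, under the assumption that a 64-bit `Volatile.Read`/`Volatile.Write` of a `long` field is atomic on the target hardware (type and member names here are made up), the packing idea might look like:

```csharp
using System.Threading;

// Sketch: two 32-bit halves packed into one long, moved with a single
// 64-bit volatile store/load, so the pair can never tear.
public struct PackedPoint
{
    private long _packed; // high 32 bits = X, low 32 bits = Y

    public void Set(int x, int y) =>
        // (uint)y zero-extends, so y's sign bit can't clobber x's bits.
        Volatile.Write(ref _packed, ((long)x << 32) | (uint)y);

    public (int X, int Y) Get()
    {
        long p = Volatile.Read(ref _packed);
        return ((int)(p >> 32), (int)p); // the low cast truncates to Y
    }
}
```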

Remember that plain variables don't guarantee visibility across threads unless there's other synchronization: reading a variable in a loop can optimize to reading it into a register once, and using that value every time. Any time a load or store does happen, it's atomic, but without volatile or Volatile.Read / .Write, nothing guarantees that loads / stores happen in the asm as often as they do in the source.
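A minimal illustration of that pitfall (the class and member names are hypothetical; the fix is the `Volatile.Read` in the loop):

```csharp
using System.Threading;

// Without Volatile.Read, the JIT is allowed to hoist the load of _done
// out of the loop and spin forever on a stale register value.
public class Worker
{
    private bool _done;

    public void Stop() => Volatile.Write(ref _done, true);

    public void SpinUntilStopped()
    {
        // Volatile.Read forces a fresh load from memory on every iteration.
        while (!Volatile.Read(ref _done)) { }
    }
}
```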


In general, you might be looking for the SeqLock, where the reader retries until it gets a non-torn read of the payload + sequence number.

If only one thread ever writes, you don't need any atomic RMWs, just atomic stores of the sequence number before/after plain stores of the payload. But you do need strict ordering of the stores in the writer and loads in the reader. And the first load of the sequence number does have to be an acquire load, like Volatile.Read.

I ran these test for about an hour and never got any errors.

If you tested on x86, LoadLoad and StoreStore reordering are impossible, so if your algorithm is equivalent enough to a SeqLock, then it will be safe on x86 if the compiler doesn't reorder operations at compile time. Having two separate "index" (sequence number?) variables would stop the compiler from doing dead-store elimination. Using just plain C# vars without anything equivalent to C++ std::atomic_thread_fence(acquire) / release will break on AArch64, though, where plain variables get weaker run-time ordering than on x86.


Your implementation isn't really like a SeqLock; it's like a 2-element queue depending on a memory_order_consume type of dependency ordering, which is a thing even on weakly-ordered ISAs like AArch64. But if another write happens while a reader is still reading, you'll spoil their data.

So it's like RCU (read copy update), with its advantage over a SeqLock that there's always a consistent state to read. But you reuse the old state on the very next write, so even if you write infrequently, a reader that gets descheduled by the OS could wake up and read inconsistent data if that happens at the same time the writer is writing the same slot it's reading.

RCU can work well in a garbage-collected language like C#, if you allocate a new object and update a reference on write. Garbage collection solves the deallocation problem, which is the hardest part of RCU (knowing when all other threads definitely aren't still reading an old copy of the payload).

You're replacing the whole payload, so not actually copying and modifying e.g. an array or binary tree, so that makes it even easier.
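In C# that pattern can be as simple as publishing a fresh immutable object through a volatile reference (a sketch; the class names are made up):

```csharp
// RCU-style publication: the writer allocates a new immutable snapshot
// and swaps one reference. Readers always see a complete old or new
// object, never a mixture, and the GC reclaims old snapshots once no
// reader still holds them.
public sealed class Snapshot
{
    public readonly int X, Y;
    public Snapshot(int x, int y) { X = x; Y = y; }
}

public class RcuHolder
{
    private volatile Snapshot _current = new Snapshot(0, 0);

    public void Publish(int x, int y) => _current = new Snapshot(x, y);
    public Snapshot Read() => _current; // a reference load can't tear
}
```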


P.S. The only part of C# I know is some memory-model stuff and SIMD intrinsics, because that's where the tags I follow overlap with it. That's why my answer is basically citing C++ equivalents for what's possible. I never planned to write a real answer to this, but my comments kind of became almost an answer.

If someone else wants to write a C# SeqLock implementation or example, please post it as an answer here or on a related C# question.

Peter Cordes
  • *"IDK if spin.SpinOnce(); has any compile-time-memory-barrier semantics"* -- AFAIK no. The `SpinWait` just spins a few times and then yields execution to another thread (selected by the OS). Or yields immediately if the program runs on a single-core machine. I tried to implement a `ThreadSafeData` version based on the `SeqLock` idea, and I failed. Whatever I tried, I couldn't prevent the readers from seeing torn `T` values. – Theodor Zoulias Mar 07 '23 at 19:45
  • @TheodorZoulias: You mean your reader was seeing torn `T` values *without detecting a sequence-number mismatch*? The SeqLock idea is that the reader can see torn `T` values, but can detect when that might have happened and retry, so your read function should be able to avoid returning with a torn `T`. This does require ordering a second load *after* the load of `T`, but `Volatile.Read` is only an acquire load; it can reorder with earlier non-acquire loads. That's why in C++ you need something like `std::atomic_thread_fence(acquire)` to be safe on non-x86 ISAs. IDK how possible it is in C#. – Peter Cordes Mar 07 '23 at 20:35
  • I added two `volatile int` "guards" before and after the `T` field. The writer updated the head guard, then the data, then the tail guard. Head and tail the same value. The readers were reading the three fields in the same order, and were spinning until they read the same head and tail. It didn't work. Tons of torn values. Adding memory barriers (`Interlocked.MemoryBarrier()`) between writes and reads made no difference. – Theodor Zoulias Mar 07 '23 at 22:10
  • Basically [this](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0ATEBqAPgAQAYACfARgDoAVAC1gEMMBLAOwHMBuAWAChf8AzMQDOAFygBXMKOK0GGAMr0AZjAAi9UfQA8VAHy8A3r2KniAByhMAbppjFrEADaamT+6xkB9GjEbceMwsrW1F7KmIvDE16AKDLGzsHZ1d3Yk9IrTc4s0FSFGIFGFENLQAKCOitAEoTM2NAoLMMlhgEGQBeSN9GYmxiMhym7r8MYi7W9qGmqJjx4irYuuGvLKd5ydFpgF9l0iEIgHFi0voy2sbTBuHTBXNWAHV6Jhlhe5ZWqHmMGGV6CScWz2QQA7jQ3PYyuIJDALjdiNd4c0WDIemMuj5RtN4ZU5hjFtibkxlMQymjxhi1tVSAB2BYxQnDN6sT4UO6sADyLDAMHOjOIu0uAt4gqAA). – Theodor Zoulias Mar 07 '23 at 22:17
  • @TheodorZoulias: Your algorithm is broken, allowing tearing even with sequentially-consistent execution (no reordering, just an unlucky interleaving of source order). Writer writes `_head` and part of `_data`, reader reads `_head` and *all* of `_data` (torn read). Then writer finishes storing `_data` and `_tail`. Then reader reads the matching `_tail` value and doesn't detect tearing. A SeqLock checks two reads of a single sequence counter for both `first_read == second_read`, and that `first_read & 1 == 0`. That 2nd check is what lets it detect that it started reading mid-update. – Peter Cordes Mar 07 '23 at 22:35
  • @TheodorZoulias: On x86, it would work to just have two writes (or two reads) of one `volatile int seq`, assuming the compiler didn't do compile-time reordering of loads and stores. The non-volatile access to `_data` can't reorder in either direction at run-time on x86, because of the strong memory model. But on AArch64, you'd need an acquire (or stronger) fence because the `_data` access itself wouldn't be an acquire or release operation. e.g. in the reader, `T data = _data;` could read shared memory after `int second_seq = seq;`, because volatile read (acquire load) is only ordered 1 way. – Peter Cordes Mar 07 '23 at 22:40
  • @TheodorZoulias: If C# doesn't have a way to get only an acquire fence, you would presumably have to use `Interlocked.MemoryBarrier()` which is a full barrier. So readers would be less cheap than your RCU-style version that just updates a reference to a new object that isn't mutated. BTW, this problem doesn't go away with two separate `_head` and `_tail` variables, your version would *also* have this problem on non-x86, as well as the problem that can happen even without memory reordering. – Peter Cordes Mar 07 '23 at 22:43
  • Yep, nice explanation why it doesn't work. To be honest I tried this without having read first the [SeqLock](https://en.wikipedia.org/wiki/Seqlock) Wikipedia article. No wonder it failed. I might try it again later, and see what happens. :-) – Theodor Zoulias Mar 08 '23 at 07:42
  • It worked! I updated my answer with a Seqlock-based implementation, including some observations about performance. – Theodor Zoulias Mar 13 '23 at 16:45