
I've got a single writer that has to increment a variable at a fairly high frequency, and one or more readers who access this variable at a lower frequency.

The write is triggered by an external interrupt.

Since I need to write at high speed, I don't want to use mutexes or other expensive locking mechanisms.

The approach I came up with is copying the value after writing to it. The reader can then compare the original with the copy; if they are equal, the variable's content is valid.

Here is my implementation in C++:

template<typename T>
class SafeValue
{
private:
    volatile T _value;      // written by the single writer
    volatile T _valueCheck; // copy used by readers to validate _value
public:
    void setValue(T newValue)
    {
        _value = newValue;
        _valueCheck = _value;
    }

    T getValue()
    {
        volatile T value;
        volatile T valueCheck;
        do
        {
            // Read the copy first, then the original; retry until both match.
            valueCheck = _valueCheck;
            value = _value;
        } while(value != valueCheck);

        return value;
    }
};

The idea behind this is to detect data races while reading and retry if they happen. However, I don't know whether this will always work. I haven't found anything about this approach online, hence my question:

Is there any problem with my approach when used with a single writer and multiple readers?

I already know that high write frequencies may cause starvation of the readers. Are there more bad effects I have to be cautious of? Could it even be that this isn't thread-safe at all?

Edit 1:

My target system is an ARM Cortex-A15.

T should be able to be at least any primitive integral type.

Edit 2:

std::atomic is too slow on both the reader and writer side. I benchmarked it on my system: writes are roughly 30 times slower, and reads roughly 50 times slower, compared to unprotected primitive operations.

Detonar
  • 1,409
  • 6
  • 18
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/186465/discussion-on-question-by-detonar-lock-free-single-writer-multiple-reader). – Samuel Liew Jan 10 '19 at 10:59
  • Do you really need to _write_ at high speed? Couldn't you increment a writer-private variable, and then just add it to the reader-accessible total _periodically_, as opposed to having to increment the total by 1 _frequently_? In other words, can the readers tolerate a little bit of staleness? – Branko Dimitrijevic Jan 10 '19 at 11:36
  • @Branko Dimitrijevic - Not much. The counter is used to track the position of a stepper motor. The reader needs to get this motor's position as exactly as possible. – Detonar Jan 10 '19 at 11:40
  • @Detonar I would recommend to edit the question to clearly explain in detail why atomics are not suitable for you. As you write in the comment below: _The problem with `std::atomic` is that internal system calls are greatly slowing it down._ What _internal system calls_? Atomic operations should boil down to special instructions only if atomics are supported by hardware for a given data type (is it your case or not?). What do you mean by _slowing it down_? Slowing the producer? Slowing the consumers? How do you measure such slowing? – Daniel Langr Jan 10 '19 at 12:12
  • @Daniel Langr - I benchmarked it on the target system. Edited my question. – Detonar Jan 10 '19 at 12:29
  • @Detonar Thanks. I am still not sure whether I understand the whole problem, since when I compare the generated assembly of your solution with one that uses `std::atomic` (for `int` type, ARM, and GCC 8.2), the latter one is much simpler and should be more efficient: https://godbolt.org/z/lpvFQB. BTW, how do you measure the slowdown? – Daniel Langr Jan 10 '19 at 14:47
  • Just out of curiosity, I also tried to substitute `volatile` in your solution by `atomic`s: https://godbolt.org/z/tl4wdB. The generated assembly is very similar and I doubt there would be any such significant performance difference (note that there is no reason for those local variables in `getValue` to be `volatile`). – Daniel Langr Jan 10 '19 at 15:00
  • You benchmarked `std::atomic` with what ordering constraint? The default sequential consistency? Did you profile with release/acquire or release/consume ordering? Did you look at the generated assembly for those cases? – Useless Jan 10 '19 at 16:25
  • What you are doing is very similar to what I have seen called "sequence lock". Here is another thread with some details. https://stackoverflow.com/questions/48749643/is-volatile-required-here – ttemple Jan 12 '19 at 12:44

4 Answers

5

If this single variable is just an integer, pointer, or plain old value type, you can probably just use std::atomic.

selbie
  • 100,020
  • 15
  • 103
  • 173
  • 1
    `std::atomic` is too slow, even when lock-free, which is not always guaranteed. – Detonar Jan 10 '19 at 10:05
  • 2
    @Detonar Your claims make no sense. Study the assembly generated by `std::atomic`. – Maxim Egorushkin Jan 10 '19 at 10:38
  • @MaximEgorushkin Not really if you read the question. He just wants to avoid the CAS-loop. The problem he is trying to solve CAN be solved faster than `std::atomic` as his task is much more specific. –  Jan 10 '19 at 10:39
  • @Ivan There is one writer only, the shared variable is `int`, no `CAS` loop is necessary. – Maxim Egorushkin Jan 10 '19 at 10:44
  • @MaximEgorushkin Reading 8-byte values on 32-bit platforms is implemented using a CAS loop, which will fail on read-only pages, for example. I think he wants to speed up the reader side. The problem is that his code is not "lock-free" at all. –  Jan 10 '19 at 10:45
  • @Ivan Where did you pull numbers 8 and 32 from? – Maxim Egorushkin Jan 10 '19 at 10:47
  • @MaximEgorushkin From his comments. Although there is a lengthy 30-comment discussion now. He needs this to work for 8 byte values on 32-bit ARM. –  Jan 10 '19 at 10:48
  • @Ivan: The OP's Cortex-A15 (ARMv7) doesn't have CAS in hardware, it has LL/SC. And it can do an atomic 64-bit load or store using LDREXD, or LDREXD+STREXD. Are you maybe thinking of 32-bit x86 with `cmpxchg8b`? – Peter Cordes Jan 10 '19 at 15:15
  • @PeterCordes GCC generates CAS (`__sync_val_compare_and_swap_8` call) to read 8 byte value for 32-bit ARM - [link](https://godbolt.org/z/DH8J7S). I think you will need both LDREXD and STREXD to implement it. Similar is done for x86, but it will be 8-byte `cmpxchg` instruction, yes. –  Jan 10 '19 at 15:47
  • @Ivan: You forgot to tell the compiler it was targeting an ARMv7-a CPU, specifically `-mcpu=cortex-a15`. https://godbolt.org/z/q-S5x1. In your link, gcc was compiling for any generic ARM, and had to emit code that worked on anything, including ARMv5 and earlier before LDREX / STREX even existed, and before `ldrd` was guaranteed atomic anywhere, so it couldn't inline them. – Peter Cordes Jan 10 '19 at 16:07
  • @PeterCordes Indeed, I have checked ARM specs on that. Surprise that by default such slow implementation is used. Linux doesn't use CAS at all when implementing 64-bit accesses. –  Jan 10 '19 at 16:16
3

You should try using std::atomic first, but make sure that your compiler knows and understands your target architecture. Since you are targeting a Cortex-A15 (an ARMv7-A CPU), make sure to use -march=armv7-a or even -mcpu=cortex-a15.

The first will generate the ldrexd instruction, which is atomic according to the ARM docs:

Single-copy atomicity

In ARMv7, the single-copy atomic processor accesses are:

  • all byte accesses
  • all halfword accesses to halfword-aligned locations
  • all word accesses to word-aligned locations
  • memory accesses caused by LDREXD and STREXD instructions to doubleword-aligned locations.

The latter will generate the ldrd instruction, which is atomic on targets supporting the Large Physical Address Extension:

In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.

--- Note ---

The Large Physical Address Extension adds this requirement to avoid the need for complex measures to avoid atomicity issues when changing translation table entries, without creating a requirement that all locations in the memory system are 64-bit single-copy atomic.

You can also check how Linux kernel implements those:

#ifdef CONFIG_ARM_LPAE
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("@ atomic64_read\n"
"   ldrd    %0, %H0, [%1]"
    : "=&r" (result)
    : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#else
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("@ atomic64_read\n"
"   ldrexd  %0, %H0, [%1]"
    : "=&r" (result)
    : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#endif
Community
  • 1
  • 1
2

There's no way anyone can know. You would have to see whether your compiler documents any multi-threaded semantics that would guarantee that this will work, or look at the generated assembler code and convince yourself that it will work. Be warned that in the latter case, it is always possible that a later version of the compiler, different optimization options, or a newer CPU might break the code.

I'd suggest testing std::atomic with the appropriate memory_order. If for some reason that's too slow, use inline assembly.

David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • I mainly wanted to focus on the algorithm rather than language-specific details. And how would using inline assembly provide a viable performance increase? The problem with `std::atomic` is that internal system calls are greatly slowing it down. – Detonar Jan 10 '19 at 11:25
  • @Detonar The algorithm is so simple here that it's all down to the specifics of implementation. Using inline assembly would provide a viable performance increase because it would allow you to be 100% sure you were getting what you were asking for so you wouldn't need to code defensively. Defensive coding will likely cost you. – David Schwartz Jan 10 '19 at 11:46
1

Another option is to have a buffer of non-atomic values the publisher produces and an atomic pointer to the latest.

#include <atomic>
#include <utility>

template<class T>
class PublisherValue {
    static auto constexpr N = 32;
    T values_[N];
    std::atomic<T*> current_{values_};

public:
    PublisherValue() = default;
    PublisherValue(PublisherValue const&) = delete;
    PublisherValue& operator=(PublisherValue const&) = delete;

    // Single writer thread only.
    template<class U>
    void store(U&& value) {
        T* p = current_.load(std::memory_order_relaxed);
        if(++p == values_ + N)
            p = values_;
        *p = std::forward<U>(value);
        current_.store(p, std::memory_order_release); // (1) 
    }

// Multiple readers. Make a copy to avoid referencing the slot for too long.
    T load() const {
        return *current_.load(std::memory_order_consume); // Sync with (1).
    }
};

This is wait-free, but there is a small chance that a reader gets de-scheduled while copying the value and hence reads the oldest slot while it is being overwritten. Making N bigger reduces this risk.

Maxim Egorushkin
  • 131,725
  • 17
  • 180
  • 271