Just like in C++, hardware CAS (e.g. on x86-64 or ARMv8.1) doesn't support comparing only part of the object; you'd have to roll your own in asm, too.
In C++ it's fairly easy to fake: load the original value and substitute the caller's expected value into the member you actually want to compare. This can of course fail spuriously if another core changed the part you *didn't* want to compare against between that load and the CAS.
If possible use `unsigned m_index` instead of `size_t`, so the whole struct fits in 8 bytes on typical 64-bit machines instead of 16. 16-byte atomics are slower (especially the pure-load part) on x86-64, or not even lock-free at all on some implementations and/or some ISAs. See How can I implement ABA counter with c++11 CAS? re: x86-64 `lock cmpxchg16b` with current GCC/clang.
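A quick compile-time sanity check for that assumption (a minimal sketch; the `Data` struct here mirrors the one in the full example further down, and `is_always_lock_free` needs C++17):

```cpp
#include <atomic>

struct Data {                       // same layout as the struct in the example below
    alignas(long long) char m_data;
    unsigned m_index;
};

static_assert(sizeof(Data) == 8, "expecting one 8-byte object, not 16");
static_assert(std::atomic<Data>::is_always_lock_free,
              "not lock-free here: see the next paragraph, a mutex is probably better");
```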
If each `atomic<>` access separately takes a lock, it would be vastly better to just take a mutex around the whole custom compare-and-set.
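For that fallback, a minimal sketch of what I mean (reusing the `Data` struct from above; `LockedData` and `partial_cas` are names I made up, not anything the standard library provides):

```cpp
#include <mutex>

struct LockedData {
    std::mutex mtx;
    Data d;                                   // plain struct, no atomics at all

    // Succeed only if m_index matches; replace the whole struct on success.
    bool partial_cas(unsigned expected_idx, Data desired) {
        std::lock_guard<std::mutex> lock(mtx);
        if (d.m_index != expected_idx)
            return false;
        d = desired;
        return true;
    }
};
```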
I wrote a simple implementation of one CAS attempt (like `cas_weak`) as an example. You could maybe use it in a template specialization or derived class of `std::atomic<Data>` to provide a new member function for `atomic<Data>` objects.
```cpp
#include <atomic>

struct Data {
    // without alignment, clang's atomic<Data> doesn't inline load + CAS?!?
    // even though return d.is_always_lock_free; is true
    alignas(long long) char m_data;
    unsigned m_index;    // this last so compilers can replace it slightly more efficiently
};

inline bool partial_cas_weak(std::atomic<Data> &d, unsigned expected_idx, Data zz,
                             std::memory_order order = std::memory_order_seq_cst)
{
    Data expected = d.load(std::memory_order_relaxed);
    expected.m_index = expected_idx;              // new index, same everything else
    return d.compare_exchange_weak(expected, zz, order);
    // updated value of "expected" discarded on CAS failure
    // If you make this a retry loop, use it instead of repeated d.load
}
```
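If you want a full retry loop with those semantics (give up only when the index itself really mismatches), here's a sketch built on the same idea; `partial_cas_strong` is just a name I picked, not a `std::atomic` member:

```cpp
inline bool partial_cas_strong(std::atomic<Data> &d, unsigned expected_idx, Data zz,
                               std::memory_order order = std::memory_order_seq_cst)
{
    Data expected = d.load(std::memory_order_relaxed);
    expected.m_index = expected_idx;              // new index, same everything else
    while (!d.compare_exchange_weak(expected, zz, order)) {
        if (expected.m_index != expected_idx)
            return false;       // the index itself differs now: genuine failure
        // Otherwise only our guess at the non-index bytes was stale (or the CAS failed
        // spuriously); "expected" already holds the fresh value with a matching index,
        // so just retry without another separate load.
    }
    return true;
}
```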
This compiles nicely in practice with clang for x86-64 (Godbolt), inlining into a caller that passes a compile-time-constant `order` (otherwise clang goes berserk branching on that `order` arg for a stand-alone non-inline version of the function):
```asm
# clang10.0 -O3 for x86-64
test_pcw(std::atomic<Data>&, unsigned int, Data):
        mov     rax, qword ptr [rdi]           # load the whole thing
        shl     rsi, 32
        mov     eax, eax                       # zero-extend the low 32 bits, clearing m_index
        or      rax, rsi                       # OR in a new high half = expected_idx
        lock cmpxchg qword ptr [rdi], rdx      # the actual 8-byte CAS
        sete    al                             # boolean FLAG result into register
        ret
```
Unfortunately compilers are too dumb to only load the part of the atomic struct they actually need, instead loading the whole thing and then zeroing out the part they didn't want. (See How can I implement ABA counter with c++11 CAS? for union hacks to work around that on some compilers.)
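For reference, roughly the flavour of union hack that answer describes, in very hedged form: it relies on reading an inactive union member (a GNU C extension, not ISO C++) and on a lock-free `std::atomic<Data>` having the same object representation as `Data`, and whether it actually produces a narrower load depends on the compiler. `PartialData` and the member names are ones I made up for this sketch:

```cpp
#include <atomic>
#include <cstring>

// Overlap the whole 8-byte atomic with two 4-byte halves, so the part we actually
// need (m_data plus its padding) can be loaded on its own.
union PartialData {
    std::atomic<Data> whole;
    struct Halves {
        std::atomic<unsigned> low;    // overlays m_data + the 3 padding bytes
        std::atomic<unsigned> idx;    // overlays m_index
    } halves;

    PartialData() : whole(Data{}) {}  // pick an active member
};

inline bool partial_cas_weak_hack(PartialData &d, unsigned expected_idx, Data zz,
                                  std::memory_order order = std::memory_order_seq_cst)
{
    unsigned low = d.halves.low.load(std::memory_order_relaxed);  // 4-byte load only
    Data expected;
    std::memcpy(&expected, &low, sizeof(low));   // m_data (and padding) into the low half
    expected.m_index = expected_idx;             // the caller's index for the compare
    return d.whole.compare_exchange_weak(expected, zz, order);
}
```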
Unfortunately GCC makes messy asm that stores/reloads temporaries to the stack, leading to a store-forwarding stall. GCC also zeroes the padding after `char m_data` (whether it's the first or last member), possibly leading to a CAS that always fails if the actual object in memory has non-zero padding. That might not be able to happen in practice if pure stores and initialization always leave the padding zeroed.
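One way to keep that padding predictable, assuming you control every `Data` value that ever gets stored (`make_data` is a hypothetical helper, and "padding stays zero" is an in-practice observation, not a language guarantee):

```cpp
#include <cstring>

inline Data make_data(char c, unsigned idx)
{
    Data d;
    std::memset(&d, 0, sizeof(d));   // zero the whole object, padding bytes included
    d.m_data = c;
    d.m_index = idx;
    return d;                        // in practice the 8-byte copy-out preserves those bytes
}
```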
An LL/SC machine like ARM or PowerPC could do this easily in assembly (the compare/branch is done manually, between the load-linked and the store-conditional), but there are no libraries that expose that portably. (Most importantly because it couldn't compile for machines like x86, and because what you can do in an LL/SC transaction is severely limited and debug-mode spill/reload of local vars could result in code that always failed.)