
The problem is best explained with some simple code.

struct foo
{
    static constexpr auto N=8;
    double data[N];            //  initialised at construction
    int max;                   //  index to maximum: data[max] is largest value

    // if value < data[index]:
    //   - update data[index] = value
    //   - update max
    void update(int index, double value)
    {
        if(value >= data[index])
            return;
        data[index] = value;
        if(index==max)         // max unaffected if index!=max
            for(index=0; index!=N; ++index)
                if(data[index] > data[max])
                    max = index;
    }
};

Now, I want to make foo::update() thread safe, i.e. allow concurrent calls from different threads, where no two participating threads ever call with the same index. One way is to add a mutex or a simple spinlock (contention can be presumed low) to foo:

struct foo
{
    static constexpr auto N=8;
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    double data[N];
    int max;

    // index is unique to each thread
    // if value < data[index]:
    //   - update data[index] = value
    //   - update max
    void update(int index, double value)
    {
        if(value >= data[index])
            return;
        while(lock.test_and_set(std::memory_order_acquire));  // acquire spinlock
          data[index] = value;
          if(index==max)
              for(index=0; index!=N; ++index)
                  if(data[index] > data[max])
                      max = index;
        lock.clear(std::memory_order_release);                // release spinlock
    }
};
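
As an aside, a minimal usage sketch (my illustration, not part of the original post) of the intended calling pattern, where each thread owns exactly one index:

#include <thread>
#include <vector>

int main()
{
    foo f{};                           // assume data[] gets sensible initial values
    std::vector<std::thread> workers;
    for(int i = 0; i != foo::N; ++i)
        workers.emplace_back([&f, i]{
            f.update(i, -1.0 - i);     // each thread uses its own, unique index
        });
    for(auto& t : workers)
        t.join();
}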

However, how can I implement foo::update() lock-free (you may consider data and max to be atomic)?


NOTE: this is a simpler version of the original post, without relation to the tree structure.

Walter
  • This is a hard question; a lock-free tree is a research topic by itself – LWimsey Jul 05 '17 at 16:04
  • @LWimsey I have changed/simplified the question to avoid any complications arising from the tree. – Walter Jul 06 '17 at 08:19
  • `value = data[max=index];` should be `max = index;`, isn't it? – Tsyvarev Jul 06 '17 at 08:37
  • @Tsyvarev Not in the code originally posted: `value` must be updated too (since it's used in the previous line). I will edit the post to avoid that and make it a bit clearer. – Walter Jul 06 '17 at 08:44
  • Your hand-rolled lock isn't safe: you usually need seq_cst while taking a lock, to prevent loads in the critical section from being visible before the store that actually takes the lock. Remember that `acquire` only applies to the load part of the test-and-set, not the store. On x86, it will compile to a `lock xchg` or `lock bts` or something, so you could only see the problem on an architecture where an atomic RMW with weaker than `seq_cst` can actually compile that way in the asm. (This might not be the exact right reasoning, but I'm pretty sure `acquire` is too weak.) – Peter Cordes Aug 10 '17 at 03:36
  • This might be a good use-case for Intel TSX (transactional memory): https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions, if you're on x86 hardware that supports it. Actually, IDK if that gains you much over just taking a lock, as long as most calls take the fast-path that doesn't involve any locking. – Peter Cordes Aug 10 '17 at 04:19
  • @PeterCordes With `test_and_set` being an indivisible atomic RMW, how can a load in the critical section be reordered with the store part without also being reordered with the load part (violating the acquire barrier) ? – LWimsey Aug 10 '17 at 05:17
  • @LWimsey: Consider the case where `test_and_set` is implemented on a CPU [with LL/SC retry loop](https://en.wikipedia.org/wiki/Load-link/store-conditional), because it doesn't have memory-destination atomic RMW instructions, e.g. ARM. If the load is an acquire-load, but the store is a relaxed store, stuff in the critical section can pass the store. Even a release-store doesn't stop loads from escaping. I know I read something about needing a stronger memory barrier when taking a lock, which made sense at the time, but I can't seem to find it now. I think this was the argument. – Peter Cordes Aug 10 '17 at 05:37
  • @LWimsey: Without extra fencing, [LL/SC doesn't stop the load and store from reordering with other instructions](https://groups.google.com/d/msg/lock-free/2W6XE66KW-A/XSVzyd_xrCQJ). But it's still atomic for that location: any other observer observing or modifying *that location* will see it or not. So it's kind of like a very limited memory transaction, rather than an atomic RMW. – Peter Cordes Aug 10 '17 at 05:46
  • @PeterCordes Interesting (thanks for the link), but your argument was that `test_and_set` has to be seq/cst.. That will add a release part to the store, which, I believe, still does not prevent the StoreLoad re-ordering you mentioned. It does not turn a store into a full barrier, only guarantees a single global order for these seq/cst operations – LWimsey Aug 10 '17 at 06:05
  • @LWimsey: `mo_acq_rel` is still weaker than `mo_seq_cst`. So yes, you do need `seq_cst` to stop loads escaping the critical section, which does require a full `mfence`-style barrier. Your argument makes some sense, but I know that gcc on x86 inserts an `mfence` after a `seq_cst` store but not after a `release` store. So `seq_cst` is stronger than what you're thinking of. – Peter Cordes Aug 10 '17 at 06:09
  • @PeterCordes Yes, but that is implementation.. a compiler could insert an `mfence` before each seq/cst load (inefficient but correct) and let a seq/cst store be a regular `mov`. Point being, seq/cst operations are only seq/cst with respect to each other. The only guarantee that applies to a single seq/cst store is that it has `release` semantics – LWimsey Aug 10 '17 at 06:13
  • @LWimsey: Hrm, yes, you're right. Even a seq_cst test_and_set doesn't used a memory-barrier instruction on ARM: https://godbolt.org/g/sqwtY6. Really what's needed is an `atomic_thread_fence(mo_seq_cst)` after the `test_and_set`, because that will keep the critical section's loads from escaping. ([fences are stronger than regular operations tagged acq_rel](http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/), and http://en.cppreference.com/w/cpp/atomic/memory_order#Sequentially-consistent_ordering goes into a lot of detail about fencing.) – Peter Cordes Aug 10 '17 at 06:31
  • Also, turns out `atomic_flag` isn't just a bit inside a byte. Since it always takes at least one whole byte to itself, gcc implements it with a single byte, and uses `xchg` for test-and-set, and just a simple release-store for `clear`. – Peter Cordes Aug 10 '17 at 06:33
  • @LWimsey: Anyway, thanks for clearing up my x86-centric thinking. I had always thought of seq_cst as implying a full memory barrier on anything with a store or RMW, and hadn't realized it wasn't like that on LL/SC architectures. That makes it suck to roll your own spinlock, since on x86 you end up with two fences: the `xchg` and the `mfence`. Of course, it sucks a lot to roll your own spinlock for other reasons, too! You lose out on stuff like OS-supported fallback, and on the `pause` instruction on x86 to reduce power and other benefits. And spinning on an atomic exchange sucks, too. – Peter Cordes Aug 10 '17 at 06:34
  • @PeterCordes thank you too, I am glad I learned more about this LL/SC stuff – LWimsey Aug 10 '17 at 06:45
  • @LWimsey: Apparently on ARM64, LL-acquire/SC-release is already enough to order a LDAR, unless gcc has a bug. I put a `load(seq_cst)` after a `test_and_set(seq_cst)` and got the same asm as for a `load(acquire)` on ARM64(https://godbolt.org/g/QeBg8C). On PowerPC 64 (notoriously weakly-ordered), there is an extra `sync` instruction before the plain `lwz` with `seq_cst`, but not with `acquire` (https://godbolt.org/g/2Kc7tK). I assume that orders it with respect to the `isync` after the byte LL/SC. So one way to roll your own lock would be to use seq_cst loads inside the "critical section". – Peter Cordes Aug 10 '17 at 12:01
  • @PeterCordes I've given it more thought and I am not convinced an acquire barrier is too weak for the given scenario.. Maybe some day I'll post a new question. – LWimsey Aug 11 '17 at 04:39
  • Why not remove the tracking of `max` from the update function (at which point it becomes lock-free as long as atomic assignments to `data` are lock-free), and then simply calculate `max` on demand when required? With `N=8` this doesn't seem too bad even if requests for `max` are frequent. – BeeOnRope Aug 12 '17 at 21:31
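
For what it's worth, here is a sketch of the belt-and-braces fence placement debated in the comments above (whether the plain acquire test_and_set really is too weak is left unresolved there, so treat this as illustration, not settled guidance):

while(lock.test_and_set(std::memory_order_acquire))
    ;                                                 // spin until the flag is ours
std::atomic_thread_fence(std::memory_order_seq_cst);  // the extra fence suggested in the discussion
// ... critical section ...
lock.clear(std::memory_order_release);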

1 Answer


So, IIUC, the array only gets new values if they are lower than what is already there (and I won't worry about how the initial values got there), and if the current max is lowered, find a new max.

Some of this is not too hard. But some is... harder.

So the "if value < data[index] then write data" needs to be in a CAS-loop. Something like:

auto oldval = data[index].load(std::memory_order_relaxed);
do
{
    if (value >= oldval)
        return;   // the stored value is already <= ours: nothing to do
} while ( ! data[index].compare_exchange_weak(oldval, value) );
// (note that oldval is updated to the current data[index] each time compare_exchange_weak fails)
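
For this to compile, data itself must be atomic, which the question explicitly allows us to assume. A minimal declaration sketch:

#include <atomic>

struct foo
{
    static constexpr auto N = 8;
    std::atomic<double> data[N];  // per the question, data may be assumed atomic
    std::atomic<int>    max;
};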

So now data[index] has the new lower value. Awesome. And relatively easy. Now about max.

First question - Is it OK for max to ever be wrong? Because it may currently be wrong (in our scenario, where we update data[index] before handling max).

Can it be wrong in some ways, not others? ie let's say our data is just two entries:

double data[2] = { 3, 7 };

And we want to do update(1, 2) ie change the 7 to a 2. (And thus update max!)

Scenario A: set data first, then max:

data[1] = 2;
pause(); // ie scheduler pauses this thread
max = 0; // data[0]==3 is now max

If another thread comes in at pause(), then data[max] is wrong: 2 instead of 3 :-(

Scenario B: set max first:

max = 0; // it will be correct "shortly"
pause();
data[1] = 2;

Now a thread could read data[max] as 3 while 7 is still in data. But 7 is going to become 2 "soon", so is that OK? Is it "less wrong" than scenario A? Depends on usage? (ie if the important thing is "which is max" we have that right. But if max was the only important thing, why store all the data at all?)

It seems odd to ask "is wrong OK", but in some lock-free situations it actually is a valid question. To me B has a chance to be OK for some uses, whereas A doesn't.

Also, and this is important:

data[max] is always wrong, even in the perfect algorithm

By this I mean you need to realize that data[max], as soon as you read it, is already out of date - if you are living in a lock-free world. Because it may have changed as soon as you read it. (Also because data and max change independently. But even if you had a function getMaxValue(), it would be out of date as soon as it returns.)

Is that OK? Because, if not, you obviously need a lock. But if it is OK, we can use it to our advantage - we might be able to return an answer which we know is somewhat incorrect / out-of-date, but no more incorrect than what you could tell from the outside.

If neither scenario is OK, then you must update max and data[index] at the same time. Which is hard since they don't fit into a lock-free sized chunk.

So instead you can add a layer of indirection:

struct DataAndMax { double data[N]; int max; };
std::atomic<DataAndMax*> ptr;

Whenever you need to update max, you need to make a whole new DataAndMax struct (ie allocate a new one), somehow fill it all out nicely, and then atomically swap ptr to the new struct. And if some other thread changed ptr while you were preparing the new data, then you would need to start over, since you need their new data in your data.
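
A rough sketch of that prepare-and-publish retry loop (names like allocateBuffer/releaseBuffer are hypothetical placeholders, and safe reclamation is deliberately ignored here - it is exactly the problem the next paragraphs describe):

void updateHard(int index, double value)
{
    DataAndMax* cur   = ptr.load();
    DataAndMax* fresh = allocateBuffer();               // hypothetical allocator
    do
    {
        *fresh = *cur;                                  // snapshot the current contents
        fresh->data[index] = value;
        fresh->max = 0;                                 // recompute max on the private copy
        for(int i = 1; i != N; ++i)
            if(fresh->data[i] > fresh->data[fresh->max])
                fresh->max = i;
    } while( ! ptr.compare_exchange_weak(cur, fresh) ); // retry if ptr moved under us
    releaseBuffer(cur);                                 // hypothetical - unsafe without more care (see below)
}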

And if ptr has changed twice, then it may look like it hasn't changed, when it really has: Let's say ptr currently has value 0xA000 and a 2nd thread allocates a new DataAndMax at 0xB000, and sets ptr to 0xB000, and frees the old one at 0xA000. Now yet another thread (3rd) comes in, allocates yet another DataAndMax - and lo and behold the allocator gives you back 0xA000 (why not, it was just freed!). So this 3rd thread sets ptr to 0xA000.

And this all happens while you are trying to set ptr to 0xC000. All you see is that ptr was 0xA000, and later still is 0xA000, so you think it (and its data) hasn't changed. Yet it has - it went from 0xA000 to 0xB000 (when you weren't looking) back to 0xA000 - the address is the same, but the data is different. This is called the ABA problem.

Now, if you knew the max number of threads, you could pre-allocate:

DataAndMax dataBufs[NUM_THREADS];
DataAndMax * ptr; // current DataAndMax

And then never allocate/delete and never have ABA problems. Or there's other ways to avoid ABA.
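
One classic alternative (my sketch, not from the post): pair the pointer with a generation tag so that a recycled address still fails the compare. This relies on a double-width CAS being available - std::atomic over a 16-byte struct may or may not be lock-free on your target:

#include <atomic>
#include <cstdint>

struct DataAndMax;           // the struct from above

struct TaggedPtr
{
    DataAndMax*   p;
    std::uint64_t tag;       // bumped on every successful swap
};

std::atomic<TaggedPtr> taggedPtr;   // 16 bytes: check taggedPtr.is_lock_free() on your target

bool publish(DataAndMax* fresh)
{
    TaggedPtr cur  = taggedPtr.load();
    TaggedPtr next{fresh, cur.tag + 1};
    return taggedPtr.compare_exchange_strong(cur, next);   // a recycled 0xA000 with an old tag won't match
}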

Let's go back, and think about how we are going to - no matter what - return a max value that is potentially out of date. Can we use that?

So you come in, and first check if the index you are about to write to is the important one or not:

if (index != max) {
    // we are not touching max,
    // so nothing fancy here!
    data[index] = value;
    return;
}
// else do it the hard way:
//...

But this is already wrong. After the if and before the set, max may have changed. Does every set need to update max!?!?

So, if N is small, you could just linear search for max. It may be wrong if someone makes an update while searching, but remember - it could also be wrong if someone makes an update right after searching or right after "insert magic here". So searching, other than possibly being slow, is as correct as any algorithm. You will find something that was, for a moment, max. If N == 8, I would use searching. Definitely. You can probably search 8 entries using memory_order_relaxed loads, and that will be faster than trying to maintain anything using stronger atomic ops.
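
A sketch of such a search (my code, caching the best value so each slot is loaded only once):

int findMaxIndex(const std::atomic<double> (&data)[8])
{
    int    best    = 0;
    double bestVal = data[0].load(std::memory_order_relaxed);
    for(int i = 1; i != 8; ++i)
    {
        double v = data[i].load(std::memory_order_relaxed);
        if(v > bestVal)
        {
            best    = i;
            bestVal = v;
        }
    }
    return best;   // "was max at some moment" - possibly stale on return, as argued above
}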

I have other ideas:

More bookkeeping? Store maxValue separately?

double data[N];
double maxValue;
int indexOfMax;


bool wasMax = (index == indexOfMax);
data[index] = value;
if (wasMax || index == indexOfMax)   // re-check: indexOfMax may have changed meanwhile
    findMax(&indexOfMax, &maxValue); // linear search

That would probably need a CAS-loop somewhere. Still linear search, but maybe less often?
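
If the (value, index) pair has to change in one atomic step, one speculative option (echoing the 8-byte packed index+value idea also raised in the comments; narrowing to float is a real cost):

#include <atomic>
#include <cstdint>

struct MaxInfo
{
    float        value;   // narrowed from double so the pair fits in 8 bytes
    std::int32_t index;
};

std::atomic<MaxInfo> maxInfo;   // 8 bytes: usually lock-free on 64-bit targets

void replaceMax(MaxInfo desired)
{
    MaxInfo cur = maxInfo.load();
    while( ! maxInfo.compare_exchange_weak(cur, desired) )
        ;   // cur is refreshed on failure; real code would re-derive desired from it
}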

Maybe you need extra data at each entry? Not sure yet.

Hmmmm.

This is not simple. Thus even if a correct algorithm exists (and I think one does, within some constraints), whatever you write is likely to have bugs. ie the correct algorithm may be out there, but you don't find it - what you instead find is an algorithm that just looks correct.

tony
  • `atomic<double>` exists now. [Compilers currently suck at it, but C++11 exposes pretty much everything that x86 supports](https://stackoverflow.com/a/45086258/224132), except for TSX (transactional memory). There is no hardware support for atomic `+=` or atomic `max()` on a `double` in memory, so you (or the compiler) have to do it with CAS (x86 `cmpxchg`, which is perfectly capable of operating on 64-bit `double` bit-patterns). What do you think C++20 could do about anything? – Peter Cordes Aug 10 '17 at 03:52
  • Good point about `max` being out-of-date instantly. Storing the max value as well as the index might help. Maybe in an 8B `atomic` of an `int32 index` and a `float val`, which you can load with a single atomic load. Then you might be able to control which way the errors go. (e.g. maybe you could arrange the order of updates so that `maxval` is always <= the max value in the array.) That might allow you to keep the fast-path through the function very light-weight, with no retry-loop needed when you're decreasing an index other than the current max. – Peter Cordes Aug 10 '17 at 06:58
  • Hrm, then recheck the max after storing into the array? No, that's no good, I don't think you can have the fast path that light weight (wait free, and not even an atomic RMW) without allowing for errors when another thread decreases max below the value you just stored. – Peter Cordes Aug 10 '17 at 07:01
  • I think if you know different threads can't have the same `index`, you probably don't want a contiguous array anyway. Having multiple threads writing to the same cache line causes false-sharing that you could avoid by having them each write to their own index. Other threads reading a cache line you wrote is a lot less bad than them writing it, because it just has to become Shared, not Invalidated; you can get it back to Modified without having to reload it. ([MESI](https://en.wikipedia.org/wiki/MESI_protocol)). – Peter Cordes Aug 10 '17 at 07:07
  • If you had each thread maintain its own max, and then survey all threads when you need to read max, that might help (assuming you read less often than write). – tony Aug 12 '17 at 05:32
  • @PeterCordes for atomic float, see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0020r5.html – tony Aug 12 '17 at 05:33
  • Interesting. I guess CPUs like ARM that use [LL/SC](https://en.wikipedia.org/wiki/Load-link/store-conditional) instructions for atomicity can support atomic `fetch_add` for float just as efficiently as integer, but C++11 doesn't give access to it. The only thing new in that proposal is `fetch_add` and operator `+=`/`-=`. C++11 already supports everything else, including `compare_exchange_weak`, for `atomic<double>`. – Peter Cordes Aug 12 '17 at 05:50
  • Hmm, LL/SC would also make atomic multiply, divide and min/max possible (even for integer) with better performance than a CAS loop, but that proposal doesn't expose any of those. C++11 basically ignores LL/SC because it's so low-level and with such platform-specific restrictions on what you can do between an LL and SC that it makes it hard to portably expose it. https://groups.google.com/a/isocpp.org/forum/#!topic/std-discussion/pUrS0aXX_uM. – Peter Cordes Aug 12 '17 at 05:57
  • @PeterCordes - why would LL/SC have better performance than a CAS loop for those ops (unless the underlying LL/SC operation is somehow faster than CAS)? Implementing all of those things with LL/SC follows exactly the same pattern as CAS - read, calculate updated value, and "store", with the "store" possibly failing in either implementation. The only intrinsic advantage algorithmic LL/SC has, AFAIK, is in detection of intervening writes which makes ABA detection automatic. Of course, you might mean that LL/SC is faster at the hardware level... – BeeOnRope Aug 12 '17 at 21:22
  • @BeeOnRope: I was thinking that HW support for LL/SC might lock the cache line on LL, so the SC was more likely to succeed. Or delay the LL until the line was in E state. Anyway, basically provide hardware arbitration to allow `N` LL/SC operations on different cores to complete with not much more than `N` attempts, like x86 HW does for `lock add [mem], 1`. Far better than the worst-case behaviour with many cores hammering away with load/stuff-in-registers/CAS/retry. If the CAS is implemented with LL/SC, then on dumb hardware you only save the initial load, but smart HW can do much better. – Peter Cordes Aug 12 '17 at 23:36
  • @BeeOnRope: With no hardware arbitration, and too many threads in LL/SC loops on the same shared memory address, you could get a livelock situation: the SC always fails on every CPU because it lets the line be invalidated between the LL and SC. I tried to google some to find out if real hardware does hardware arbitration for LL/SC, but wasn't able to find anything practical with those search terms. – Peter Cordes Aug 12 '17 at 23:52
  • @PeterCordes I was thinking atomic floats were more restricted (or not atomic) - I guess I was wrong. (I've never had an occasion that needed them.) I've removed that part of my answer. – tony Aug 13 '17 at 07:21
  • @tony: I've never used atomic floats either. I don't recommend it when you can cheaply avoid it, since compilers aren't good at them. It was interesting to see what happens when you throw some `atomic` code at them, like I did in the answer I linked. – Peter Cordes Aug 13 '17 at 07:24
  • @PeterCordes - ah, right, you are saying that it can work better at the hardware level. It certainly could be the case. A directly apples to apples comparison on real hardware is tough because platforms tend to have one or the other, not both, and then each platforms capabilities are usually quite different, and may have different memory models etc. For example, CAS may perform better on x86 than LL/SC on platform X simply because it has more has had more hardware and engineering resources poured into it. – BeeOnRope Aug 13 '17 at 18:46
  • You point out an issue with CAS though - that the initial `load` is usually a plain one, so you waste time first getting the line in S state (usually) and then almost immediately after trying to get it in M state when the `cmpxchg` comes through, probably increasing the window for the line to be stolen and generally increasing cache and coherency traffic. I wonder if blind-writing to an adjacent, unused, position on the same line first, to get the line in M initially, would actually be an optimization in some cases. LL/SC doesn't have this issue since the LL is already special. – BeeOnRope Aug 13 '17 at 18:48