
Given something like this on an ARMv8 CPU (though this may apply to many others as well):

class abcxzy
{
  // Aligned to a cache line (via pragma/attribute) to ensure both
  // members exist on the same line.
  uint32_t atomic_data;
  uint32_t data;

  void foo()
  {
    asm volatile (
      "   ldr w0, [address of data]\n"
      "# Do stuff with data in w0...\n"
      "   str w0, [address of data]\n"

      "1: ldaxr w0, [address of atomic_data]\n"
      "   add w1, w0, #0x1\n"
      "   stxr w2, w1, [address of atomic_data]\n"
      "   cbnz w2, 1b\n"
    );
  }
};

Assume proper clobbers and constraints are set on the inline asm so that C and asm can coexist happily in a world of rainbow ponies and sunshine.

In a multiple-CPU situation, with all CPUs running this code at the same time, will the stores to data cause the atomic load/store to atomic_data to fail? From what I've read, the ARM exclusive-access mechanism works on a cache line basis, but it is not clear whether the non-atomic store will affect the atomic one. I hope that it doesn't (and assume that it does...), but I am looking to see if anyone else can confirm this.
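For reference, here is roughly the same pattern in portable C++ (a sketch of my own; the alignas(64) line size and the relaxed fetch_add are assumptions, not part of the asm above). On ARMv8 targets without LSE atomics, compilers typically turn the fetch_add into an LL/SC retry loop much like the hand-written one:

#include <atomic>
#include <cstdint>

// Both members deliberately share one (assumed 64-byte) cache line.
struct alignas(64) abcxzy {
  std::atomic<std::uint32_t> atomic_data{0};
  std::uint32_t data{0};  // same line as atomic_data

  void foo() {
    data += 1;  // plain, non-atomic store to the shared line
    // Typically compiles to a ldxr/add/stxr/cbnz retry loop
    // when LSE atomics are not available.
    atomic_data.fetch_add(1, std::memory_order_relaxed);
  }
};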

Michael Dorgan
  • Both atomic and non-atomic operations work on a cache line basis. The C++ memory model allows concurrent modification of variables at different addresses, even when they share a cache line (which is somewhat hard to tell from source anyway). – LWimsey Dec 06 '17 at 19:26
  • @LWimsey The C++ memory model goes out the window when `asm` is used. A C++ compiler might for example make various optimizations (e.g. accessing `atomic_data` and `data` in one 64-bit access, or not using "unnecessary" atomic instructions) that still result in generated code conforming to the C++ memory model, but break when this asm statement is used (or break the assumptions this asm statement makes). – Ross Ridge Dec 06 '17 at 20:03
  • Honestly, I'm more concerned about whether the `stxr` will return `1` here because the hardware detected that someone modified the atomic variable's cache line, but not the actual atomic variable, thereby causing spurious failures. If so, then in a case where many other vars sit on the atomic's line, the atomic might never be able to "complete" due to this cache-line meddling, causing a livelock trying to atomically increment the variable. – Michael Dorgan Dec 06 '17 at 20:12
  • Certainly false negatives will be an issue if the C++ compiler is also generating exclusive access instructions that access the same cache line and are being run in parallel on other CPUs at the same time. It seems unlikely that you'd run into the case where one thread never is able to acquire a lock or whatever because of this contention, but ultimately it's up to you to ensure that. – Ross Ridge Dec 06 '17 at 20:49
  • But what about non-exclusive accesses? Will non-exclusive accesses to the same cache line as an exclusive access cause contention? – Michael Dorgan Dec 06 '17 at 21:51
  • Are you asking whether stores to `data` from *different* CPUs will cause the atomic RMW of `atomic_data` to fail, or about accesses to `data` from the *same* CPU? In the first case, the answer is an obvious YES, a write to the same cache line by another thread will absolutely cause the RMW to fail. – EOF Dec 06 '17 at 23:43

1 Answer


Ok, finally found what I needed, though I don't like it:

According to the ARM documentation, it is IMPLEMENTATION DEFINED whether a non-exclusive store to the same cache line as the exclusive store causes the exclusive store to fail. Thanks, ARM. I appreciate that wonderfully inconclusive info.


Edit:

By "fail", I mean that the `stxr` instruction did not write to memory and returned a `1` in its status register: a "your atomic data was updated, redo the RMW" status.

To answer other statements:

  • Yes, atomic critical sections should be as small as possible. The docs even give numbers on how small, and they are very reasonable indeed. I hope my sections never span 1k or more...

  • And yes, any situation where you would need to worry about this kind of contention killing performance (or worse) means your code is "doing it wrong." The ARM docs state this in a roundabout manner :)

  • As to putting the non-atomic loads and stores inside the atomics: my pseudo-test above was just demonstrating a random access to the same cache line as an example. In real code you should obviously avoid this. I was just trying to get a feel for how "bad" it might be if, say, a high-speed hardware timer store was hitting the same cache line as a lock. Again, don't do this; a padding sketch that sidesteps the sharing follows this list...
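One way to make the problem impossible (my own sketch, not from the ARM docs): give the atomic its own cache line so that no unrelated store can land on the line the exclusive monitor is watching. The 64-byte line size is an assumption; the exclusives reservation granule is IMPLEMENTATION DEFINED and may be larger, so some codebases pad to 128 bytes or more.

#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed line size; the exclusives reservation granule may be larger,
// in which case a bigger pad is warranted.
constexpr std::size_t kLineSize = 64;

struct padded_abcxzy {
  alignas(kLineSize) std::atomic<std::uint32_t> atomic_data{0};
  alignas(kLineSize) std::uint32_t data{0};  // starts its own line
};

static_assert(sizeof(padded_abcxzy) >= 2 * kLineSize,
              "atomic_data and data must not share a cache line");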

Michael Dorgan
  • Just to be clear, when they say "fail", they almost certainly mean that the `stxr` will need to retry (i.e. that it won't store anything, and will set `w2` non-zero so the CBNZ will be taken). It's not really plausible that it could result in a non-atomic store becoming visible to other threads! Especially when the store isn't even between the `ldaxr` and `stxr` (LL/SC pair). – Peter Cordes Dec 07 '17 at 04:46
  • You could test on the CPU(s) you have access to by instrumenting your code: increment one counter outside the LL/SC retry loop, and another (in a register) inside the LL/SC loop. The difference is the number of LL/SC retries. If you get more than interrupts can explain in single-threaded code, then there are conditions where an in-flight store to the same cache line is problematic. (A sketch of this instrumentation appears after these comments.) – Peter Cordes Dec 07 '17 at 04:49
  • Of course ARM's position is _totally reasonable_. They want to keep their options open for a variety of current and future implementations, and even if they did want to explain the behavior, it probably wouldn't reduce to a simple rule at all. You should assume that concurrent stores to the same line can cause some LL/SC operations to fail, since the general model of how these work is that the load basically sets a "has a pending SC" flag on the cache line, which gets cleared if the line is lost to another core. That's why you want to keep the critical region as small as possible. – BeeOnRope Dec 07 '17 at 04:56
  • Now some implementation might, as a QoS measure, have a policy of delaying relinquishing a cache line with the LL flag set in response to a snoop from another core, to give the SC a better chance to succeed, but even that is going to be a short, bounded delay. Do you think that even if ARM did that on some chip they are going to document it as architectural behavior? No way! Probably not even as a performance hint in an optimization guide. It's mostly academic anyway, since you should avoid that kind of contention. – BeeOnRope Dec 07 '17 at 04:59
  • In practice this really shouldn't matter. Another CPU would need to be constantly writing to the same cache line in which you're trying to perform the atomic operation in order to cause the atomic operation to never complete and "fail". However, this is a situation you need to avoid anyway, because it doesn't take such an extreme amount of contention for the same cache line to cause big performance problems. – Ross Ridge Dec 07 '17 at 09:02
  • It should be fine if you re-order the code to put the generic load/store inside the ldaxr/stxr pair. That way the whole cache line is committed atomically. It is also faster, as you are writing to the CPU cache; the other way, the CPU would have to flush before the `ldaxr`. If the values are not related, then don't put them in the same cache line (or always allocate atomics in their own cache granules). In this light, I don't think it is that limiting a restriction. – artless noise Dec 07 '17 at 18:23
  • Inherited code failing a brain dead test... The truth is, I expected a write to the same cache line/granule to cause the atomic to fail as this is common sense, but was just looking for 100% confirmation from the docs. Seeing the implementation defined statement annoyed me :) This drove me to fix my problem in a 100% generic way anyway so good still came from this sojourn. – Michael Dorgan Dec 07 '17 at 19:19
  • This is why people do things like make mutexes 128 bytes in size and align them to 128-byte boundaries even though the basic structure can be much smaller. It is also a strong justification for hardware atomic-fetch-op support for things like histogram computation, etc. As elegant as LL/SC seems, it is hardly perfect. – Zalman Stern Dec 07 '17 at 20:03
  • @ZalmanStern: if you're going to multi-thread a histogram, it's probably faster to have each thread keep a separate count array. Maybe I'm biased by x86, where atomic increment is only available with a full memory barrier; if HW supports relaxed atomic increment without hurting memory pipelining in the uncontended case, it could be good for very large bin counts where separate copies are problematic. Related: if you're not counting, but rather just setting bits, you can non-atomic update / fence / read back the globally visible state to check it: https://stackoverflow.com/a/45994685/224132. – Peter Cordes Dec 07 '17 at 21:51
  • Histogram was intended more as something people can visualize/reason about than as something where this happens a lot, but fetch-op implementations do get used on GPUs. It comes up more when things get complex and multiple types of bins are kept at once. (Also, in some situations the performance of the cross-thread reduction matters too. One can do it via a tree of course. Fetch-op tries to push these complexities to the hardware.) We plan to add atomic-arithmetic-op support to Halide for some of these GPU use cases. (I'd edit my original, but it's too late :-)) – Zalman Stern Dec 08 '17 at 01:18
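Following up on Peter Cordes's instrumentation suggestion, here is a hedged sketch (the function name, operand names, and constraints are my own; GCC/Clang AArch64 inline asm assumed) that counts LL/SC attempts so spurious `stxr` failures become measurable:

#include <cstdint>

// Returns the old value; accumulates the number of stxr retries in *retries.
static inline std::uint32_t
fetch_add_counted(std::uint32_t *addr, std::uint64_t *retries)
{
  std::uint32_t old, tmp, status;
  std::uint32_t attempts = 0;
  asm volatile(
      "1: ldaxr %w[old], %[mem]\n"
      "   add   %w[tmp], %w[old], #0x1\n"
      "   stxr  %w[st], %w[tmp], %[mem]\n"
      "   add   %w[n], %w[n], #0x1\n"   // count every attempt
      "   cbnz  %w[st], 1b\n"           // retry if the monitor was lost
      : [old] "=&r"(old), [tmp] "=&r"(tmp), [st] "=&r"(status),
        [n] "+r"(attempts), [mem] "+Q"(*addr)
      :
      : "memory");
  *retries += attempts - 1;  // attempts beyond the first are retries
  return old;
}

Run one thread calling this in a loop while another hammers plain stores at the same cache line; if the retry count climbs well beyond what interrupts can explain, your implementation fails exclusives on non-exclusive same-line stores.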