
Why does the LOCK prefix cause a full barrier on x86? (And thus drain the store buffer and provide sequential consistency)

For LOCKed read-modify-write operations, a full barrier shouldn't be required: exclusive access to the cache line seems sufficient for atomicity. Is the full-barrier behavior a design choice, or is there some other limitation?
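
For concreteness, a minimal sketch of the kind of operation in question (the function names are mine, and the codegen claim can be checked on a compiler explorer): on x86-64, even a relaxed atomic RMW compiles to a bare LOCK-prefixed instruction, with no separate fence.

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Both of these compile to a single "lock add" on x86-64 (or "lock
// xadd" if the old value were used) -- no mfence is emitted, because
// the LOCK prefix itself already acts as the full barrier.
void rmw_seq_cst() { counter.fetch_add(1, std::memory_order_seq_cst); }
void rmw_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }
```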

  • short answer: it's both a load and a store (which have to stay atomically together in the global order of operations), so it can't reorder with either in either direction. So it ends up *having* to be a full barrier. – Peter Cordes Feb 21 '20 at 06:19
  • @PeterCordes I thought about that; however, it is a load-then-store, and the x86 memory model already prohibits LoadStore reordering. Isn't that sufficient? – yggdrasil Feb 21 '20 at 07:50
  • Yes, but consider some examples, e.g. an RMW then a load. Can the RMW be delayed and appear after the load, like a normal store? No, because it would bring its load with it, and that would be LoadLoad reordering (see the litmus-test sketch after these comments). – Peter Cordes Feb 21 '20 at 07:53
  • @PeterCordes Hmm, I see, so in that case it would be to prevent the other load from "sneaking" in between the RMW's load & store? (which would lose its atomicity) – yggdrasil Feb 21 '20 at 09:04
  • (Which also happens if a store coming before the RMW gets reordered with its load, making LOCK boundaries effectively a full barrier?) – yggdrasil Feb 21 '20 at 09:07
  • pretty much. AFAICT, the only difference between an acq_rel RMW and a seq_cst RMW in ISO C++ is that acq_rel doesn't forbid IRIW reordering (when the load part observes a pure store from another core), but x86's total store order never allows that. Although see comments: [How do memory_order_seq_cst and memory_order_acq_rel differ?](//stackoverflow.com/posts/comments/104467665) – Peter Cordes Feb 21 '20 at 09:37
  • RMWs on LL/SC architectures are trickier to think about. One attempt I made: [What exact rules in the C++ memory model prevent reordering before acquire operations?](//stackoverflow.com/a/52636008). You can reorder as long as the final result is compatible with there being an atomic RMW *somewhere* in the modification order of the target cache line, and in any global order any other core could see. Planning to write a proper answer soon, but leaving comments while I think about it. – Peter Cordes Feb 21 '20 at 09:46
  • I see, very interesting. Thanks for the useful explanations! – yggdrasil Feb 21 '20 at 10:10
  • To support a "relaxed" read-modify-write, I think you are right: locking the cache line(s) would ensure that no other thread's write can become visible between the read and the write. But either (a) that lock must be held until the store buffer drains, or (b) the write would need to jump the buffer. I guess (b) adds some complexity, plus it would be incompatible with previous `LOCK` implementations. `LOCK` has not been a global lock since the 486 (I believe). What the processor does is like (a) -- which maps to seq_cst given the general 'strength' of the x86 memory model. – Chris Hall Feb 21 '20 at 10:32
  • @PeterCordes _the only difference between an acq_rel RMW and a seq_cst RMW ISO C++ is that acq_rel doesn't forbid IRIW reordering_; Considering e.g. POWER, if an acq_rel RMW has to guarantee StoreLoad order (otherwise a LoadLoad may result if two RMW operations are reordered), then it has to drain the store buffer; in that case IRIW is not possible. Isn't that a contradiction of the claim that an acq_rel RMW doesn't forbid IRIW? – Daniel Nitzan Dec 31 '20 at 09:28
  • @DanielNitzan: ISO C++'s formal rules can be weaker than any real ISA in practice. I'm pretty sure POWER can still do IRIW between the outputs of two acq_rel RMWs if the observers are acquire loads (not what I described in my earlier comment), unless stwcx stores completely bypass the store buffer? I'm not sure about observing with two acq_rel exchanges or fetch_add(0)s in each reader thread, though. You could ask that as a separate SO question about POWER; it's too complex for these comments. – Peter Cordes Jan 03 '21 at 11:11
  • @DanielNitzan: Note that https://godbolt.org/z/Pvxc99 shows that acq_rel fetch_add does *not* include a `sync`, only `lwsync` before (release) and `isync` after (acquire), but two such RMWs back to back might be enough to make it impossible or at least implausible on real hardware. – Peter Cordes Jan 03 '21 at 11:17
  • @DanielNitzan: The store buffer is [the mechanism](https://stackoverflow.com/a/50679223) on real HW as you say, but note that even the SC version only does `sync` *before* the RMW retry loop. (SC pure-load costs a `sync` on POWER.) In any case, unless the store bypasses the SB entirely (which would actually make sense for an atomic store-conditional; I'm retracting my earlier "pretty sure"), it will be in the SB and probably graduate for at least a cycle before it actually commits, so there could be a window of opportunity for cross-SMT store forwarding before it becomes globally visible. – Peter Cordes Jan 03 '21 at 11:19
  • @PeterCordes Thanks, my logic was flawed; I was trying to reason about reordering of two back-to-back RMWs, which is forbidden, though it has nothing to do with a StoreLoad guarantee. SB delay and SLF can manifest themselves as you've mentioned above. – Daniel Nitzan Jan 03 '21 at 19:52
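
The reordering argument from the comments, written out as a C++ litmus test (a sketch; the variable names and the relaxed loads are my choices, not from the comments):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.fetch_add(1, std::memory_order_seq_cst); // LOCKed RMW: load + store of x
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.fetch_add(1, std::memory_order_seq_cst); // LOCKed RMW: load + store of y
    r2 = x.load(std::memory_order_relaxed);
}

// Could thread1's RMW be delayed past its later load of y, the way an
// ordinary store can sit in the store buffer? No: the store half would
// drag the load half with it (they must stay adjacent to remain atomic),
// and the load of x moving after the load of y would be LoadLoad
// reordering, which x86 forbids. The same holds in thread2, so each RMW
// behaves as a full barrier and r1 == 0 && r2 == 0 is impossible here.
```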

1 Answer


A long time ago, before the Intel 80486, Intel processors had no on-chip caches or write buffers. Therefore, by design, all writes became immediately globally visible in order, and there were no buffered stores to drain from anywhere. A locked transaction was executed by fully locking the bus, for the entire address space.

In the 486 and Pentium processors, write buffers were added on-chip, and some models have on-chip caches as well. Consider first the models without on-chip caches. All writes are temporarily held in the on-chip write buffers until they can be written out on the bus or a serializing event occurs. Remember that atomic RMW transactions are used to acquire exclusive access to software structures or hardware resources. So if a processor performs a locked transaction, it must not happen that the processor thinks it was granted ownership of the resource while another processor somehow ends up obtaining ownership as well. If the write part of the locked transaction were buffered in a write buffer and the bus lock then relinquished, nothing would prevent other agents from also acquiring access to the resource at the same time. Essentially, the write part has to be made visible to all other agents, and the way to do that is to not buffer it. But the x86 memory model requires that all writes become globally visible in order (there was no weak ordering on these processors). So to make the write part of a locked transaction globally observable, all buffered writes also had to be made globally observable, in the same order.
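
To make the ownership argument concrete, here is a minimal spinlock sketch built on a LOCKed RMW (the names are mine, not from the answer):

```cpp
#include <atomic>

std::atomic<int> lock_word{0}; // 0 = free, 1 = held

void acquire() {
    // exchange compiles to an (implicitly LOCKed) xchg on x86.
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) {
        // spin until we observe the lock free and atomically take it
    }
    // If the store half (writing the 1) could sit in a write buffer
    // after the bus/cache lock was released, another core could still
    // read 0 and "acquire" the same lock: mutual exclusion would break.
    // Hence the write part must become globally visible before the lock
    // is released, and in-order visibility then forces out all earlier
    // buffered writes as well.
}

void release() {
    lock_word.store(0, std::memory_order_release);
}
```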

Some 486 models and all Pentium processors have on-chip caches, but on these processors there was no support for cache locking. That's why locked transactions were not cacheable on these processors: the only way to guarantee atomicity was to bypass the cache and lock the bus. After acquiring the bus lock, one or more writes are performed, depending on the alignment and size of the destination memory region. The write buffers still have to be drained before releasing the bus lock.

The Pentium Pro introduced some major changes, including weakly-ordered writes, write-combining buffers, and cache locking. What were called "write buffers" became what is usually referred to as the store buffer on more modern microarchitectures. A locked transaction utilizes cache locking on these processors, but the cache lock cannot be released until the locked store commits from the store buffer to the cache. Committing makes the store globally observable, which in turn requires making all earlier stores globally observable first. These events have to happen in that order. That said, I don't think locked transactions strictly have to serialize weakly-ordered writes, but Intel decided to make them do so, maybe because Intel wanted a convenient instruction that drains the WC buffers on the PPro in the absence of a dedicated store fence.
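
A side effect of this design persists today: because any LOCKed operation is a full barrier, a dummy LOCKed RMW can stand in for a fence. A sketch (the codegen claim is hedged; it varies by compiler and version):

```cpp
#include <atomic>

// Some x86-64 compilers (recent GCC, for example) implement a seq_cst
// thread fence not with mfence but with a dummy LOCKed RMW such as
// "lock or $0, (%rsp)", precisely because the LOCK prefix already
// drains the store buffer like a full barrier, and can be cheaper.
void full_barrier() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}
```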

Hadi Brais