
Reading the answer to Can x86 reorder a narrow store with a wider load that fully contains it?, I have some questions about it (sorry, my low reputation does not allow me to add further comments there).

That thread describes a simple program, running on one core, in which a store is followed by a load in program order. The core uses store-to-load forwarding to forward the contents of the store, while it is still waiting in the store queue (store buffer) to commit to L1d cache, to the following load.

This by itself isn't reordering yet (the load sees the store's data, and they're adjacent in the global order), but it leaves the door open for reordering: the cache line can be invalidated by another core after the load executes but before the store commits, so a store from another core can become globally visible after our load but before our store.

So the load includes data from our own store, but not from the other core's store. The other core can observe the same effect for its own load, and thus both threads enter the critical section.

Now my point is: if the cache line is invalidated by another core after our load has executed (i.e., after it has returned data from L1d cache), our load can never return the new value from the store executed by the other thread, which becomes globally visible after our load but before our store commits to L1d cache. Right?
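For concreteness, the locking scheme under discussion can be sketched in C. This is a hedged sketch, not code from the linked answer: all names (`lock_word`, `try_lock_slot`) are illustrative, and `atomic_thread_fence(memory_order_seq_cst)`, which compiles to `mfence` or an equivalent on x86, stands in for the barrier that makes the scheme work.

```c
/* Sketch of the "byte store, then containing dword reload" lock:
 * each of up to four threads claims one byte of a shared dword, then
 * reloads the whole dword to see whether it won. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static volatile uint32_t lock_word;  /* four one-byte lock slots, all 0 */

static bool try_lock_slot(int id) {  /* id in 0..3 */
    ((volatile uint8_t *)&lock_word)[id] = 1;   /* narrow byte store */

    /* Without a full barrier here, store-to-load forwarding can satisfy
     * the reload below from our own store-buffer entry before the store
     * is globally visible, so two threads can each see only their own
     * byte set and both "win". A seq_cst fence (mfence on x86) closes
     * that window by draining the store buffer first. */
    atomic_thread_fence(memory_order_seq_cst);

    uint32_t snapshot = lock_word;              /* wider reload */
    const uint8_t *seen = (const uint8_t *)&snapshot;
    for (int i = 0; i < 4; i++)
        if (i != id && seen[i] != 0)
            return false;                       /* another byte set: lost */
    return seen[id] == 1;                       /* only my byte set: won */
}
```

Single-threaded, the first caller wins its slot and any later caller of a different slot loses; mutual exclusion under contention is exactly what the fence is needed for.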

Peter Cordes
Carlo C
  • That's exactly the point, and why that locking scheme is broken unless you put an `mfence` or equivalent between the byte store and dword reload. Without that, multiple threads can (via store-forwarding) see their own store as having won the race to store a byte of the dword or qword. This is further discussed in [Globally Invisible load instructions](https://stackoverflow.com/q/50609934) – Peter Cordes Nov 12 '21 at 14:07
  • So I don't understand what you're asking with this SO question. You seem to have just re-stated my answer in your own words. Am I missing something here? Is there some supposed contradiction between what you wrote vs. what you're quoting, or are you just asking if your rephrasing is still saying the same correct thing? – Peter Cordes Nov 12 '21 at 14:14
  • 1
    Note that executing a load can involve reading the store buffer as well as L1d cache. For a load that partially overlaps a store, both of those things have to complete before the load has finished executing. Your description only mentions L1d cache. (And my description doesn't clearly say "executing" to distinguish from stores where the relevant time is committing, becoming globally visible. Loads don't "become globally visible", although I maybe hadn't realized that when I wrote that answer.) – Peter Cordes Nov 12 '21 at 14:18
  • Sorry, maybe I missed your point. Suppose we put an `mfence` between the byte store and the following dword load. When the partially overlapping load executes, its data is returned from L1d cache (store-to-load forwarding does not take place because the store buffer is empty, thanks to the `mfence`). Then, as you said, suppose the cache line is invalidated by another core after the load, and a store from another core becomes globally visible after our load. Even in this scenario the load returns data from our previous store, but not from the other core's store. – Carlo C Nov 12 '21 at 14:52
  • Right, but if the other core runs the same code with this timing, its load will see a dword with two set bytes in it, i.e. it lost the race to take the lock, and we have mutual exclusion. This locking scheme works on a sequentially-consistent machine, and an `mfence` there is enough ordering for the lock-taking part to work on x86. Without the `mfence`, both loads can produce a value without seeing the other core's store. – Peter Cordes Nov 12 '21 at 14:58
  • Ah ok, what you described in that answer was actually a sequence of events in which neither the first nor the second core loses the race to take the lock. As you pointed out, that lock scheme works on sequentially consistent (SC) hardware. The same thing is achieved on an x86 TSO machine by adding the `mfence` between the store and the load. Thank you. – Carlo C Nov 12 '21 at 15:35
  • In the above comment you said "my description doesn't clearly say "executing" to distinguish from stores where the relevant time is committing, becoming globally visible." So we say a load "executes" when it loads the relevant data from L1d cache, or via store-to-load forwarding from an earlier store to the same address. On the other hand, a store "executes" when it writes its data into an entry of the store buffer (store queue). Then, when the store uop eventually retires, the store-buffer entry is *allowed to commit* to L1d cache (provided that the relevant cache line is in MESI E or M state in the local cache), right? – Carlo C Nov 12 '21 at 16:23
  • Yup, exactly. A load executes when a load execution unit produces a value, with data that came from L1d and/or the store buffer. – Peter Cordes Nov 12 '21 at 16:30
  • Ok, so a real case in which the load execution unit gets data both from L1d cache *and* the store buffer is the scenario above: namely, a store to a byte followed (in program order) by a load of a dword that happens to overlap the previous store. – Carlo C Nov 12 '21 at 17:03
  • Exactly; that's why it causes a store-forwarding stall. https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/. See also [What are the costs of failed store-to-load forwarding on x86?](https://stackoverflow.com/q/46135369) – Peter Cordes Nov 12 '21 at 17:04
  • The first link in your last comment seems to say that in x86 implementations store-to-load forwarding does not happen in cases similar to ours. Is that correct? – Carlo C Nov 12 '21 at 18:27
  • It still happens, just with extra latency and worse throughput, as I described in my 2nd link. It's commonly called a store-forwarding *stall* or "failed" store-forwarding, but really it's just the *fast path* that failed. There's still a slower mechanism that scans the whole store buffer, rather than just waiting for the store buffer to drain, as my microbenchmark in the 2nd link demonstrates. – Peter Cordes Nov 12 '21 at 18:40
  • Ok, so on x86 implementations, when the load execution unit executes a load uop, it gets data from L1d cache *alone* (provided the cache line is locally in MESI M, E, or S state) only when there is *no* earlier store in program order waiting in the store queue that overlaps, even partially, with the address and width of the load itself, right? – Carlo C Nov 13 '21 at 12:59
  • That's correct. – Peter Cordes Nov 13 '21 at 14:39
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239204/discussion-between-carlo-c-and-peter-cordes). – Carlo C Nov 14 '21 at 17:01
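The partial-overlap case from the last few comments can be pinned down even single-threaded: a byte store followed by a containing dword load must architecturally yield the merged value, whether the hardware services it through the fast forwarding path or the slower store-buffer scan. A minimal sketch (the function name is mine; compile at `-O0` if you want the store and load to actually reach the memory pipeline rather than being folded by the compiler):

```c
#include <stdint.h>

/* A narrow store immediately followed by a wider load that fully
 * contains it: the fast store-forwarding path can't service the load,
 * so the core falls back to merging the store-buffer byte with the
 * other three bytes from L1d. Either way, the architectural result is
 * the merged value. */
uint32_t narrow_store_wide_load(void) {
    uint32_t word = 0x11223344;
    ((uint8_t *)&word)[0] = 0xAA;  /* narrow byte store */
    uint32_t v = word;             /* wider load: must see the merge */
    return v;                      /* 0x112233AA on little-endian x86 */
}
```

The slow path costs latency and throughput, not correctness; the returned value is the same either way.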

0 Answers