x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding

Question

I am trying to understand section 8.2 of Intel's System Programming Guide (that's Vol 3 in the PDF).

In particular, I see two different reordering scenarios:

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

and

8.2.3.5 Intra-Processor Forwarding Is Allowed

However, I do not understand the difference between these scenarios from the point of view of observable effects. The examples provided in those sections seem interchangeable to me: the 8.2.3.4 example can be explained by the 8.2.3.5 rule just as well as by its own rule. The converse seems true to me as well, although I am less sure about that case.

So here is my question: are there better examples or explanations of how the observable effects of 8.2.3.4 differ from those of 8.2.3.5?

Just taking a stab here, but I don't think 8.2.3.5 can be explained by 8.2.3.4. The unexpected result in the example for 8.2.3.5 doesn't happen because of an instruction reordering in an individual core. It happens because of the delay in a core seeing the memory update made by another core. Also, 8.2.3.4 can give you an unexpected result even in a single-core situation, whereas 8.2.3.5 is strictly a multi-core phenomenon. – Aaron, Jan 03 '14 at 16:28
@Aaron, 8.2.3.4 is also a multi-context matter; a single process shouldn't care about load reordering, since if the addresses conflict the accesses wouldn't reorder, and otherwise reordering wouldn't affect the results. – Leeor, Jan 03 '14 at 17:10
@Aaron: I see! The load to R2 cannot be moved before the store to [x] because the load to R1 is in the way. So 8.2.3.4 cannot be used to explain 8.2.3.5. Thank you! Could you write this up as an answer? – Jan 03 '14 at 17:54
@Leeor, ah yes, you are right. You wouldn't really see a weird result when using a single core. – Aaron, Jan 03 '14 at 18:35
@Arkadiy, I guess I see better now what you are asking... I think you're right, due to 8.2.3.2. Without 8.2.3.2, the example in 8.2.3.5 could result from 8.2.3.4, even though they are different underlying mechanisms. – Aaron, Jan 03 '14 at 18:35
Related: [Globally Invisible load instructions](https://stackoverflow.com/q/50609934) – Peter Cordes, Sep 19 '20 at 15:39

Leeor · Answer 1 · 2014-01-04T12:26:33.023

The example in 8.2.3.5 should be "surprising" if you expect memory ordering to be all strict and clean, and it remains surprising even if you acknowledge that 8.2.3.4 allows loads to reorder with earlier stores to different addresses.

   Processor 0      |      Processor 1
  --------------------------------------
   mov [x], 1       |      mov [y], 1
   mov R1, [x]      |      mov R3, [y]
   mov R2, [y]      |      mov R4, [x]

The key part is that the newly added loads in the middle (into R1 and R3) both return 1; store-to-load forwarding makes that possible in the uarch without stalling. So in theory, you would expect both stores to have been "observed" globally by the time both of these loads have completed. That would have been the case with sequential consistency, where there is a unique ordering between stores and all cores see it.

However, the fact that R2 = R4 = 0 is still a valid outcome afterwards proves this is not the case: each store is in fact observed locally first, before it becomes globally visible. In other words, allowing this outcome means that processor 0 sees the stores in the order time(x) < time(y), while processor 1 sees the opposite order.

This is a very important observation about the consistency of this memory model, which the previous example doesn't prove. This nuance is the biggest difference between Sequential Consistency and Total Store Ordering - the second example breaks SC, the first one doesn't.
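
One way to check the reasoning above is to enumerate every execution of a toy machine. The sketch below is an illustrative model only, not Intel's specification or a real microarchitecture: each processor executes in program order, stores enter a private FIFO store buffer, and the buffer drains to memory at arbitrary times. The function `explore` and its `forwarding` flag are hypothetical names introduced for this example. With `forwarding=False`, a load whose address matches a buffered store has to wait for the drain, which roughly approximates having only the 8.2.3.4 relaxation; with `forwarding=True`, the load takes the buffered value instead. The 8.2.3.4 program is assumed to be the manual's two-instruction example (store x then load y, against store y then load x).

```python
def explore(programs, forwarding):
    """Return every reachable final register state of a toy store-buffer machine."""
    n = len(programs)
    start = ((0,) * n, ((),) * n, (), ())  # (pcs, store buffers, memory, registers)
    seen, stack, outcomes = {start}, [start], set()

    def put(tup, i, value):
        return tup[:i] + (value,) + tup[i + 1:]

    while stack:
        pcs, bufs, mem, regs = stack.pop()
        if all(pc == len(prog) for pc, prog in zip(pcs, programs)):
            outcomes.add(regs)  # every instruction has executed
            continue
        for i, prog in enumerate(programs):
            successors = []
            if bufs[i]:
                # Drain step: the oldest buffered store becomes globally visible.
                addr, val = bufs[i][0]
                new_mem = tuple(sorted({**dict(mem), addr: val}.items()))
                successors.append((pcs, put(bufs, i, bufs[i][1:]), new_mem, regs))
            if pcs[i] < len(prog):
                op, a, b = prog[pcs[i]]
                next_pcs = put(pcs, i, pcs[i] + 1)
                if op == "st":  # ("st", address, value)
                    new_bufs = put(bufs, i, bufs[i] + ((a, b),))
                    successors.append((next_pcs, new_bufs, mem, regs))
                else:  # ("ld", register, address)
                    own = [v for ad, v in bufs[i] if ad == b]
                    if own and not forwarding:
                        value = None  # stall until the matching store has drained
                    elif own:
                        value = own[-1]  # forward the newest matching store
                    else:
                        value = dict(mem).get(b, 0)  # memory starts zeroed
                    if value is not None:
                        new_regs = tuple(sorted({**dict(regs), a: value}.items()))
                        successors.append((next_pcs, bufs, mem, new_regs))
            for state in successors:
                if state not in seen:
                    seen.add(state)
                    stack.append(state)
    return outcomes


EXAMPLE_8_2_3_4 = [
    [("st", "x", 1), ("ld", "r1", "y")],
    [("st", "y", 1), ("ld", "r2", "x")],
]
EXAMPLE_8_2_3_5 = [
    [("st", "x", 1), ("ld", "R1", "x"), ("ld", "R2", "y")],
    [("st", "y", 1), ("ld", "R3", "y"), ("ld", "R4", "x")],
]
BOTH_ZERO = (("r1", 0), ("r2", 0))
SURPRISING = (("R1", 1), ("R2", 0), ("R3", 1), ("R4", 0))

print(BOTH_ZERO in explore(EXAMPLE_8_2_3_4, forwarding=False))   # True
print(SURPRISING in explore(EXAMPLE_8_2_3_5, forwarding=False))  # False
print(SURPRISING in explore(EXAMPLE_8_2_3_5, forwarding=True))   # True
```

In this model the machine without forwarding already allows r1 = r2 = 0 for the 8.2.3.4 program, but only the forwarding machine reaches R1 = R3 = 1 together with R2 = R4 = 0. That is consistent with the point that the 8.2.3.5 outcome does not follow from 8.2.3.4 alone.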

edited Jan 04 '14 at 12:26

answered Jan 03 '14 at 16:47

Leeor

You say that "the newly added loads in the middle both return 1". If that is indeed true, then the example makes more sense. However, I don't see it ("return 1") in the text. Where is it? – Jan 03 '14 at 17:50
@Arkadiy, I guess it's assumed implicitly, since they talk about forwarding. If it didn't return 1 after just having stored it, you would break all coherency – Leeor Jan 03 '14 at 18:57
Great answer. I find this is little understood, and the answer hits it on the head: there are _two_ types of re-ordering in x86: the `StoreLoad` reordering (pretty much required if you have a store buffer) demonstrated by `8.2.3.4` and this "store forwarding" reordering, which isn't cleanly explicable in terms of the 4 standard re-orderings. You really have to just explain it as "loads can take their value from an earlier store from the same CPU, apparently out of order (reading a later value) with respect to surrounding loads". Or something. – BeeOnRope Jun 01 '18 at 01:21
I do disagree about one thing: the `StoreLoad` reordering in 8.2.3.4 _already_ breaks sequential consistency, because SC requires that each operation appear in program order in the otherwise arbitrary total order of operations. To get the result `r1==r2==0` there is no ordering consistent with the program order that produces it: at least one of the reads must be reordered with the write from the same CPU. – BeeOnRope Jun 01 '18 at 01:29
@BeeOnRope, I'm not sure SC cares about the original program order. What's wrong with assuming both loads in 8.2.3.4 are performed first in the global order? It doesn't break any rule. By the way, I've seen a distinction made between the SC concept used in some places (including the C++ SC memory model, I think), where the global order refers to *modifications* seen by everyone, and the classic definition of SC, where it includes all operations (including loads). – Leeor Jun 01 '18 at 07:27
@Leeor - everything I've seen about SC says it preserves program order, including the link you have above. If it doesn't respect program order, it is a very weak model indeed and you can't reason easily about it. In practice, it's the strongest model and easy to reason about for that reason. The only non-determinism in SC is the relative order of operations from different threads. – BeeOnRope Jun 01 '18 at 15:52
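
The point about program order in the comments above can be checked by brute force under one assumption: the classic definition of sequential consistency, in which all operations form a single total order that respects each processor's program order. The sketch below runs every such interleaving with memory initially zero and collects the resulting `(r1, r2)` pairs. The helper `sc_outcomes` is a hypothetical name, and the two-instruction programs are assumed to match the 8.2.3.4 example.

```python
from itertools import combinations

# Assumed shape of the 8.2.3.4 example: each processor stores to one
# location and then loads the other one.
P0 = [("st", "x", 1), ("ld", "r1", "y")]
P1 = [("st", "y", 1), ("ld", "r2", "x")]


def sc_outcomes(p0, p1):
    """Run every interleaving that keeps each processor's program order."""
    total = len(p0) + len(p1)
    results = set()
    # Choose which positions of the global order belong to processor 0;
    # the remaining positions are filled by processor 1, both in program order.
    for slots in combinations(range(total), len(p0)):
        it0, it1 = iter(p0), iter(p1)
        mem, regs = {}, {}
        for pos in range(total):
            op, a, b = next(it0) if pos in slots else next(it1)
            if op == "st":  # ("st", address, value): visible to everyone at once
                mem[a] = b
            else:  # ("ld", register, address): reads the single shared memory
                regs[a] = mem.get(b, 0)
        results.add((regs["r1"], regs["r2"]))
    return results


outcomes = sc_outcomes(P0, P1)
print(sorted(outcomes))  # [(0, 1), (1, 0), (1, 1)]
```

Under that definition `(0, 0)` never appears, so observing `r1==r2==0` requires a load to pass an earlier store from the same processor. A notion of SC that orders only modifications, as raised in the reply above, is outside what this sketch models.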
