From Herb Sutter's talk, the figure on page 2 of the slides: https://skydrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&wdo=2&authkey=!AMtj_EflYn2507c

[image: slide from the talk]

The figure shows the L1 cache and the Store Buffer (SB) as separate components.

1. On Intel x86 processors, are the L1 cache and the Store Buffer the same thing?

And the next slide:

[image: next slide from the talk]

As the next slide shows, only one kind of reordering is possible on x86: a load may be reordered with an earlier store to a different location. Before:

    MOV [memory2], edx ; write
    MOV eax, [memory1] ; read
    ...                ; MOV, MFENCE, ADD ... any other code

After:

    MOV eax, [memory1] ; read
    MOV [memory2], edx ; write
    ...                ; MOV, MFENCE, ADD ... any other code

This is due to out-of-order execution in the processor pipeline.

2. Can you show another example like this one, showing how the Store Buffer affects reordering?

3. And the main question: how do LFENCE and SFENCE affect the caches of neighboring cores?

Is it correct to say that:

  1. SFENCE does a "push", i.e. it flushes Store Buffer->L1 and then sends the changes from the Core0 L1/L2 caches to the L1/L2 caches of all other cores (Core1/2/3...)?
  2. LFENCE does a "pull", i.e. it receives changes from the L1/L2 caches (and Store Buffers?) of all other cores (Core1/2/3...) into the L1/L2 caches of our core, Core0?
Alex

1 Answer

  1. The store buffer is not a cache; it's an ordering queue. It holds pending stores, while the cache can be thought of as a logical part of memory (i.e., everything in any of the caches is visible to all other agents, and the cache must respond correctly to snoops).

  2. Stores are not reordered with each other. That would break memory ordering, since stores become visible to other agents as soon as they are performed (unlike loads, which only affect internal state).

  3. Fences do not operate on caches and have nothing to do with other cores. Caches are already fully visible and synchronized. Fences only constrain execution order (in case operations are performed out of order internally), and therefore apply only to the current context.
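To make the store-buffer effect concrete, here is a minimal sketch of the classic "store buffer" litmus test, written in C++ rather than assembly. The helper name `count_both_zero` and its parameters are made up for illustration. Each thread stores to one variable and then loads the other. Without a fence, both threads are allowed to read 0, because each load may complete while the other core's store is still waiting in its store buffer. With a full fence between the store and the load (mainstream x86 compilers typically emit `MFENCE` or a locked instruction for it), that outcome is forbidden. How often the unfenced outcome actually shows up depends on the machine and on timing, so the sketch only counts occurrences.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs the store-buffer litmus test `iterations` times
// and returns how often both threads read 0.
// use_fence == false: "both read 0" is a permitted outcome.
// use_fence == true:  the seq_cst fences forbid that outcome.
int count_both_zero(int iterations, bool use_fence) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        std::atomic<int> started{0};
        int r1 = -1, r2 = -1;

        // Rendezvous so the two store/load pairs overlap as often as possible.
        auto wait_for_both = [&] {
            started.fetch_add(1);
            while (started.load() < 2) std::this_thread::yield();
        };

        std::thread t1([&] {
            wait_for_both();
            x.store(1, std::memory_order_relaxed);   // write x
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);  // read y
        });
        std::thread t2([&] {
            wait_for_both();
            y.store(1, std::memory_order_relaxed);   // write y
            if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);  // read x
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling `count_both_zero(100000, false)` on a multi-core x86 machine may return a nonzero count, while `count_both_zero(100000, true)` must return 0. Note that the fence only constrains the thread that executes it: it makes that thread's store globally visible before its later load runs, and it does nothing to the other core's cache.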

Is it correct to say that:

  1. SFENCE does a "push", i.e. it flushes Store Buffer->L1 and then sends the changes from the Core0 L1/L2 caches to the L1/L2 caches of all other cores (Core1/2/3...)?
  2. LFENCE does a "pull", i.e. it receives changes from the L1/L2 caches (and Store Buffers?) of all other cores (Core1/2/3...) into the L1/L2 caches of our core, Core0?

sfence/mfence flush the store buffer, since they don't allow pending speculative stores to remain (that's why they're fences). However, as I said, once the changes are in L1 they are already observable by everyone; they don't have to be flushed anywhere further away.

In the same sense, lfence doesn't "pull" anything; it just stalls the execution of all younger loads until the older ones (and the fence itself) have finished and committed. This affects performance by serializing the loads, but it does not otherwise protect you against any operation on other cores, unless you have another way to make sure that any store you depend on has been performed by then (in which case the load picks up the updated value in time).
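The "another way to make sure the store has been performed" is usually a flag that the reader waits on. Here is a minimal C++ sketch of that message-passing pattern. The function name `message_passing_demo` is hypothetical. The consumer pulls nothing: it simply keeps loading the flag until cache coherence delivers the producer's store, and the release/acquire pair guarantees that the payload written before the flag is visible once the flag is seen. On x86, mainstream compilers typically implement both the release store and the acquire load as plain `MOV` instructions, with no `SFENCE` or `LFENCE`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical demo: passes `value` from a producer thread to a consumer
// thread through a plain int guarded by an atomic flag.
int message_passing_demo(int value) {
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // flag the consumer waits on
    int observed = -1;

    std::thread producer([&] {
        payload = value;                                // 1. write the data
        ready.store(true, std::memory_order_release);   // 2. publish the flag
    });
    std::thread consumer([&] {
        // Wait until the producer's flag store becomes visible.
        while (!ready.load(std::memory_order_acquire)) std::this_thread::yield();
        observed = payload;           // guaranteed to see the producer's write
    });
    producer.join();
    consumer.join();
    assert(ready.load());             // the flag was published before exit
    return observed;
}
```

For example, `message_passing_demo(42)` returns 42. Ordering is what the acquire/release pair provides; visibility across cores comes from the coherent caches themselves.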

Leeor
  • Thanks! "3. fences do not work on caches": but if I do `MOV [addr], reg`, this operation writes the register's value to the L1 cache (and marks the cache line as **M**odified). Can I then flush L1 to RAM by using `SFENCE`? – Alex Dec 02 '13 at 14:12
  • @Alex: No, to flush a cache you need a cache sync instruction such as `clflush` (for a single line) or `wbinvd` (for the entire cache). `sfence` just ensures that all pending stores are written to the L1 cache. – Leeor Dec 02 '13 at 14:36
  • OK. But if, as you say, "`sfence` just ensures that all pending stores are written to the L1 cache" and "Caches are already fully visible and synched", then all cores should see changes at the same time. Yet at the link we can read: "Sequential ordering may be necessary for multiple producer-multiple consumer situations where **all consumers must observe the actions of all producers occurring in the same order**." This would mean that without `SFENCE` some cores may observe changes while others do not (**i.e. not fully visible and synched**). How does `SFENCE` affect this? – Alex Dec 02 '13 at 15:30
  • @Alex (sorry for the late response): see http://stackoverflow.com/questions/20907811/x86-memory-ordering-loads-reordered-with-earlier-stores-vs-intra-processor-for/20908626#20908626, which is a good example of how SC may be broken due to intra-core forwarding. The caches are coherent and observable, so loads from unrelated cores should see the stores in a consistent order, but loads *from one of the involved cores* may get that order wrong. One of the other examples in the optimization guide shows exactly that with >2 processes. – Leeor Jan 07 '14 at 14:12
  • @Leeor: if `sfence` only affects the store buffer, does that mean the synchronization doesn't work on modern multi-socket systems? – user2284570 Feb 12 '17 at 17:29