Semantics of LOCK in the Intel architecture

Question

Does a locked instruction serialize all memory operations on a multiprocessor system?

From the following description in the Intel manual, it seems that P6 and more recent CPUs guarantee this:

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.
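As a rough illustration of "locked instructions can be used to synchronize data written by one processor and read by another", here is a minimal C++ sketch of a spinlock built on `std::atomic<bool>::exchange`. It assumes an x86-64 compiler such as GCC or Clang, which typically emits `xchg` (implicitly locked) for the exchange; `SpinLock` and `guarded_total` are hypothetical names for this example, not part of any library.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-test-and-set spinlock. exchange() is an atomic
// read-modify-write (typically `xchg` on x86, which is implicitly locked),
// so only one thread can read the old value `false` and enter.
class SpinLock {
public:
    void lock() {
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait with plain loads so waiting cores can share the line
            // instead of repeatedly requesting exclusive ownership.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical helper: several threads increment a plain (non-atomic)
// variable; the spinlock hands each update to the next lock holder.
long guarded_total(int threads, int per_thread) {
    SpinLock lock;
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &total, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                lock.lock();
                ++total;  // protected by the spinlock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

The release store in `unlock` pairs with the acquire exchange in `lock`, so each new holder sees the previous holder's write to `total`.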

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.
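The store-buffer ("Dekker") litmus test is the usual way to see the StoreLoad ordering that a locked operation enforces. This sketch assumes C++11 atomics on x86-64, where `exchange` typically compiles to `xchg`; `count_both_zero`, `flag_a`, and `flag_b` are illustrative names. A count of 0 only means the forbidden outcome was not seen in that run; the guarantee itself comes from the memory model, not from testing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag_a{0};
std::atomic<int> flag_b{0};

// Store-buffer ("Dekker") litmus test: each thread publishes its own flag
// with a locked store and then reads the other thread's flag. A plain x86
// store may still sit in the store buffer when the later load executes, so
// both loads could return 0. exchange() (typically `xchg` on x86) waits for
// the store buffer to drain, and seq_cst in C++ forbids r1 == 0 && r2 == 0.
// Returns how many of `iterations` runs showed that forbidden outcome.
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        flag_a.store(0);
        flag_b.store(0);
        int r1 = -1;
        int r2 = -1;
        std::thread t1([&r1] {
            flag_a.exchange(1, std::memory_order_seq_cst);  // locked store
            r1 = flag_b.load(std::memory_order_seq_cst);
        });
        std::thread t2([&r2] {
            flag_b.exchange(1, std::memory_order_seq_cst);  // locked store
            r2 = flag_a.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```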

But another section says that on P6 and more recent processors, a LOCK can be carried out by cache locking, which does not lock the bus. How, then, can it guarantee that all operations are serialized?

8.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor. For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
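A small sketch of the situation 8.1.4 describes: several threads doing locked read-modify-writes on one counter that lies entirely within a single cache line. The 64-byte line size is an assumption (it holds for current Intel cores), and whether the hardware actually uses cache locking rather than asserting LOCK# cannot be observed from this code. `LineCounter`, `run_increments`, and `counter_in_one_line` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. Aligning the struct to 64 bytes keeps
// the 8-byte counter inside one line, the case in which the manual says
// the CPU may use "cache locking" instead of asserting LOCK#.
struct alignas(64) LineCounter {
    std::atomic<long> value{0};
};

LineCounter counter;

// Each fetch_add is an atomic read-modify-write (typically `lock add` or
// `lock xadd` on x86), so no increment is lost even though several cores
// contend for the same cache line.
long run_increments(int threads, int per_thread) {
    counter.value.store(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([per_thread] {
            for (int i = 0; i < per_thread; ++i)
                counter.value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return counter.value.load();
}

// True if the counter starts on a 64-byte boundary (and so cannot straddle
// two lines of the assumed size).
bool counter_in_one_line() {
    auto addr = reinterpret_cast<std::uintptr_t>(&counter.value);
    return addr % 64 == 0;
}
```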

And the memory-ordering section says that locked instructions have a total order:

In a multiple-processor system, the following ordering principles apply:

- Individual processors use the same ordering principles as in a single-processor system.
- Writes by a single processor are observed in the same order by all processors.
- Writes from an individual processor are NOT ordered with respect to the writes from other processors.
- Memory ordering obeys causality (memory ordering respects transitive visibility).
- Any two stores are seen in a consistent order by processors other than those performing the stores.
- Locked instructions have a total order.
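To make "writes by a single processor are observed in the same order by all processors" concrete, here is a message-passing litmus test sketch. It assumes GCC or Clang on x86-64, where relaxed and release stores typically compile to a plain `mov`; the portable C++ guarantee comes from the release/acquire pair. `payload`, `ready`, and `observe_after_flag` are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> payload{0};
std::atomic<bool> ready{false};

// Message-passing litmus test: the writer stores the payload and then the
// flag; the reader waits for the flag and then reads the payload. Because a
// single processor's writes are observed in order, a reader that sees the
// flag must also see the payload. In C++ the release store / acquire load
// pair expresses that guarantee portably.
int observe_after_flag(int value) {
    payload.store(0, std::memory_order_relaxed);
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread writer([value] {
        payload.store(value, std::memory_order_relaxed);  // first write
        ready.store(true, std::memory_order_release);      // second write
    });
    std::thread reader([&seen] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // wait until the flag is visible
        }
        seen = payload.load(std::memory_order_relaxed);
    });
    writer.join();
    reader.join();
    return seen;
}
```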

My confusion: on P6 and modern CPUs, do locked instructions serialize all load/store operations?

In other words, do locked instructions have a total order?

An example with four processors, P1, P2, P3, and P4, and two memory locations, A and B:

P1 executes a locked store that changes A.

P2 executes a locked store that changes B.

If locked stores did not have a total order, the following could happen:

P3 observes A changed --> B changed.

P4 observes B changed --> A changed.
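The P1..P4 scenario is the IRIW ("independent reads of independent writes") litmus test. Here is a C++ sketch of it, assuming `exchange` compiles to `xchg` on x86-64 and all operations use the default `seq_cst` ordering, which forbids the disagreement outcome. `count_disagreements`, `loc_a`, and `loc_b` are hypothetical names. The quoted rule that any two stores are seen in a consistent order by other processors suggests x86 forbids this outcome for plain stores as well.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> loc_a{0};
std::atomic<int> loc_b{0};

// IRIW litmus test for the P1..P4 example. P1 and P2 each perform a locked
// store (exchange(), typically `xchg` on x86); P3 reads A then B, P4 reads
// B then A. All operations default to seq_cst, whose single total order
// forbids P3 and P4 from disagreeing about which store happened first.
// Returns how many of `iterations` runs showed such a disagreement.
int count_disagreements(int iterations) {
    int disagreements = 0;
    for (int i = 0; i < iterations; ++i) {
        loc_a.store(0);
        loc_b.store(0);
        int r1 = -1, r2 = -1, r3 = -1, r4 = -1;
        std::thread p1([] { loc_a.exchange(1); });  // locked store to A
        std::thread p2([] { loc_b.exchange(1); });  // locked store to B
        std::thread p3([&r1, &r2] { r1 = loc_a.load(); r2 = loc_b.load(); });
        std::thread p4([&r3, &r4] { r3 = loc_b.load(); r4 = loc_a.load(); });
        p1.join();
        p2.join();
        p3.join();
        p4.join();
        // P3 saw A change before B; P4 saw B change before A.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) ++disagreements;
    }
    return disagreements;
}
```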

`lock` is fully serializing. Executing an instruction on one core can't affect the instructions executed on another core (without making `lock` very expensive), but the instructions on the same thread are serialized. The `lock`, even when elided in the cache, always acts as if it locked the "bus" (itself another as-if concept), since a line in the M or E state is present only in the current core's cache, and the core can delay snoops (from any agent, including PCIe masters) until the locked instruction has completed. — Margaret Bloom, Sep 16 '20 at 08:49
@MargaretBloom: `lock` is fully serializing *for memory ordering*. But note that it's not what x86 calls a "serializing instruction" like `cpuid`: it doesn't block out-of-order execution (e.g. of later ALU instructions). [Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/a/50496379) shows the difference. So be careful with the word "serializing" - besides its common English / computing meaning, "serializing instruction" has a specific technical meaning in x86. Nothing you said was wrong, but there's possible confusion. — Peter Cordes, Sep 16 '20 at 11:29
@PeterCordes oh, right, thank you. I keep forgetting it's not fully serializing :/ — Margaret Bloom, Sep 16 '20 at 18:41
@PeterCordes Intel's manual says the lock can be implemented with cache locking. Does a cache-coherency protocol like MESI guarantee a total order of two store operations? How is this implemented? — Chinaxing, Sep 17 '20 at 03:25
[What does it mean that "two store are seen in a consistent order by other processors"?](https://stackoverflow.com/q/63912669) explains how a total store order exists. To make it compatible with program-order for each core, commit from store buffer to L1d cache happens in program order. MESI requires that commit to cache can only happen when this core has exclusive ownership of a line. An atomic RMW can just stretch out that exclusive ownership, not replying to share requests or RFOs. [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850) — Peter Cordes, Sep 17 '20 at 15:22
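A sketch of the difference described above, assuming C++11 atomics: an increment split into an atomic load followed by an atomic store lets another core take the line in between, while `fetch_add` (a locked RMW on x86, typically `lock xadd` or `lock add`) keeps exclusive ownership across both halves. `hammer`, `split_counter`, and `rmw_counter` are illustrative names; how many updates the split version loses varies from run to run.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<long> split_counter{0};
std::atomic<long> rmw_counter{0};

// Each thread increments both counters `per_thread` times. The split
// increment is two separate memory operations, so concurrent updates can
// overwrite each other; fetch_add is a single atomic read-modify-write.
void hammer(int threads, int per_thread) {
    split_counter.store(0);
    rmw_counter.store(0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Not atomic as a whole: another core may write in between.
                split_counter.store(split_counter.load() + 1);
                // One atomic read-modify-write; no update is lost.
                rmw_counter.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
}
```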
