6

I have read section 9.2, MEMORY ORDERING, of the Intel 64 and IA-32 Architectures SDM, vol 3A, but one question keeps bothering me.

Suppose I first write to a memory address and then send an interprocessor interrupt (IPI) using x2APIC, which means that sending the IPI does not involve a memory write (it just uses wrmsr). When another core receives the IPI and reads that memory, will it read the correct value?

For example:

Initially x = 0

Processor 0:

mov [ _x], 1
wrmsr       # use x2APIC to send IPI

Processor 1:

# receive the IPI; in the interrupt service routine:
mov r1, [ _x]

Is r1 = 0 allowed ?

Peter Cordes
untitled

2 Answers

3

That is an interesting question. On the face of it, one would think that since WRMSR is a serializing instruction it flushes the preceding memory writes and all is well. Even then, to quote the manual:

These instructions force the processor to complete all modifications to flags, registers, and memory by previous instructions and to drain all buffered writes to memory before the next instruction is fetched and executed.

(Emphasis mine)

It doesn't say anything about ordering with respect to sending the IPI, since that is part of the current instruction, not the next one. So in theory the other core could execute the mov r1, [ _x] while the originating core is still busy flushing its writes. That is very unlikely, though, because the target core first has to service the interrupt, which probably has a much higher latency.

As @harold mentioned, this point is moot since WRMSR is not always serializing. Reading the footnote that I initially missed:

WRMSR to the IA32_TSC_DEADLINE MSR (MSR index 6E0H) and the X2APIC MSRs (MSR indices 802H to 83FH) are not serializing.

So there is absolutely no guarantee that the write to x is flushed.

Jester
  • It's possibly worse than that since: "An execution of WRMSR to any non-serializing MSR is not serializing. Non-serializing MSRs include ... any of the x2APIC MSRs" – harold May 28 '23 at 18:40
  • Nice find! That's what I get for not reading the footnote. Thanks! – Jester May 28 '23 at 18:41
  • Thanks for your answer! It seems to be more interesting now :) – untitled May 28 '23 at 18:43
  • So, will `mov [ _x], 1; sfence; wrmsr` make it safe? – untitled May 28 '23 at 18:56
  • Which of the following are safe: `mov [ _x], 1; sfence; wrmsr`, `mov [ _x], 1; lfence; wrmsr`, `mov [ _x], 1; SERIALIZE; wrmsr`, `mov r1, 1; xchg [_x], r1 (or lock cmpxchg); wrmsr`? – untitled May 29 '23 at 05:19
  • @untitled: No, think of `sfence` as just putting a divider on the "conveyor belt" that is the store buffer. It can exec and even retire from the ROB while older stores are still not committed to L1d cache. See [Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?](https://stackoverflow.com/q/27627969) – Peter Cordes May 30 '23 at 18:37
2

From Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1

11.12.3 MSR Access in x2APIC Mode

To allow for efficient access to the APIC registers in x2APIC mode, the serializing semantics of WRMSR are relaxed when writing to the APIC registers. Thus, system software should not use “WRMSR to APIC registers in x2APIC mode” as a serializing instruction. Read and write accesses to the APIC registers will occur in program order. A WRMSR to an APIC register may complete before all preceding stores are globally visible; software can prevent this by inserting a serializing instruction or the sequence MFENCE;LFENCE before the WRMSR.

The RDMSR instruction is not serializing and this behavior is unchanged when reading APIC registers in x2APIC mode. System software accessing the APIC registers using the RDMSR instruction should not expect a serializing behavior. (Note: The MMIO-based xAPIC interface is mapped by system software as an un-cached region. Consequently, read/writes to the xAPIC-MMIO interface have serializing semantics in the xAPIC mode.)
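The MFENCE;LFENCE sequence mentioned in the quoted section could be sketched roughly as follows. This is only an illustration under assumptions, not kernel code: WRMSR is privileged, so the IPI is replaced by a hypothetical `send_ipi` callback, and the names `shared_x`, `publish_fence`, `publish_then_notify` and `observe_x` are made up for the example. It assumes GCC or Clang inline assembly; on non-x86 targets it falls back to a compiler-provided full fence purely so that it still builds.

```c
#include <stdint.h>

/* Hypothetical stand-in for "WRMSR to the x2APIC ICR (MSR 830H)".
   In a real kernel this would be the privileged WRMSR itself. */
typedef void (*send_ipi_fn)(void *ctx);

/* The shared variable from the question (x, initially 0). */
static volatile uint64_t shared_x;

/* The sequence the manual suggests before the WRMSR: MFENCE;LFENCE. */
static inline void publish_fence(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("mfence\n\tlfence" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST); /* non-x86 stand-in, assumption */
#endif
}

/* Store, fence, then "send the IPI" through the callback. */
static void publish_then_notify(uint64_t value, send_ipi_fn send_ipi, void *ctx)
{
    shared_x = value;   /* mov [_x], 1            */
    publish_fence();    /* mfence; lfence         */
    send_ipi(ctx);      /* wrmsr (x2APIC ICR)     */
}

/* Example callback: plays the role of the interrupt handler reading x. */
static void observe_x(void *ctx)
{
    *(uint64_t *)ctx = shared_x;   /* mov r1, [_x] */
}
```

In this single-threaded stand-in, calling `publish_then_notify(1, observe_x, &seen)` leaves `seen` equal to 1. That only shows the shape of the code, of course; the cross-core guarantee comes from the fences as described in the manual.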

However, I still don't know whether this will work on AMD processors.

untitled
  • Using `xchg` for the store, followed by `lfence`, should be cheaper. `xchg` is a full barrier, and its own store commits to cache as part of execution since it's an atomic RMW. Following it with `lfence` should also work on AMD with Spectre mitigation enabled, which makes `lfence` work as an execution barrier like on Intel, instead of a `nop`. [Is LFENCE serializing on AMD processors?](https://stackoverflow.com/q/51844886) `mfence;lfence` will almost certainly work on AMD as well, if you want to use the slower sequence Intel's manual mentions. – Peter Cordes May 30 '23 at 18:34
  • @PeterCordes Thanks very much! What about `lock cmpxchg`, `lock add`, `lock xadd`, `lock inc`, followed by `lfence`? – untitled May 31 '23 at 09:41
  • Those are all equivalent to `xchg` with its implicit `lock` prefix in how they work as an atomic RMW. The fact that there's an ALU operation between the load + store doesn't make things any harder or easier for the microcode. – Peter Cordes May 31 '23 at 10:01
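
The `xchg` followed by `lfence` variant suggested in the comments might look something like this. Again this is a sketch under assumptions: the IPI is a hypothetical `send_ipi` callback because WRMSR is privileged, all names (`flag_x`, `publish_xchg_then_notify`, `read_flag`) are invented for the example, and it relies on GCC or Clang, which typically compile a sequentially consistent exchange on x86 to `xchg` (implicitly locked). On non-x86 targets the `lfence` is simply omitted so that the code still builds.

```c
#include <stdint.h>

/* Hypothetical stand-in for the privileged WRMSR that sends the IPI. */
typedef void (*send_ipi_fn)(void *ctx);

/* The shared variable from the question (x, initially 0). */
static uint64_t flag_x;

/* Do the store as an atomic exchange, then lfence, then "send the IPI". */
static void publish_xchg_then_notify(uint64_t value, send_ipi_fn send_ipi, void *ctx)
{
    /* Atomic RMW store: a full barrier whose own store is committed
       as part of executing the instruction. */
    (void)__atomic_exchange_n(&flag_x, value, __ATOMIC_SEQ_CST);
#if defined(__x86_64__)
    /* Keep the following WRMSR from executing before the xchg completes. */
    __asm__ __volatile__("lfence" ::: "memory");
#endif
    send_ipi(ctx);   /* wrmsr (x2APIC ICR) in a real kernel */
}

/* Example callback: plays the role of the interrupt handler reading x. */
static void read_flag(void *ctx)
{
    *(uint64_t *)ctx = __atomic_load_n(&flag_x, __ATOMIC_RELAXED);
}
```

Per the comments above, any other `lock`-prefixed RMW should behave the same way as the exchange here; whether `lfence` acts as an execution barrier on AMD depends on the Spectre mitigation setting discussed in the linked question.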