
On an x86_64 CPU, I have marked some page table entries as write-combining. These pages are backed by a PCIe BAR. After I finish storing to the memory in these pages, how can I flush the write-combining buffer? Intuitively, it seems like an sfence (or mfence) should do this, but I am not sure a flush is guaranteed. The Intel manual says the following about the sfence instruction:

Orders processor execution relative to all memory stores prior to the SFENCE instruction. The processor ensures that every store prior to SFENCE is globally visible before any store after SFENCE becomes globally visible. The SFENCE instruction is ordered with respect to memory stores, other SFENCE instructions, MFENCE instructions, and any serializing instructions (such as the CPUID instruction). It is not ordered with respect to memory loads or the LFENCE instruction.

It is unclear what "globally visible" means. Does this mean all of the other cores in the CPU or does it mean the entire system, including I/O devices (in which case a flush would be required)?

Alternatively, is a clflush what I want?

Jack Humphries
  • Do you have some ordering requirement wrt. later loads or stores done by this thread? If not, you don't need to do anything. The stores definitely will happen, but perhaps not until after some other stores (and loads) from this thread become visible to other things in the system (cores and devices). `sfence` is sufficient to block reordering of stores vs. other stores, e.g. after some weakly-ordered stores, before writing a `done=1` flag. – Peter Cordes Jun 16 '22 at 00:55
  • @PeterCordes Thanks Peter. I am not concerned about ordering here. Rather, I want to flush the buffer since it is critical that the writes reach the PCIe device with the least amount of latency possible. I mark the PTEs as write-combining since several writes occur per cache line and I don't want to wait for every individual write to initiate a bus txn, but once I'm done writing the cache line, I want the writes to reach the PCIe device as soon as possible. – Jack Humphries Jun 16 '22 at 02:44
  • After a write-combining buffer (LFB on Intel) has had all its bytes written at least once (a full-line write), my understanding is that it will send itself off-core immediately. Before that, it would be a partial-line write, and only happens if evicted by demand for LFBs for other incoming/outgoing lines. So if your stores are a multiple of 64 bytes, and aligned by 64, there's probably nothing to be gained with `sfence`; if you can benchmark latency, I'd expect it to make no difference then. But maybe some effect if there's any 64-byte line that's only partially written. – Peter Cordes Jun 16 '22 at 02:52
  • Related: [Does L1 cache accept new incoming requests while its Line Fill Buffers (LFBs) are fully exhausted?](https://stackoverflow.com/q/72201697) shows some performance experiments involving NT stores to WB memory, which I *think* should be like stores to WC memory, although obviously the actual cost of commit is different for DRAM vs. PCIe. But the important thing is that writing the other half of the cache line with an NT store frees up the LFB much sooner, which is why I intentionally avoided doing so in that microbenchmark experiment. – Peter Cordes Jun 16 '22 at 02:55
  • @PeterCordes Hi Peter -- do you know if it is possible to flush the write-combining buffer without stalling the pipeline? An sfence will stall the pipeline while the write-combining buffer is flushed -- which makes sense -- but I'm wondering if there is any way to flush it without stalling the pipeline. Thanks. – Jack Humphries Nov 16 '22 at 03:49
  • `sfence` on Intel doesn't stall the pipeline, BTW, but yeah AMD gives it much stronger semantics. Perhaps a `lock or byte [rsp], 0` dummy locked operation, although that's a full memory barrier, too. Does `clflushopt` or `clwb` on a specific line not work? Or you want to flush any that might be outstanding without knowing the address? Just normal activity should result in them getting evicted soon even if partially written. Completing a full line (writing all 64 bytes) should result in the WC buffer flushing itself ASAP. – Peter Cordes Nov 16 '22 at 04:17

1 Answer


In an answer to a different Stack Overflow question, Paul A. Clayton includes the following quote from Section 11.3 of the Intel manual:

If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.

Thus, Intel guarantees that either an sfence or an mfence will flush the write-combining buffer.

Jack Humphries
  • After `SFENCE` is completed, is the data in the write-combining buffer immediately flushed? – grayxu Apr 02 '23 at 11:06
  • @grayxu The sfence flushes the write-combining buffer, so once the sfence is complete, the buffer is already flushed. – Jack Humphries Apr 04 '23 at 02:59
  • Thank you for your response! As this question pertains to PCIe BAR, I am curious whether `sfence` operates synchronously or asynchronously in regards to flushing behavior on DDR-T of persistent memory (such as Intel DCPMM). According to the [pmdk doc](https://pmem.io/glossary/), sfence does guarantee persistency. However, [one research paper](https://dl.acm.org/doi/pdf/10.1145/3492321.3519556) claims that *with DDR-T, memory barriers only ensure that cacheline flushes are globally visible but not necessarily completed by the time a fence instruction returns.* – grayxu Apr 05 '23 at 09:40
  • @grayxu I don't know for sure, but I would suspect that `sfence` operates asynchronously. A trip across the PCIe bus is O(1 µs), but when I run `sfence`, my CPU does not stall for 2 µs (a round-trip for the write and a completion notification) or even 1 µs. I took a look at the pmdk doc you linked, and it seems that `sfence` *does* guarantee persistence in that case. This is because the `sfence` ensures the write is globally visible, so the eADR technology can see the write and will thus be able to flush the write from the CPU cache to the persistent memory _on power loss_. – Jack Humphries Apr 07 '23 at 07:24
  • @grayxu Thus, to be clear, both the pmdk doc and the research paper are correct. Their claims do not contradict one another. – Jack Humphries Apr 07 '23 at 07:28