`MASKMOVDQU` is indeed slow and probably never a good idea: something like one per 6 cycles throughput on Skylake, or one per 18 cycles on Zen 2 / Zen 3.
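For reference, the SSE2 intrinsic for this instruction is `_mm_maskmoveu_si128`. A minimal sketch of the kind of partial-line masked write it does (the function name and mask choice here are just for illustration):

```c
#include <emmintrin.h>   // SSE2: _mm_maskmoveu_si128 (MASKMOVDQU)
#include <stdint.h>

// Store only the first 7 bytes of src into dst, leaving the rest untouched.
// MASKMOVDQU writes each byte of 'data' whose mask byte has its high bit set,
// with an NT hint -- exactly the slow partial-line case discussed here.
void store_first7_maskmovdqu(char *dst, const uint8_t src[16])
{
    __m128i data = _mm_loadu_si128((const __m128i *)src);
    __m128i mask = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, 0,
                                 0, 0, 0, 0, 0, 0, 0, 0);
    _mm_maskmoveu_si128(data, mask, dst);
}
```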
I suspect that masked NT vector stores no longer work well for multi-core CPUs, so probably even the 128-bit version just sucks on modern x86 for masked writes, if there are any unmodified bytes in a full 64-byte line.
Regular (not NT) masked vector stores are back with a vengeance in AVX-512. Masked commit to L1d cache seems to be efficiently supported for that, and for dword / qword masking with AVX1 `vmaskmovps/pd` and the AVX2 integer equivalents, on Intel CPUs. (Although not AMD: AMD only has efficient masked AVX1/2 loads, not stores. https://uops.info/table.html shows `VPMASKMOVD M256, YMM, YMM` on Zen 3 is 42 uops with one per 12 cycles throughput, about the same as Zen 2, vs. 3 uops / 1c latency on Skylake. Masked loads are fine on AMD: 1 uop, 0.5c throughput, so actually better than Skylake for the AVX2 versions. Skylake probably does a compare-into-mask internally and reuses the hardware designed for AVX-512.)
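As a concrete example of the dword-granularity masked store being compared (a sketch; the function and variable names are mine), the store form of `vpmaskmovd` is exposed as the AVX2 intrinsic `_mm256_maskstore_epi32`, cheap on Intel but many uops on Zen 2 / Zen 3 per the uops.info numbers above:

```c
#include <immintrin.h>   // AVX2: _mm256_maskstore_epi32 (vpmaskmovd store form)

// Store only the first n dwords (n <= 8) of 'v' to dst; lanes whose mask
// dword has its high bit clear are left unmodified in memory.
void store_first_n_dwords(int *dst, __m256i v, int n)
{
    // Build a mask with the high bit set in the first n dword lanes.
    __m256i lane_ids = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane_ids);
    _mm256_maskstore_epi32(dst, mask, v);
}
```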
AVX512F made masking with dword/qword granularity a first-class citizen, with very efficient support for both loads and stores. AVX512BW adds 8 and 16-bit element sizes, including masked load/store like `vmovdqu8`, which is also efficiently supported on Intel hardware: a single uop even for stores.
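For example, a byte-masked store with AVX-512BW compiles to a masked `vmovdqu8`. A sketch (names are just for illustration) of storing only the first n bytes of a 64-byte vector:

```c
#include <immintrin.h>   // AVX-512BW: _mm512_mask_storeu_epi8 (masked vmovdqu8)
#include <stdint.h>

// Store only the first n bytes (n <= 64) of 'v'; masked-out bytes in dst are
// left unmodified. This is a normal cacheable store, unlike MASKMOVDQU's NT hint.
void store_first_n_bytes(uint8_t *dst, __m512i v, unsigned n)
{
    __mmask64 k = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
    _mm512_mask_storeu_epi8(dst, k, v);
}
```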
The SDRAM bus protocol does support byte-masked writes (with one data-mask line per byte as part of a cache-line burst transfer). This Intel doc (about FPGAs or something) includes discussion of the DM (data mask) signals, confirming that DDR4 still has them, with the same function as the DQM lines described on Wikipedia for SDR SDRAM: https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#SDR_SDRAM. (DDR1 changed it to a write-mask only, not a read-mask.)
So the hardware functionality is there, and presumably modern x86 CPUs use it for single-byte writes to uncacheable memory, for example.
(Update: byte-masking may only be optional in DDR4, unlike some earlier SDRAM / DDR versions. In that case, the store could still reach the memory controller in masked form, but the memory controller would have to read/modify/write the containing 8-byte chunk(s) using separate burst-read and burst-write commands to the actual DIMM. Chopping the bursts short is possible for stores that only affect part of a 64-byte DDR burst, saving some data bandwidth, but there's still the command overhead, and the store occupies buffer space in the memory controller for longer.)
No-RFO stores are great if we write a full line: we just invalidate other copies of the line and store to memory.
John "Dr. Bandwidth" McCalpin says that normal NT stores that flush after filling a full 64-byte line will invalidate even lines that are dirty, without causing a writeback of the dirty data.
So masked NT stores need to use a different mechanism, because any masked-out bytes need to take their value from the dirty line in another core, not from whatever was in DRAM.
If the mechanism for partial-line NT stores isn't efficient, adding new instructions that generate them is unwise. I don't know if it's more or less efficient than doing normal stores to part of a line, or if that depends on the situation and uarch.
It doesn't have to be an RFO exactly, but it would mean that when such a store reaches the memory controller, it would have to get the snoop filter to make sure the line is in sync, or maybe merge with the old contents from cache before flushing to DRAM. Or the CPU core could do an RFO and merge before sending the full-line write down the memory hierarchy.
CPUs do already need some kind of mechanism for flushing partial-line NT stores when reclaiming an LFB that hasn't had all 64 bytes written yet, and we know that's not as efficient. (But I forget the details.) But maybe this is how `maskmovdqu` executes on modern CPUs, either always or if you leave any bytes unmodified.
An experiment could probably find out.
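A rough sketch of such an experiment (everything here is an assumption about how one might test it, not measured data, and it uses GCC/clang syntax): time a loop of `maskmovdqu` stores to the same line with a full mask vs. a partial mask and compare cycles per store:

```c
#include <emmintrin.h>   // _mm_maskmoveu_si128
#include <x86intrin.h>   // __rdtsc (GCC/clang)
#include <stdint.h>
#include <stdio.h>

#define ITERS 10000000ULL

// Time ITERS masked NT stores to the same cache line, in cycles per store.
static uint64_t time_maskmov(char *buf, __m128i data, __m128i mask)
{
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        _mm_maskmoveu_si128(data, mask, buf);
    return (__rdtsc() - start) / ITERS;
}

int main(void)
{
    static char buf[64] __attribute__((aligned(64)));
    __m128i data = _mm_set1_epi8(0x5a);
    __m128i full = _mm_set1_epi8(-1);        // all 16 bytes stored
    __m128i part = _mm_slli_si128(full, 1);  // lowest byte masked out

    printf("full mask:    ~%llu cycles/store\n",
           (unsigned long long)time_maskmov(buf, data, full));
    printf("partial mask: ~%llu cycles/store\n",
           (unsigned long long)time_maskmov(buf, data, part));
    return 0;
}
```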
So TL:DR: `maskmovdqu` may only have been implemented efficiently in single-core CPUs. It originated in Katmai Pentium III with MMX `maskmovq mm0, mm1`; SMP systems existed, but maybe weren't the primary consideration for this instruction when it was being designed. SMP systems didn't have a shared last-level cache, but they did still have private write-back L1d cache on each socket.