`MASKMOVDQU` is indeed slow and probably never a good idea: something like one per 6 cycles throughput on Skylake, or one per 18 cycles on Zen 2 / Zen 3.
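For reference, the SSE2 intrinsic for this instruction is `_mm_maskmoveu_si128`. A minimal sketch of the kind of partial-line masked write it does (the function name and mask choice here are just for illustration):

```c
#include <emmintrin.h>   // SSE2: _mm_maskmoveu_si128 (MASKMOVDQU)
#include <stdint.h>

// Store only the first 7 bytes of src into dst, leaving the rest untouched.
// MASKMOVDQU writes each byte of 'data' whose mask byte has its high bit set,
// with an NT hint -- exactly the slow partial-line case discussed here.
void store_first7_maskmovdqu(char *dst, const uint8_t src[16])
{
    __m128i data = _mm_loadu_si128((const __m128i *)src);
    __m128i mask = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, 0,
                                 0, 0, 0, 0, 0, 0, 0, 0);
    _mm_maskmoveu_si128(data, mask, dst);
}
```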
I suspect that masked NT vector stores no longer work well for multi-core CPUs, so probably even the 128-bit version just sucks on modern x86 for masked writes, if there are any unmodified bytes in a full 64-byte line.
Regular (not NT) masked vector stores are back with a vengeance in AVX-512. Masked commit to L1d cache seems to be efficiently supported for that, and for dword / qword masking with AVX1 `vmaskmovps/pd` and the AVX2 integer equivalents, on Intel CPUs. (Although not AMD: AMD only has efficient masked AVX1/2 loads, not stores. https://uops.info/table.html shows `VPMASKMOVD M256, YMM, YMM` on Zen 3 is 42 uops with one per 12 cycles throughput, about the same as Zen 2, vs. 3 uops / 1c latency on Skylake. Masked loads are fine on AMD: 1 uop, 0.5c throughput, so actually better than Skylake for the AVX2 versions. Skylake probably does a compare-into-mask internally and reuses the hardware designed for AVX-512.)
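As a concrete example of the dword-granularity masked store being compared (a sketch; the function and variable names are mine), the store form of `vpmaskmovd` is exposed as the AVX2 intrinsic `_mm256_maskstore_epi32`, cheap on Intel but many uops on Zen 2 / Zen 3 per the uops.info numbers above:

```c
#include <immintrin.h>   // AVX2: _mm256_maskstore_epi32 (vpmaskmovd store form)

// Store only the first n dwords (n <= 8) of 'v' to dst; lanes whose mask
// dword has its high bit clear are left unmodified in memory.
void store_first_n_dwords(int *dst, __m256i v, int n)
{
    // Build a mask with the high bit set in the first n dword lanes.
    __m256i lane_ids = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane_ids);
    _mm256_maskstore_epi32(dst, mask, v);
}
```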
AVX512F made masking with dword/qword granularity a first-class citizen, with very efficient support for both loads and stores. AVX512BW adds 8 and 16-bit element sizes, including masked load/store like `vmovdqu8`, which is also efficiently supported on Intel hardware: a single uop even for stores.
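For example, a byte-masked store with AVX-512BW compiles to a masked `vmovdqu8`. A sketch (names are just for illustration) of storing only the first n bytes of a 64-byte vector:

```c
#include <immintrin.h>   // AVX-512BW: _mm512_mask_storeu_epi8 (masked vmovdqu8)
#include <stdint.h>

// Store only the first n bytes (n <= 64) of 'v'; masked-out bytes in dst are
// left unmodified. This is a normal cacheable store, unlike MASKMOVDQU's NT hint.
void store_first_n_bytes(uint8_t *dst, __m512i v, unsigned n)
{
    __mmask64 k = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
    _mm512_mask_storeu_epi8(dst, k, v);
}
```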
The SDRAM bus protocol does support byte-masked writes (with one data-mask line per byte as part of a cache-line burst transfer). This Intel doc (about FPGAs or something) includes discussion of the DM (data mask) signals, confirming that DDR4 still has them, with the same function as the DQM lines described on Wikipedia for SDR SDRAM: https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#SDR_SDRAM. (DDR1 changed it to a write-mask only, not a read-mask.)
So the hardware functionality is there, and presumably modern x86 CPUs use it for single-byte writes to uncacheable memory, for example.
(Update: byte-masking may only be optional in DDR4, unlike some earlier SDRAM / DDR versions. In that case, the store could still reach the memory controller in masked form, but the memory controller would have to read/modify/write the containing 8-byte chunk(s) using separate burst-read and burst-write commands to the actual DIMM. Chopping the bursts short is possible for stores that only affect part of a 64-byte DDR burst, saving some data bandwidth, but there's still the command overhead, and the store occupies buffer space in the memory controller for longer.)
No-RFO stores are great if we write a full line: we just invalidate other copies of the line and store to memory.
John "Dr. Bandwidth" McCalpin says that normal NT stores that flush after filling a full 64-byte line will invalidate even lines that are dirty, without causing a writeback of the dirty data.
So masked NT stores need to use a different mechanism, because any masked-out bytes need to take their value from the dirty line in another core, not from whatever was in DRAM.
If the mechanism for partial-line NT stores isn't efficient, adding new instructions that generate them is unwise. I don't know if it's more or less efficient than doing normal stores to part of a line, or if that depends on the situation and uarch.
It doesn't have to be an RFO exactly, but it would mean that when such a store reaches the memory controller, it would have to get the snoop filter to make sure the line is in sync, or maybe merge with the old contents from cache before flushing to DRAM. Or the CPU core could do an RFO and merge before sending the full-line write down the memory hierarchy.
CPUs do already need some kind of mechanism for flushing partial-line NT stores when reclaiming an LFB that hasn't had all 64 bytes written yet, and we know that's not as efficient. (But I forget the details.) But maybe this is how `maskmovdqu` executes on modern CPUs, either always or if you leave any bytes unmodified.
An experiment could probably find out.
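A rough sketch of such an experiment (everything here is an assumption about how one might test it, not measured data, and it uses GCC/clang syntax): time a loop of `maskmovdqu` stores to the same line with a full mask vs. a partial mask and compare cycles per store:

```c
#include <emmintrin.h>   // _mm_maskmoveu_si128
#include <x86intrin.h>   // __rdtsc (GCC/clang)
#include <stdint.h>
#include <stdio.h>

#define ITERS 10000000ULL

// Time ITERS masked NT stores to the same cache line, in cycles per store.
static uint64_t time_maskmov(char *buf, __m128i data, __m128i mask)
{
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        _mm_maskmoveu_si128(data, mask, buf);
    return (__rdtsc() - start) / ITERS;
}

int main(void)
{
    static char buf[64] __attribute__((aligned(64)));
    __m128i data = _mm_set1_epi8(0x5a);
    __m128i full = _mm_set1_epi8(-1);        // all 16 bytes stored
    __m128i part = _mm_slli_si128(full, 1);  // lowest byte masked out

    printf("full mask:    ~%llu cycles/store\n",
           (unsigned long long)time_maskmov(buf, data, full));
    printf("partial mask: ~%llu cycles/store\n",
           (unsigned long long)time_maskmov(buf, data, part));
    return 0;
}
```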
So TL:DR: `maskmovdqu` may only have been implemented efficiently in single-core CPUs. It originated in Katmai Pentium III with MMX `maskmovq mm0, mm1`; SMP systems existed, but maybe weren't the primary consideration for this instruction when it was being designed. SMP systems didn't have a shared last-level cache, but they did still have private write-back L1d cache on each socket.