Non-temporal store to bypass L1/L2 but cache in L3 to be read by another core

Question

On x86-64 we have `movntdq` for "non-temporal" (uncached) stores directly to main memory. And we have `prefetcht2` to prefetch into L3 but not L1/L2.
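For concreteness, here is a minimal sketch of the two existing tools as C intrinsics. The helper names `nt_copy` and `prefetch_outer` are made up for illustration, and the sketch assumes a 16-byte-aligned destination and a size that is a multiple of 16. Which cache level a `prefetcht2` hint actually fills is implementation-dependent, so treat the L3 comment as the intent rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128 (movntdq), _mm_loadu_si128 */
#include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence */
#endif

/* Hypothetical helper: copy n_bytes to dst using non-temporal stores.
 * Assumes dst is 16-byte aligned and n_bytes is a multiple of 16. */
static void nt_copy(void *dst, const void *src, size_t n_bytes)
{
    assert((uintptr_t)dst % 16 == 0 && n_bytes % 16 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n_bytes / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));  /* movntdq */
    _mm_sfence();  /* order the NT stores before any later "ready" flag store */
#else
    memcpy(dst, src, n_bytes);  /* portable fallback: ordinary cached stores */
#endif
}

/* Hypothetical helper: hint that the line at p should be fetched into an
 * outer cache level (prefetcht2). It is only a hint and may be ignored. */
static void prefetch_outer(const void *p)
{
#if defined(__SSE2__)
    _mm_prefetch((const char *)p, _MM_HINT_T2);
#else
    (void)p;
#endif
}
```

Neither helper gives "store to L3": `nt_copy` sends the data past every cache level, and `prefetch_outer` only helps on the load side.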

But can we store to L3? It seems like this should be useful where data is produced by one core and consumed on another--there is no use putting it in L1/L2 cache on the producer core because it will never be read from there.

If the answer is that this is not supported on x86-64, I'd be curious to know if there's a specific reason it can't or shouldn't be done (e.g. it would not improve performance because of some reason I haven't thought of).

Unfortunately no, and `cldemote` isn't available yet on mainstream CPUs. (Just Tremont, I think). This question ([CPU cache inhibition](https://stackoverflow.com/a/47102784)) is looking for the exact same thing you are (store to L3). @prl's answer there mentions cldemote, and other answers (including mine) cover other aspects. — Peter Cordes, Jan 20 '21 at 01:15
Having a store that only updated L3 without disturbing local L1d and L2 would need a special hardware mechanism just to support it. (And it would probably hardly get any use for a long time if introduced now). And might only be efficient for full-line stores the way NT stores are, although I guess partial-line LFB eviction for NT stores does have to handle sending a store out towards memory with a mask of which bits are valid. — Peter Cordes, Jan 20 '21 at 01:31
Thanks @PeterCordes. It'd be fine with me if it only supported full-line stores. `cldemote` seems like a performance optimization for the reader to see the data sooner, not for the writer to avoid L1/L2 pollution. — John Zwinck, Jan 20 '21 at 04:46
Unfortunately for you, there isn't anything that supports even full-line stores, sorry if I got your hopes up. I was just thinking out loud about design challenges for a hypothetical instruction to do this. With the major one being that it's yet another kind of store that the interconnect has to know about, so it'd be additional complexity everywhere — Peter Cordes, Jan 20 '21 at 04:51
You might get some pollution *reduction* by doing cldemote after storing a line, hopefully limiting pollution to one way in each set instead of letting pseudo-LRU eviction keep the recently written "useless" line next time you alias that set. Full pollution avoidance is impossible, except with `movnt` which also bypasses L3. And BTW, the usual reason people ask about doing this is to reduce cache-miss latency on the read side; I think that's the intended use-case for cldemote. — Peter Cordes, Jan 20 '21 at 04:55
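The "store, then `cldemote`" idea from the last comment could look roughly like the sketch below. It is not a real store-to-L3: the line still lands in the producer's L1d first, and the demote is only a hint that the hardware may ignore. The names `publish_line`, `demote_line`, and `LINE_BYTES` are made up for illustration, and the 64-byte line size is an assumption. The `_mm_cldemote` intrinsic is only used when the compiler is invoked with `-mcldemote`; otherwise the helper compiles to nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__CLDEMOTE__)
#include <immintrin.h>   /* _mm_cldemote; available when built with -mcldemote */
#endif

#define LINE_BYTES 64    /* assumed cache-line size */

/* Hypothetical helper: hint that the line containing p can be moved to a
 * cache level further from this core. Without -mcldemote it does nothing. */
static void demote_line(void *p)
{
#if defined(__CLDEMOTE__)
    _mm_cldemote(p);
#else
    (void)p;
#endif
}

/* Hypothetical producer step: write one full line with ordinary stores,
 * then ask for it to be demoted so it is less likely to stay in L1d/L2. */
static void publish_line(void *dst, const void *src)
{
    assert((uintptr_t)dst % LINE_BYTES == 0);
    memcpy(dst, src, LINE_BYTES);
    demote_line(dst);
}
```

Whether this reduces pollution or read-side latency at all would have to be measured on hardware that actually implements `cldemote`.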

0 Answers