
On x86-64 we have `movntdq` for "non-temporal" (uncached) stores directly to main memory. And we have `prefetcht2` to prefetch into L3 but not L1/L2.
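
For concreteness, here is a minimal sketch of those two instructions as C intrinsics. The helper names `fill_nontemporal` and `sum_with_prefetch` and the 256-byte prefetch distance are made up for illustration, and which cache level a T2 hint actually fills depends on the microarchitecture.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_set1_epi8 */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T2, _mm_sfence */

/* Hypothetical helper: fill a 16-byte-aligned buffer using non-temporal
 * stores (movntdq), which bypass the cache hierarchy entirely. */
static void fill_nontemporal(void *dst, uint8_t value, size_t bytes)
{
    assert((uintptr_t)dst % 16 == 0 && bytes % 16 == 0);
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);  /* compiles to movntdq */
    _mm_sfence();  /* order the weakly-ordered NT stores before later stores */
}

/* Hypothetical helper: sum a buffer while hinting upcoming lines towards
 * the outer cache levels (prefetcht2). The distance of 256 bytes is an
 * arbitrary assumption, not a tuned value. */
static uint64_t sum_with_prefetch(const uint8_t *src, size_t bytes)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < bytes; i++) {
        if (i % 64 == 0 && i + 256 < bytes)
            _mm_prefetch((const char *)(src + i + 256), _MM_HINT_T2);
        sum += src[i];
    }
    return sum;
}
```

What I'm asking about is the missing counterpart on the store side: something like `fill_nontemporal` that stops at L3 instead of going all the way to memory.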

But can we store to L3? It seems like this should be useful where data is produced by one core and consumed on another--there is no use putting it in L1/L2 cache on the producer core because it will never be read from there.

If the answer is that this is not supported on x86-64, I'd be curious to know if there's a specific reason it can't or shouldn't be done (e.g. it would not improve performance because of some reason I haven't thought of).

Peter Cordes
John Zwinck
  • Unfortunately no, and `cldemote` isn't available yet on mainstream CPUs. (Just Tremont, I think). This question ([CPU cache inhibition](https://stackoverflow.com/a/47102784)) is looking for the exact same thing you are (store to L3). @prl's answer there mentions `cldemote`, and other answers (including mine) cover other aspects. – Peter Cordes Jan 20 '21 at 01:15
  • Having a store that only updated L3 without disturbing local L1d and L2 would need a special hardware mechanism just to support it. (And it would probably hardly get any use for a long time if introduced now). And might only be efficient for full-line stores the way NT stores are, although I guess partial-line LFB eviction for NT stores does have to handle sending a store out towards memory with a mask of which bits are valid. – Peter Cordes Jan 20 '21 at 01:31
  • Thanks @PeterCordes. It'd be fine with me if it only supported full-line stores. `cldemote` seems like a performance optimization for the reader to see the data sooner, not for the writer to avoid L1/L2 pollution. – John Zwinck Jan 20 '21 at 04:46
  • Unfortunately for you, there isn't anything that supports even full-line stores; sorry if I got your hopes up. I was just thinking out loud about design challenges for a hypothetical instruction to do this, with the major one being that it's yet another kind of store that the interconnect has to know about, so it'd be additional complexity everywhere. – Peter Cordes Jan 20 '21 at 04:51
  • You might get some pollution *reduction* by doing `cldemote` after storing a line, hopefully limiting pollution to one way in each set instead of letting pseudo-LRU eviction keep the recently written "useless" line next time you alias that set. Full pollution avoidance is impossible, except with `movnt` which also bypasses L3. And BTW, the usual reason people ask about doing this is to reduce cache-miss latency on the read side; I think that's the intended use-case for `cldemote`. – Peter Cordes Jan 20 '21 at 04:55

0 Answers