On x86-64 we have movntdq for "non-temporal" stores, which bypass the caches and go to main memory through write-combining buffers. And we have prefetcht2 to prefetch into L3 but not L1/L2.
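For concreteness, here is a minimal sketch of the two things I mean, written with the SSE2 intrinsics that compilers normally lower to movntdq and prefetcht2. The function names fill_nt and sum_with_t2_prefetch, and the prefetch distance of 64 elements, are made up for illustration. Which cache level a T2 prefetch actually lands in is also up to the microarchitecture.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch and _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill a 16-byte-aligned buffer using non-temporal
   stores (_mm_stream_si128 normally compiles to movntdq). */
static void fill_nt(int32_t *dst, size_t n_vecs, int32_t value)
{
    __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < n_vecs; i++)
        _mm_stream_si128((__m128i *)dst + i, v);
    _mm_sfence();  /* order the streaming stores before later stores */
}

/* Hypothetical helper: sum an array while issuing T2 prefetches
   (_MM_HINT_T2 normally compiles to prefetcht2) a fixed distance ahead. */
static int64_t sum_with_t2_prefetch(const int32_t *src, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* One prefetch per 64-byte line; the distance of 64 ints is an
           arbitrary assumption, not a tuned value. */
        if (i % 16 == 0 && i + 64 < n)
            _mm_prefetch((const char *)(src + i + 64), _MM_HINT_T2);
        sum += src[i];
    }
    return sum;
}
```

So there is a store that skips the caches entirely and a load-side hint that targets the outer cache, but no obvious store-side equivalent of the latter.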
But can we store to L3? It seems like this would be useful when data is produced by one core and consumed on another: there is no point putting it in the producer core's L1/L2 cache, because it will never be read from there.
If the answer is that this is not supported on x86-64, I'd be curious to know whether there's a specific reason it can't or shouldn't be done (e.g. it would not improve performance for some reason I haven't thought of).