`clflushopt` is a terrible idea for this use-case. Evicting lines from the cache before overwriting them is the opposite of what you want: if they're hot in cache, you avoid an RFO (read-for-ownership).
If you're using NT stores, they will evict any lines that were still hot, so it's not worth spending cycles doing `clflushopt` first.
If not, you're completely shooting yourself in the foot by guaranteeing the worst case. See Enhanced REP MOVSB for memcpy for more about writing to memory and RFO vs. no-RFO stores. (e.g. `rep movsb` can do no-RFO stores, on Intel at least, but still leave the data hot in cache.) And keep in mind that an L3 hit can satisfy an RFO faster than going to DRAM.
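For concreteness, here's a minimal sketch (not from the original answer) of filling a buffer with SSE2 streaming stores. `fill_nt` is a made-up name, and it assumes a 16-byte-aligned buffer whose size is a multiple of 16. The NT stores themselves write-combine and bypass/evict the cache, which is why running `clflushopt` over the buffer first would be wasted work:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: fill a 16-byte-aligned buffer with NT (streaming) stores.
 * Assumes bytes is a multiple of 16.  movntdq writes around the cache,
 * so there is no point flushing the buffer with clflushopt beforehand. */
static void fill_nt(void *dst, size_t bytes, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    char *p = (char *)dst;
    for (size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);   /* NT store: no RFO */
    _mm_sfence();   /* only needed to order the NT stores before a later
                       "data ready" store that another core will check */
}
```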
If you're about to write a buffer with regular stores (that will RFO), you might `prefetchw` on it to get it into Exclusive state in your L1D before you're ready to actually write.
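A hedged sketch of that idea, assuming GCC/Clang and 64-byte cache lines (the function name is mine): `__builtin_prefetch` with write intent can compile to `prefetchw` on targets that support it (e.g. with `-mprfchw` or a suitable `-march`):

```c
#include <stddef.h>

/* Sketch: warm a buffer into Exclusive/Modified state before overwriting
 * it with regular stores.  With write intent (second arg = 1), GCC/Clang
 * can emit prefetchw on CPUs that have it. */
static void prefetch_for_write(void *buf, size_t bytes)
{
    char *p = (char *)buf;
    for (size_t i = 0; i < bytes; i += 64)    /* assume 64-byte lines */
        __builtin_prefetch(p + i, 1, 3);      /* rw=1: prefetch for write */
}
```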
It's possible that `clwb` (Cache-Line Write Back, without evicting) would be useful here, but I think `prefetchw` will always be at least as good as that, if not better (especially on AMD, where MOESI cache coherency can transfer dirty lines between caches, so you could get a line into your L1D that's still dirty and be able to replace that data without ever sending the old data to DRAM).
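If you do want to experiment with `clwb` anyway, here's a minimal sketch using the `_mm_clwb` intrinsic; it assumes a CPU with CLWB, compiling with `-mclwb`, and 64-byte lines, and the function name is hypothetical:

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: write back every line of a buffer without evicting it (clwb). */
static void writeback_lines(const void *buf, size_t bytes)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < bytes; i += 64)    /* assume 64-byte lines */
        _mm_clwb((void *)(p + i));
    /* No sfence here: only add one if a later store must be ordered after
     * these write-backs, e.g. before telling a device the data is in DRAM. */
}
```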
Ideally, `malloc` will give you memory that's still hot in the L1D cache of the current core. If you're finding that a lot of the time you're getting buffers that are still dirty and in L1D or L2 on another core, then look into a malloc with per-thread pools or some kind of NUMA-like thread awareness.
> As I understood, after `_mm_clflushopt()` I need to call `_mm_sfence()` to make its non-temporal stores visible to other cores/processors.
No, don't think of `clflushopt` as a store. It's not making any new data globally visible, so it doesn't interact with the global ordering of memory operations.
`sfence` makes your thread's later stores wait until the flushed data has made it all the way to DRAM or memory-mapped non-volatile storage.
If you're flushing lines that are backed by regular DRAM, you only need `sfence` before a store that will initiate a non-coherent DMA operation that reads DRAM contents without checking cache. Since other CPU cores do always go through cache, `sfence` is not useful or necessary for you (even if `clflushopt` were a good idea in the first place).
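To illustrate the one case where that combination does matter, here's a hedged sketch (not from the answer): flushing a buffer to DRAM before ringing a device doorbell, so a non-coherent DMA read of DRAM sees the data. `doorbell`, the function name, and the 64-byte line size are assumptions for illustration, not a real driver API:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: force a buffer out to DRAM, then tell a hypothetical device to
 * start a non-coherent DMA read of it.  Requires -mclflushopt. */
static void publish_for_noncoherent_dma(const void *buf, size_t bytes,
                                        volatile uint32_t *doorbell)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < bytes; i += 64)    /* assume 64-byte lines */
        _mm_clflushopt((void *)(p + i));      /* write back + evict to DRAM */
    _mm_sfence();                             /* flushes before the doorbell */
    *doorbell = 1;                            /* device may now DMA from DRAM */
}
```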
Even if you were talking about actual NT stores, other cores will eventually see your stores without `sfence`. You only need `sfence` if you need to make sure they see your NT stores before they see some later stores. I explained this in Make previous memory stores visible to subsequent memory loads.
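A minimal producer-side sketch of that ordering requirement, with made-up names (`publish`, `data`, `ready`) and C11 atomics assumed for the flag:

```c
#include <immintrin.h>
#include <stdatomic.h>

/* Sketch: the only reason to sfence after NT stores is so that a core which
 * observes the ready flag is guaranteed to also observe the NT-stored data. */
static void publish(__m128i *data, _Atomic int *ready, __m128i value)
{
    _mm_stream_si128(data, value);    /* weakly-ordered NT store */
    _mm_sfence();                     /* order NT store before the flag */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```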
> Can something bad happen?
No, `clflushopt` doesn't affect cache coherency. It just triggers write-back (and eviction) without making later stores/loads wait for it.
You could `clflushopt` memory allocated and in use by another thread without affecting correctness.