`clflushopt` is a terrible idea for this use-case. Evicting lines from the cache before overwriting them is the opposite of what you want: if they're hot in cache, you avoid an RFO (read-for-ownership).
If you're using NT stores, they will evict any lines that were still hot, so it's not worth spending cycles doing `clflushopt` first.
If not, you're completely shooting yourself in the foot by guaranteeing the worst case. See Enhanced REP MOVSB for memcpy for more about writing to memory and RFO vs. no-RFO stores. (e.g. `rep movsb` can do no-RFO stores, on Intel at least, but still leave the data hot in cache.) And keep in mind that an L3 hit can satisfy an RFO faster than going to DRAM.
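For concreteness, here's a minimal sketch (not from the original answer) of filling a buffer with SSE2 streaming stores. `fill_nt` is a made-up name, and it assumes a 16-byte-aligned buffer whose size is a multiple of 16. The NT stores themselves write-combine and bypass/evict the cache, which is why running `clflushopt` over the buffer first would be wasted work:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: fill a 16-byte-aligned buffer with NT (streaming) stores.
 * Assumes bytes is a multiple of 16.  movntdq writes around the cache,
 * so there is no point flushing the buffer with clflushopt beforehand. */
static void fill_nt(void *dst, size_t bytes, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    char *p = (char *)dst;
    for (size_t i = 0; i < bytes; i += 16)
        _mm_stream_si128((__m128i *)(p + i), v);   /* NT store: no RFO */
    _mm_sfence();   /* only needed to order the NT stores before a later
                       "data ready" store that another core will check */
}
```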
If you're about to write a buffer with regular stores (that will RFO), you might `prefetchw` on it to get it into Exclusive state in your L1D before you're ready to actually write.
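A hedged sketch of that idea, assuming GCC/Clang and 64-byte cache lines (the function name is mine): `__builtin_prefetch` with write intent can compile to `prefetchw` on targets that support it (e.g. with `-mprfchw` or a suitable `-march`):

```c
#include <stddef.h>

/* Sketch: warm a buffer into Exclusive/Modified state before overwriting
 * it with regular stores.  With write intent (second arg = 1), GCC/Clang
 * can emit prefetchw on CPUs that have it. */
static void prefetch_for_write(void *buf, size_t bytes)
{
    char *p = (char *)buf;
    for (size_t i = 0; i < bytes; i += 64)    /* assume 64-byte lines */
        __builtin_prefetch(p + i, 1, 3);      /* rw=1: prefetch for write */
}
```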
It's possible that `clwb` (Cache-Line Write Back, without evicting) would be useful here, but I think `prefetchw` will always be at least as good as that, if not better (especially on AMD, where MOESI cache coherency can transfer dirty lines between caches, so you could get a line into your L1D that's still dirty and be able to replace that data without ever sending the old data to DRAM).
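If you do want to experiment with `clwb` anyway, here's a minimal sketch using the `_mm_clwb` intrinsic; it assumes a CPU with CLWB, compiling with `-mclwb`, and 64-byte lines, and the function name is hypothetical:

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: write back every line of a buffer without evicting it (clwb). */
static void writeback_lines(const void *buf, size_t bytes)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < bytes; i += 64)    /* assume 64-byte lines */
        _mm_clwb((void *)(p + i));
    /* No sfence here: only add one if a later store must be ordered after
     * these write-backs, e.g. before telling a device the data is in DRAM. */
}
```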
Ideally, `malloc` will give you memory that's still hot in the L1D cache of the current core. If you're finding that a lot of the time you're getting buffers that are still dirty and in L1D or L2 on another core, then look into a malloc with per-thread pools or some kind of NUMA-like thread awareness.
> As I understood, after `_mm_clflushopt()` I need to call `_mm_sfence()` to make its non-temporal stores visible to other cores/processors.
No, don't think of `clflushopt` as a store. It's not making any new data globally visible, so it doesn't interact with the global ordering of memory operations.
`sfence` makes your thread's later stores wait until the flushed data has made it all the way to DRAM or memory-mapped non-volatile storage.
If you're flushing lines that are backed by regular DRAM, you only need `sfence` before a store that will initiate a non-coherent DMA operation that reads DRAM contents without checking cache. Since other CPU cores do always go through cache, `sfence` is not useful or necessary for you (even if `clflushopt` were a good idea in the first place).
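To illustrate the one case where that combination does matter, here's a hedged sketch (not from the answer): flushing a buffer to DRAM before ringing a device doorbell, so a non-coherent DMA read of DRAM sees the data. `doorbell`, the function name, and the 64-byte line size are assumptions for illustration, not a real driver API:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: force a buffer out to DRAM, then tell a hypothetical device to
 * start a non-coherent DMA read of it.  Requires -mclflushopt. */
static void publish_for_noncoherent_dma(const void *buf, size_t bytes,
                                        volatile uint32_t *doorbell)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < bytes; i += 64)    /* assume 64-byte lines */
        _mm_clflushopt((void *)(p + i));      /* write back + evict to DRAM */
    _mm_sfence();                             /* flushes before the doorbell */
    *doorbell = 1;                            /* device may now DMA from DRAM */
}
```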
Even if you were talking about actual NT stores, other cores will eventually see your stores without `sfence`. You only need `sfence` if you need to make sure they see your NT stores before they see some later stores. I explained this in Make previous memory stores visible to subsequent memory loads.
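A minimal producer-side sketch of that ordering requirement, with made-up names (`publish`, `data`, `ready`) and C11 atomics assumed for the flag:

```c
#include <immintrin.h>
#include <stdatomic.h>

/* Sketch: the only reason to sfence after NT stores is so that a core which
 * observes the ready flag is guaranteed to also observe the NT-stored data. */
static void publish(__m128i *data, _Atomic int *ready, __m128i value)
{
    _mm_stream_si128(data, value);    /* weakly-ordered NT store */
    _mm_sfence();                     /* order NT store before the flag */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```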
> Can something bad happen?
No, `clflushopt` doesn't affect cache coherency. It just triggers write-back (and eviction) without making later stores/loads wait for it.
You could `clflushopt` memory allocated and in use by another thread without affecting correctness.