Is there a hint I can put in my code indicating that a line should be removed from cache? As opposed to a prefetch hint, which would indicate I will soon need a line. In my case, I know when I won't need a line for a while, so I want to be able to get rid of it to free up space for lines I do need.

Elliot Gorokhovsky
  • You could try using `prefetchnta` before reading the memory. But unfortunately SSE4.1 streaming loads (`movntdqa`) don't seem to have any effect on write-back memory on CPUs like Intel Skylake desktop; the non-temporal hint seems to be ignored, instead of doing anything cool like inserting newly-allocated cache lines into the LRU position (next to be evicted). – Peter Cordes Jul 01 '17 at 22:21

1 Answer

clflush, clflushopt

Invalidates from every level of the cache hierarchy in the cache coherence domain the cache line that contains the linear address specified with the memory operand. If that cache line contains modified data at any level of the cache hierarchy, that data is written back to memory.
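As a minimal sketch of how this might look from C, the block below walks a buffer and issues `clflush` on each line through the `_mm_clflush` intrinsic. The helper name `evict_range` and the 64-byte `CACHE_LINE` constant are assumptions made for illustration, and the trailing `mfence` is a conservative choice rather than a requirement stated above.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; check your CPU */

/* Hypothetical helper: evict every cache line overlapping [p, p + len). */
static void evict_range(const void *p, size_t len)
{
    if (len == 0)
        return;

    /* Round the start address down to a line boundary. */
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)p + len;

    for (; addr < end; addr += CACHE_LINE)
        _mm_clflush((const void *)addr);

    /* Conservative full fence after the flushes. */
    _mm_mfence();
}
```

Flushing only changes where the data lives, not its value: a later read of the buffer still returns the same bytes, just at the cost of a cache miss.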

These instructions are not available on every CPU (in particular, clflushopt is only available on 6th-generation Intel Core (Skylake) processors and later). To be certain, you should use CPUID to verify their availability:

The availability of CLFLUSH is indicated by the presence of the CPUID feature flag CLFSH (CPUID.01H:EDX[bit 19]).

The availability of CLFLUSHOPT is indicated by the presence of the CPUID feature flag CLFLUSHOPT (CPUID.(EAX=7,ECX=0):EBX[bit 23]).
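The two feature bits quoted above could be tested roughly as follows. This sketch assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid` and `__get_cpuid_count`; the function names `has_clflush` and `has_clflushopt` are made up for the example.

```c
#include <cpuid.h>    /* __get_cpuid, __get_cpuid_count (GCC/Clang) */
#include <stdbool.h>

/* CPUID.01H:EDX[bit 19] reports CLFSH (clflush). */
static bool has_clflush(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;               /* leaf 1 not supported */
    return ((edx >> 19) & 1u) != 0;
}

/* CPUID.(EAX=7,ECX=0):EBX[bit 23] reports CLFLUSHOPT. */
static bool has_clflushopt(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;               /* leaf 7 not supported */
    return ((ebx >> 23) & 1u) != 0;
}
```

Both helpers return `false` when the requested CPUID leaf does not exist, so they are safe to call on older processors.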

If it is available, you should use clflushopt. It outperforms clflush when flushing buffers larger than 4 KiB (64 lines).

These are the benchmark figures from Intel's Optimization Manual: clflush (modified) is 111 cycles/line, clflush (shared) is 38 cycles/line, clflushopt (modified) is 17 cycles/line, and clflushopt (shared) is 4 cycles/line.
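One way to act on this is to pick the instruction at run time. The sketch below assumes GCC or Clang on x86; `flush_buffer`, `flush_lines_opt`, `flush_lines_legacy`, and the 64-byte `CACHE_LINE` are illustrative names, not part of any library. Because clflushopt is weakly ordered, the fast path issues a single `sfence` after the loop.

```c
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>  /* _mm_clflush, _mm_clflushopt, _mm_sfence */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size */

/* CPUID.(EAX=7,ECX=0):EBX[bit 23] reports CLFLUSHOPT. */
static bool has_clflushopt(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> 23) & 1u) != 0;
}

/* The target attribute lets this one function use clflushopt
   without building the whole file with -mclflushopt. */
__attribute__((target("clflushopt")))
static void flush_lines_opt(uintptr_t addr, uintptr_t end)
{
    for (; addr < end; addr += CACHE_LINE)
        _mm_clflushopt((void *)addr);
    _mm_sfence();       /* one fence after all the weakly ordered flushes */
}

static void flush_lines_legacy(uintptr_t addr, uintptr_t end)
{
    for (; addr < end; addr += CACHE_LINE)
        _mm_clflush((const void *)addr);
}

/* Flush every line overlapping [p, p + len).
   Returns true if the clflushopt path was taken. */
static bool flush_buffer(void *p, size_t len)
{
    uintptr_t addr = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)p + len;
    bool opt = has_clflushopt();

    if (opt)
        flush_lines_opt(addr, end);
    else
        flush_lines_legacy(addr, end);
    return opt;
}
```

In a real program the CPUID result would normally be cached once at startup instead of being queried on every call.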

For completeness: if you are running in a privileged context, you can also use invd (as a nuke-from-orbit option), which:

Invalidates (flushes) the processor’s internal caches and issues a special-function bus cycle that directs external caches to also flush themselves. Data held in internal caches is not written back to main memory.

or wbinvd, which:

Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write back modified data and another bus cycle to indicate that the external caches should be invalidated.

A newer instruction, available only on more recent CPUs, is clwb. Although it won't fit your need (because it doesn't necessarily invalidate the line), it's worth mentioning for completeness. It:

Writes back to memory the cache line (if dirty) that contains the linear address specified with the memory operand from any level of the cache hierarchy in the cache coherence domain. The line may be retained in the cache hierarchy in non-modified state. Retaining the line in the cache hierarchy is a performance optimization (treated as a hint by hardware) to reduce the possibility of cache miss on a subsequent access. Hardware may choose to retain the line at any of the levels in the cache hierarchy, and in some cases, may invalidate the line from the cache hierarchy.
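For illustration only, a guarded use of clwb from C might look like the sketch below. It assumes GCC or Clang on x86 and that CLWB support is reported in CPUID.(EAX=7,ECX=0):EBX[bit 24]; `has_clwb`, `clwb_line`, and `write_back_line` are hypothetical names.

```c
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>  /* _mm_clwb, _mm_sfence */
#include <stdbool.h>

/* CPUID.(EAX=7,ECX=0):EBX[bit 24] reports CLWB. */
static bool has_clwb(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> 24) & 1u) != 0;
}

/* Compiled with the clwb target so the intrinsic is usable here only. */
__attribute__((target("clwb")))
static void clwb_line(void *p)
{
    _mm_clwb(p);        /* write the line back; it may stay cached */
    _mm_sfence();       /* fence after the weakly ordered write-back */
}

/* Returns true if clwb was issued, false if the CPU lacks it. */
static bool write_back_line(void *p)
{
    if (!has_clwb())
        return false;
    clwb_line(p);
    return true;
}
```

Since the line may remain in the cache afterwards, this does not free cache space the way clflush or clflushopt does.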

Cody Gray - on strike
Margaret Bloom
  • @RenéG CodyGray did a great edit of my raw answer, a lot goes to him :) – Margaret Bloom Jun 08 '17 at 14:40
  • Just grammar fixes. I'm a bit of a stickler, especially on canonical references like the ones you regularly churn out! – Cody Gray - on strike Jun 09 '17 at 09:32
  • Is there any option to flush a cache line from levels closer to the CPU core? e.g to flush it from L1 only, or L1 and L2, but keeping it in L3. – Will Feb 04 '20 at 18:25
  • @Will There's a `cldemote` instruction but it has not yet made it to the mainstream ISA. It's in the set extensions reference manual. `clwb` writes back a modified line but may keep it in the original cache *or in a more distant cache*. That's unreliable for your purpose. What are the use cases for demoting a cache line? Lines are automatically evicted, so it's for saving bandwidth I guess (i.e. writing a modified line now rather than when making room for a new line). – Margaret Bloom Feb 04 '20 at 18:39
  • @MargaretBloom I'm processing one data stream (let's call it S1) sequentially, and sporadically I have to process a random cache line from a second data stream (S2) that I previously prefetched. In the worst case, processing too many lines from S1 could cause the prefetched S2 lines to be evicted. So I thought that I could force S1 lines to be discarded as soon as I know they are not going to be needed anymore. After all the processing has finished, S1 is accessed sequentially again, which is why I want to keep it in L3 cache. – Will Feb 11 '20 at 18:34
  • @Will AFAIK you cannot have a line in L3 but not in L1 *unless* the L3 is not inclusive (e.g. like in the Xeon Scalable) and if the L3 is not inclusive the evicted lines from L1 would be moved there automatically. If the L3 is inclusive, the L2 is not and the evicted lines will be moved there. So in short, maybe you don't have to do anything :) – Margaret Bloom Feb 11 '20 at 21:33