
I am trying to find a configuration or memory access pattern for Intel's clwb instruction that does not invalidate the cache line. I am testing on an Intel Xeon Gold 5218 processor with NVDIMMs. The Linux version is 5.4.0-3-amd64. I tried using Device-DAX mode and mapping the char device directly into the address space. I also tried adding the non-volatile memory as a new NUMA node and using numactl --membind to bind memory to it. In both cases, when I use clwb on a cached address, the line is evicted. I am observing the eviction with PAPI hardware counters, with the prefetchers disabled.

This is the simple loop that I am testing. Both array and tmp are declared volatile, so the loads are really executed.

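for (int i = 0; i < arr_size; i++) {
    tmp = array[i];          // first load
    _mm_clwb(&array[i]);     // write back the line holding array[i]
    _mm_mfence();            // fence before reading again
    tmp = array[i];          // second load: a hit only if the line was retained
}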

Both reads result in cache misses.

I was wondering if anyone else has tried to detect whether there is some configuration or memory access pattern that would leave the cache line in the cache?

Ana Khorguani
  • I think this was already clear to you, but Intel's definition of CLWB does not require that such a case exists on any particular platform. If I were trying to find such a case, I would test CLWB on ordinary, non-persistent memory, and I would try both normal and persistent memory accessed from both local and remote sockets. – John D McCalpin Feb 17 '20 at 17:04
  • @JohnDMcCalpin yes, I have read that. But if there is no case when the cache line is not invalidated, why would not they say so :D. I have already tested clwb with DRAM, gives same results. But I will try the mix access which did not occur to me before. Thanks a lot for the suggestion. – Ana Khorguani Feb 17 '20 at 20:45
  • You've confirmed that without `clwb`, your test measures near-zero cache misses? That would rule out a testing error. – Peter Cordes Feb 17 '20 at 23:46
  • It is possible that no current processors retain lines on which CLWB is used, but that future processors may behave differently. With the possible exception of ordering details, it is possible that CLWB is implemented using CLFLUSH in the current implementation. CLWB has some similarities to my patent (https://patents.google.com/patent/US20090216950), but I think that it exists just to make sure that dirty data has been written to persistent memory. – John D McCalpin Feb 18 '20 at 00:57
  • @PeterCordes yes, without clwb I get a cache miss for the first read operation and then a cache hit for the second read. I am evaluating with an array size of 100,000, for example, and there is a clear difference with and without the clwb instruction. – Ana Khorguani Feb 18 '20 at 19:26
  • @JohnDMcCalpin I see. Well, basically clwb has exactly the same behavior as clflushopt on the Skylake microarchitecture, for example. As you say, all three make sure that dirty data has been written to persistent memory, but unlike clflush, clwb and clflushopt have almost no ordering constraints except fences. But it's a bit disappointing that there are two instructions doing the same thing, and neither of them leaves the cache line valid. So I was thinking maybe I am missing some configuration details or an access pattern that leverages clwb to not invalidate cache lines. – Ana Khorguani Feb 18 '20 at 19:36
  • Agreed, it's disappointing. But it's still better that Intel introduced `clwb` in the first CPUs to support persistent memory so future libraries can use it without having to do dynamic dispatch based on CPUID, instead of waiting to introduce the instruction with CPUs that support it properly (no eviction). It'll make it much nicer in the long term once there are CPUs that support it. Thanks for posting about this SKX behaviour; like you I'd been assuming CLWB would do what it's designed for. Hopefully it's implemented soon, like Ice Lake. (If that even counts as soon for non-laptops...) – Peter Cordes Feb 19 '20 at 01:07
  • @PeterCordes Ah ok, I see your point. That makes sense. Yes, hopefully there will be a processor with a "real" clwb soon. – Ana Khorguani Feb 19 '20 at 12:03

1 Answer


clwb behaves like clflushopt on SKX and CSL. However, programs that use clwb on these processors will automatically benefit when run on a future processor that supports an optimized implementation of clwb.

clwb retains the cache line on ICL.

Note that cpuid leaf 0x7 information from InstLatx64 says that ICL doesn't support clwb, which is incorrect.
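
If you would rather query the feature bits yourself than rely on a published dump, something along these lines should work. This is only a minimal sketch, assuming a GCC or Clang toolchain that ships `<cpuid.h>`; the helper names (`leaf7_ebx_bit`, `cpu_has_clwb`, `cpu_has_clflushopt`) are made up for illustration. It reads leaf 0x7, subleaf 0, where EBX bit 23 reports clflushopt and EBX bit 24 reports clwb. On non-x86 targets the sketch simply reports false.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (assumed toolchain) */
#endif

/* Hypothetical helper: returns bit `bit` of EBX from CPUID leaf 7, subleaf 0. */
static bool leaf7_ebx_bit(unsigned bit)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid_count returns 0 when the CPU does not implement leaf 7. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return ((ebx >> bit) & 1u) != 0;
#else
    (void)bit;       /* not an x86 target: no CPUID, report "unsupported" */
    return false;
#endif
}

/* CPUID.(EAX=07H,ECX=0):EBX[23] = CLFLUSHOPT */
bool cpu_has_clflushopt(void) { return leaf7_ebx_bit(23); }

/* CPUID.(EAX=07H,ECX=0):EBX[24] = CLWB */
bool cpu_has_clwb(void) { return leaf7_ebx_bit(24); }
```

Keep in mind that the bit only says the instruction can be executed. It says nothing about whether the line stays in the cache afterwards, so that still has to be measured, for example with the hardware counters used in the question.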

clwb is also supported on Zen 2, but I don't know how it works on this microarchitecture.

Hadi Brais
  • I have checked on SKX and CSL that clwb behaves like clflushopt, but is this official information or is it based on experiments? – Ana Khorguani Feb 21 '20 at 13:37
  • @AnaKhorguani Experiments. But it's also compatible with what the documentation says that it *may* retain the cache line in one or more levels of the cache hierarchy. – Hadi Brais Feb 21 '20 at 15:50
  • Ok, thanks. Well, that *may* is exactly my problem :D would have been much clearer just to say that from some microarchitecture it will not invalidate. – Ana Khorguani Feb 21 '20 at 16:08