In C++ (using any of the low-level intrinsics available on the platform) for x86 hardware (say Intel Skylake, for example), is it possible to send a cache line to another core without forcing the thread on that core to load the line explicitly?

My use case is in a concurrent data structure. In some cases, a core probing for spots walks through places in memory that might be owned by some other core(s). The threads on those cores are typically blocked on a condition variable, so they have some spare cycles where they can run additional "useful work". One example of "useful work" here might be streaming the data to the core that will load it in the future, so the loading core doesn't have to wait for the line to come into its cache before processing it. Is there some intrinsic/instruction available on x86 hardware that makes this possible?


A __builtin_prefetch didn't work well because, for some reason, it ends up adding that latency back into the code doing the loading :( Maybe the strides were not well configured, but I haven't been able to find good strides so far. This might be handled better, and deterministically, from the other cores that know their lines will eventually be loaded.
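For concreteness, the consumer-side prefetch loop I experimented with looked roughly like this (kPrefetchDistance is a placeholder tunable, not a recommendation):

```cpp
#include <cstddef>

// Consumer-side software prefetch: the prefetch has to be issued by the
// core that will do the loads, far enough ahead to hide memory latency
// but not so far ahead that the line is evicted again before use.
constexpr std::size_t kPrefetchDistance = 64;  // elements ahead; a guess to tune

long long sum_with_prefetch(const long long* data, std::size_t n) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchDistance < n)
            __builtin_prefetch(&data[i + kPrefetchDistance],
                               /*rw=*/0, /*locality=*/3);
        total += data[i];
    }
    return total;
}
```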

  • Interesting thought, although when you get down to the details it seems very complicated. How would you identify and encode the target core? How do you prevent the OS from migrating the thread off of it? How do you prevent cache-DoS attacks? It seems like something that should be managed by the software layer (OS/VMM). – Leeor Aug 03 '19 at 19:27
  • @Leeor I didn't give enough context about the use case, but I think this optimization will be so short-lived (e.g. popping from a queue where someone is pushing on the other end) that any scheduling decision won't impact the system enough to cause pathologies. What did you mean by cache DDoS attacks? – Curious Aug 05 '19 at 19:09
  • The user-level software would still need to know where the OS decided to run each thread during that phase. You can control that with affinity, but that creates other liabilities. The attack scenario is maybe a little exaggerated, but think of a user buying a core on a cloud system only to have another user constantly forcing cache lines onto it. – Leeor Aug 06 '19 at 23:54
  • @Leeor Makes sense. The code could in this case poll the core configuration from sched_getcpu() or numa_node_of_cpu(), although that's only a heuristic. – Curious Aug 06 '19 at 23:56

1 Answer


There is no "push"; a cache line enters L1d on a physical core only after that core requests it. (Because of a load, SW prefetch, or even HW prefetch.)

Two logical cores can share the same physical core, in case that helps: it might be less horrible to wake up a prefetch-assistant thread to prime the cache if latency of some future load is far more important than throughput. I'm picturing the writer using a condition variable, a POSIX signal, a write to a pipe, or anything else that results in an OS-assisted wakeup of another thread whose CPU affinity is set to one or both of the logical cores that the thread you care about is pinned to.
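A sketch of that idea (Linux-specific; the helper structure and its names are hypothetical, and the CPU number you'd pin to depends on your topology): a helper thread pinned to the consumer's sibling logical core sleeps on a condition variable, and the writer wakes it after publishing so it can touch the lines and warm the shared physical core's private caches.

```cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>

struct PrefetchAssistant {
    std::mutex m;
    std::condition_variable cv;
    const char* base = nullptr;   // range the writer wants warmed
    std::size_t len = 0;
    bool stop = false;
    std::atomic<int> handled{0};  // requests completed, for observability

    // Pin the helper to the consumer's sibling logical core (Linux-only).
    static void pin_to_cpu(std::thread& t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
    }

    // Helper thread body: sleep until the writer hands us a range,
    // then prefetch it line by line (64-byte stride).
    void run() {
        std::unique_lock<std::mutex> lk(m);
        while (true) {
            cv.wait(lk, [&] { return base != nullptr || stop; });
            if (stop) break;
            for (std::size_t off = 0; off < len; off += 64)
                __builtin_prefetch(base + off, /*rw=*/0, /*locality=*/3);
            base = nullptr;
            handled.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Writer side: publish data first, then ask the helper to warm it.
    void request(const char* p, std::size_t n) {
        { std::lock_guard<std::mutex> lk(m); base = p; len = n; }
        cv.notify_one();
    }
};
```

The pinning call is the part that matters: without affinity the helper can land on an unrelated physical core, and its prefetches warm the wrong L1d/L2.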


The best you can possibly do from the writer side is trigger write-back to shared (L3) cache, so the other core can hit in L3 instead of finding the line owned by another core and having to wait for that write-back too (or, depending on the uarch, for a direct core->core transfer).

e.g. on Ice Lake or later, use clwb to force a write-back that leaves the line clean but still cached. (Note that it does still force the data to go all the way to DRAM.) On SKX, clwb unfortunately evicts, like clflushopt.
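A hedged sketch of the writer side (the function names and the CPUID guard are my own; the target attribute lets _mm_clwb compile without a global -mclwb flag, and the runtime check makes it a no-op on CPUs without CLWB):

```cpp
#include <cpuid.h>
#include <immintrin.h>
#include <atomic>

// _mm_clwb needs the clwb target feature; gate it per-function.
__attribute__((target("clwb")))
static void clwb_line(void* p) {
    _mm_clwb(p);
}

// CPUID.(EAX=7,ECX=0):EBX bit 24 advertises CLWB support.
static bool cpu_has_clwb() {
    unsigned a, b, c, d;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d)) return false;
    return (b >> 24) & 1u;
}

// After the producer's stores, push the line toward L3/DRAM so the
// consumer hits in shared cache instead of waiting for a dirty-line
// transfer from this core.
void publish_line(void* line) {
    std::atomic_thread_fence(std::memory_order_release);
    if (cpu_has_clwb())
        clwb_line(line);
}
```

On SKX this still evicts (clwb behaves like clflushopt there), so measure on your target uarch before committing to it.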

See also CPU cache inhibition where I suggested possibly using a memory region set to write-through caching, if that's possible under a mainstream OS. See also How to force cpu core to flush store buffer in c?

Or of course pin both writer and reader to the same physical core so they communicate via L1d. But then they compete for execution resources.

Peter Cordes