In C++ (using any of the low-level intrinsics available on the platform) on x86 hardware (say, Intel Skylake), is it possible to send a cache line to another core without forcing the thread on that core to load the line explicitly?
My use case is a concurrent data structure. In some cases, a core that is probing for spots walks through memory locations that might be owned by other cores. The threads on those cores are typically blocked on a condition variable, so they have spare cycles in which they could do additional "useful work". One example of such work would be streaming the data to the core that will load it in the future, so that the loading core doesn't have to wait for the line to arrive in its cache before processing it. Is there an intrinsic or instruction on x86 hardware that makes this possible?
Using __builtin_prefetch didn't work very well: for some reason, it ends up adding that latency back to the code doing the loading :( Maybe the strides were not well configured, but I haven't been able to find good strides so far. This might be handled better, and more deterministically, from the other cores, which know that their lines might eventually be loaded.
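
For concreteness, a software-prefetch loop of the kind described above might look roughly like the following. This is only a minimal sketch and not the actual code: Slot, probe, and kPrefetchDistance are hypothetical names, the one-slot-per-cache-line layout is an assumption, and the distance of 8 is an arbitrary, untuned value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical slot layout (assumption): one slot per 64-byte cache line.
struct alignas(64) Slot {
    std::uint64_t key;
    std::uint64_t value;
};

// Illustrative prefetch distance, in slots. An assumption, not a tuned value.
constexpr std::size_t kPrefetchDistance = 8;

// Probe linearly for `key`, issuing a read prefetch kPrefetchDistance slots
// ahead of the slot currently being examined.
// Returns the index of the matching slot, or slots.size() if none matches.
std::size_t probe(const std::vector<Slot>& slots, std::uint64_t key) {
    const std::size_t n = slots.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchDistance < n) {
            // GCC/Clang builtin: rw = 0 (read), locality = 3 (keep in all cache levels).
            __builtin_prefetch(&slots[i + kPrefetchDistance], 0, 3);
        }
        if (slots[i].key == key) {
            return i;
        }
    }
    return n;
}
```

If the distance is too small, the prefetch is issued too late to hide the miss, and the demand load still stalls on the in-flight line, which may be the "latency added back" effect mentioned above.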