
On an Intel CPU, I want CPU core A to signal CPU core B when A has completed an event. There are a couple ways to do this:

  1. A sends an interrupt to B.
  2. A writes a cache line (e.g., a bit flip) and B polls the cache line (a sketch of this option follows this list).
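
A minimal sketch of option 2 in C11, assuming Linux/x86 and a cache-line-aligned flag (the names here are illustrative, not from any particular library):

```c
// Option 2 sketch: shared-flag polling. The flag is cache-line aligned so
// it doesn't false-share with unrelated data.
#include <stdatomic.h>
#include <immintrin.h>   // _mm_pause

_Alignas(64) static atomic_int event_ready;

// Core A: publish the event.
void signal_event(void) {
    atomic_store_explicit(&event_ready, 1, memory_order_release);
}

// Core B: poll with low per-iteration overhead. _mm_pause backs off the
// spin loop, saving power and pipeline resources for a sibling hyperthread.
void wait_for_event(void) {
    while (!atomic_load_explicit(&event_ready, memory_order_acquire))
        _mm_pause();
}
```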

I want B to learn about the event with the least amount of overhead possible. Note that I am referring to overhead, not end-to-end latency. It's alright if B takes a while to learn about the event (e.g., periodic polling is fine), but B should waste as few cycles as possible detecting the event.

Option 1 above has too much overhead due to the interrupt handler. Option 2 is better, but I am still unhappy with the amount of time that B must wait for the cache line to transfer from A's L1 cache to its own L1 cache.

Is there some way A can directly push the cache line into B's L1 cache? It's fine if there is additional overhead for A in this case. I'm not sure if there is some trick I can try where A marks the page as uncacheable and B marks the page as write-back...

Alternatively, is there some other mechanism built into Intel processors that can help with this?

I assume this is less of an issue on AMD CPUs as they use the MOESI coherence protocol, so the "O" should presumably allow A to broadcast the cache line changes to B.

Jack Humphries
  • I guess this is where strong memory ordering bites you. On a weakly-ordered architecture, you could issue the load, execute a bunch of other unrelated work out-of-order, and only put an acquire barrier before the part that actually depends on the event having completed. But on x86, all loads are already acquire, so this won't work if your "unrelated work" involves any loads. If it's pure computation, maybe you can do it all in registers...? – Nate Eldredge Feb 18 '23 at 17:51
  • @NateEldredge: Even if your unrelated work doesn't involve any loads, I think x86 requires loads to fully complete (produce a value) before they can retire from the ROB. I've always assumed that, and it gives you LoadStore ordering for free, but maybe it is possible for the MOB (memory order buffer) to just track the order of outstanding loads and block younger stores from committing? Or maybe not: if you had two outstanding loads and they arrived in the wrong order, you wouldn't be able to roll back (memory order mis-speculation pipeline nuke) if the loads had retired. – Peter Cordes Feb 18 '23 at 21:47
  • @NateEldredge: That's something that could be tested with a microbenchmark, like high-throughput ALU work between pointer-chasing loads that will miss all the way to DRAM. If amount of work between loads can be increased beyond the ROB size without slowdowns, that would indicate that a load buffer can track a load after it retires from the ROB. Actually, that's almost what https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ did, with `nop`s as fillers that are known to use one ROB slot on then-current x86. (Zen 4 can fuse a NOP with the previous insn in some cases.) – Peter Cordes Feb 18 '23 at 21:51
  • This could be done very simply in hardware, but current architectures don't support the required user-accessible mechanisms. For Intel and AMD processors, I discussed some of the details of the lowest-latency producer-consumer exchanges in https://sites.utexas.edu/jdm4372/2016/11/22/some-notes-on-producerconsumer-communication-in-cached-processors/. – John D McCalpin Feb 19 '23 at 20:55

1 Answer


There's disappointingly little you can do about this on x86 without some very recent ISA extensions, like cldemote (Tremont or Alder Lake / Sapphire Rapids) or user-space IPI (inter-processor interrupts) in Sapphire Rapids, and maybe also Alder Lake. (See Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions? for details on UIPI.)
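
For the cldemote part, a hedged sketch of how the producer side might use it (assuming GCC/Clang with -mcldemote; cldemote is only a hint, and executes as a NOP on CPUs that lack it):

```c
// cldemote hints the core to demote the line from its private caches toward
// a more distant shared level (typically L3). B's next load can then hit in
// L3 instead of having to snoop the dirty line out of A's L1d. Note that it
// does NOT push the line into B's L1.
#include <stdatomic.h>
#include <immintrin.h>

_Alignas(64) static atomic_int event_ready;

void signal_event_and_demote(void) {
    atomic_store_explicit(&event_ready, 1, memory_order_release);
    _cldemote(&event_ready);   // hint only; NOP on older CPUs
}
```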

Without any of those features, the choice between occasional polling (or monitor/mwait if the other core has nothing to do) vs. interrupt depends on how many times you expect to poll before you want to send a notification. (And how much potential throughput you'll lose due to any knock-on effects from the other thread not noticing the flag update soon, e.g. if that means larger buffers leading to more cache misses.)
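
For the monitor/mwait case without needing ring 0, the WAITPKG extension (Tremont, and Alder Lake and later) exposes umonitor/umwait to user space. A sketch, assuming GCC/Clang with -mwaitpkg (the deadline value is arbitrary):

```c
// B arms a monitor on the flag's cache line, then enters a lightweight
// user-space wait state until A writes to that line or the TSC deadline
// passes. Far cheaper per wakeup than an interrupt, with no busy spin.
#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _umonitor, _umwait (GCC/Clang)

_Alignas(64) static atomic_int event_ready;

void wait_for_event_umwait(void) {
    while (!atomic_load_explicit(&event_ready, memory_order_acquire)) {
        _umonitor(&event_ready);                 // arm the address monitor
        // Re-check after arming to close the lost-wakeup race.
        if (atomic_load_explicit(&event_ready, memory_order_acquire))
            break;
        uint64_t deadline = __rdtsc() + 100000;  // arbitrary timeout
        _umwait(0, deadline);                    // 0 = C0.2, the deeper state
    }
}
```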

In user-space, other than shared memory or UIPI, the alternatives are OS-delivered inter-process communication such as a signal, a pipe write, or eventfd; IIRC, the Linux UIPI benchmarks compared it against various such mechanisms for latency and throughput.
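
As one concrete example of the OS-delivered route, a sketch using Linux eventfd (higher latency than shared-memory polling, but B burns no cycles while blocked):

```c
// eventfd: A increments a kernel counter; B blocks in read() until it's
// nonzero. All the overhead is syscall + scheduler, not spinning.
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

int make_event_fd(void) {
    return eventfd(0, 0);            // counter starts at 0, blocking mode
}

void signal_event_fd(int efd) {      // core A
    uint64_t one = 1;
    write(efd, &one, sizeof one);    // adds 1 to the counter, wakes B
}

void wait_for_event_fd(int efd) {    // core B
    uint64_t val;
    read(efd, &val, sizeof val);     // blocks until nonzero, then resets
}
```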


AMD CPUs don't broadcast stores; that would swamp the interconnect with traffic and defeat the benefit of a private L1d cache for lines that get written repeatedly between accesses from other cores (even if broadcasting were skipped for lines that weren't recently shared).

Peter Cordes