For example, on x86, two CPU cores are running different software threads.
At some moment, these two threads need to run on their CPU cores at the same time.
Is there a way to sync up these two CPU cores/threads, or something like that, to make them start running at (almost) the same time (at the instruction level)?

- Even if this were possible, they would go out of sync very quickly. – tkausl Jul 08 '19 at 09:22
- @tkausl: Yup, the next cache miss, interrupt, or branch mispredict would do it. But if you manage to avoid that, and the states of the ROB / RS are similar on the two cores, it's plausible that they could run in close to lock-step for a good while. (Of course that's very hard, especially avoiding interrupts, or doing anything useful without cache misses or branch misses.) – Peter Cordes Jul 08 '19 at 09:51
- @PeterCordes "it's plausible that they could run in close to lock-step for a good while." Depends on the definition of "good while". A few thousand instructions? Maybe. Two seconds? Unlikely. They would need to be scheduled out and in at the exact same time, which is very unlikely to happen. (The OP didn't mention anything about realtime systems, so I assumed they're talking about a regular non-realtime system.) – tkausl Jul 08 '19 at 09:53
- Which raises another point: what do you mean by "run"? Issue from the front-end? Execute (dispatch to an execution unit? or complete execution)? Or retire? I'd tend to go with Execute because that's when an instruction like `rdtsc` samples the timestamp counter, so it's the one you can actually record the timing of and then compare later. – Peter Cordes Jul 08 '19 at 09:53
- @tkausl: the scheduler can't run if there are no interrupts (disabled, or just over a short enough interval that there aren't any, or on cores that you've isolated from interrupts except maybe the timer interrupt). My "good while" was hundreds or thousands of cycles being plausible, not tens of millions (for a 10 ms timeslice)! – Peter Cordes Jul 08 '19 at 09:56
- I know that after a while they would go out of sync, but I still wonder if there is a way to sync them to the same moment. – wangt13 Jul 08 '19 at 10:25
- Using a variable to sync the two threads would probably result in one seeing the updated value slightly before the other (unless the ring bus can handle multiple requests simultaneously or the threads are in the same core). Using interrupts may be more precise (IPIs, regular IRQs, or the TXT wakeup), as the signal path is designed to reach multiple destinations (but if it uses the ring bus to reach the destination core, different cores will probably have different latencies). If the TSC is in sync, `umwait` could give the best results. – Margaret Bloom Jul 08 '19 at 10:52
2 Answers
Use a shared variable to communicate a `rdtsc`-based deadline between the two threads. E.g., set a deadline of, say, the current `rdtsc` value plus 10,000.

Then have both threads spin on `rdtsc`, waiting until the gap between the current `rdtsc` value and the deadline is less than a threshold value `T` (`T = 100` should be fine). Finally, use the final gap value (that is, the deadline `rdtsc` value minus the last read `rdtsc` value) to jump into a sequence of dependent `add` instructions such that the number of `add` instructions is equal to the gap.

This final step compensates for the fact that each chip will generally not be "in phase" with respect to its `rdtsc` spin loop. E.g., assuming a 30-cycle back-to-back throughput for `rdtsc` readings, one chip may get readings of 890, 920, 950, etc., while the other may read 880, 910, 940, so there will be a 10- or 20-cycle error if `rdtsc` alone is used. Using the add-slide compensation, if the deadline was 1,000 and the threshold was 100, the first thread would trigger at `rdtsc == 920` and execute 80 additions, while the second would trigger at `rdtsc == 910` and execute 90 additions. In principle, both cores are then approximately synced up.
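A minimal C sketch of that idea (untested; assumes a GCC/Clang toolchain on x86-64, `sync_to_deadline` is just an illustrative name, and the add slide is approximated here by a loop of dependent `add`s rather than a jump into an unrolled sequence, so the loop overhead adds a little extra error):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

#define T 100            /* threshold: somewhat larger than back-to-back rdtsc latency */

static inline void sync_to_deadline(uint64_t deadline)
{
    /* Spin until we are within T reference cycles of the deadline. */
    int64_t gap;
    do {
        gap = (int64_t)(deadline - __rdtsc());
    } while (gap > T);

    if (gap < 0)         /* overshot the deadline entirely; nothing to pad */
        gap = 0;

    /* Pad out the residual gap with a dependent add chain: roughly one add
       per cycle, assuming the core clock matches the TSC frequency
       (otherwise scale the count by the frequency ratio). */
    uint64_t x = 0;
    for (int64_t i = 0; i < gap; i++)
        __asm__ volatile("add $1, %0" : "+r"(x));  /* stand-in for an unrolled add slide */
}
```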
Some notes:
- The above assumes the CPU frequency is equal to the nominal `rdtsc` frequency; if that's not the case, you'll have to apply a compensation factor based on the nominal-to-true frequency ratio when calculating where to jump into the add slide.
- Don't expect your CPUs to stay synced for long: anything like an interrupt, a variable-latency operation like a cache miss, or a lot of other things can make them get out of sync.
- You want all your payload code, and the addition slide, to be hot in the icache of each core, or else they are very likely to get out of sync immediately. You can warm up the icache by doing one or more dummy runs through this code prior to the sync.
- You want `T` to be large enough that the gap is always positive, so somewhat larger than the back-to-back `rdtsc` latency, but not so large as to increase the chance of events like interrupts during the add slide.
- You can check the effectiveness of the "sync" by issuing a `rdtsc` or `rdtscp` at various points in the "payload" code following the sync-up and seeing how close the recorded values are across threads (a sketch of such a check follows below).
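A hedged usage sketch of that check (pthread-based; the names are made up for illustration, and core pinning, frequency pinning, and the icache warm-up pass mentioned above are omitted for brevity):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

static volatile uint64_t shared_deadline;   /* 0 = not yet published */
static uint64_t arrival[2];

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;
    while (shared_deadline == 0)             /* wait for the deadline to be published */
        _mm_pause();
    sync_to_deadline(shared_deadline);       /* helper from the sketch above */
    unsigned aux;
    arrival[id] = __rdtscp(&aux);            /* timestamp right after the sync point */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);

    shared_deadline = __rdtsc() + 10000;     /* deadline ~10k reference cycles out */

    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    int64_t delta = (int64_t)(arrival[1] - arrival[0]);
    printf("arrival delta: %lld reference cycles\n", (long long)delta);
    return 0;
}
```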
A totally different option would be to use Intel TSX: transactional extensions. Organize for the two threads that want to coordinate to both read a shared line inside a transactional region and then spin, and have a third thread write to the shared line. This will cause an abort on both of the waiting threads. Depending on the inter-core topology, the two waiting threads may receive the invalidation, and hence the subsequent TSX abort, at nearly the same time. Call the code you want to run "in sync" from the abort handler.
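If you want to experiment with that, here is a rough sketch using the RTM intrinsics (compile with -mrtm and check CPUID for RTM support first; note that TSX is fused off or disabled on many CPUs, and a transaction can also abort spuriously, e.g. on an interrupt, so a production version should re-check a flag before running the payload):

```c
#include <immintrin.h>

static volatile int go_line;   /* shared line; ideally alone in its own cache line */

void wait_for_signal_tsx(void (*payload)(void))
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* The load pulls the line into our transactional read set;
           a remote store to it aborts the transaction. */
        while (go_line == 0)
            ;                  /* spin inside the transaction */
        _xend();               /* normally not reached before the abort */
    }
    /* Abort path: both waiters should land here at nearly the same time
       after the signaling thread's store. */
    payload();
}
```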

Depending on your definition of "(almost) the same time", this is a very hard problem microarchitecturally.
Even the definition of "Run" isn't specific enough if you care about timing down to the cycle. Do you mean issue from the front-end into the out-of-order back-end? Execute? (dispatch to an execution unit? or complete execution successfully without needing a replay?) Or retire?
I'd tend to go with Execute¹ because that's when an instruction like `rdtsc` samples the timestamp counter. Thus it's the one you can actually record the timing of and then compare later.
footnote 1: on the correct path, not in the shadow of a mis-speculation, unless you're also ok with executions that don't reach retirement.
But if the two cores have different ROB / RS states when the instruction you care about executes, they won't continue in lock-step. (There are extremely few in-order x86-64 CPUs, like some pre-Silvermont Atoms, and early Xeon Phi: Knight's Corner. The x86-64 CPUs of today are all out-of-order, and outside of low-power Silvermont-family are aggressively so with large ROB + scheduler.)
x86 asm tricks:
I haven't used it, but x86 asm `monitor` / `mwait`, to have both CPUs monitor and wait for a write to a given memory location, could work. I don't know how synchronized the wake-up is. I'd guess that the less deep the sleep, the less variable the latency.
Early wake-up from an interrupt coming before a write is always possible. Unless you disable interrupts, you aren't going to be able to make this happen 100% of the time; hopefully you just need to make it happen with some reasonable chance of success, and be able to tell after the fact whether you achieved it.
(On very recent low-power Intel CPUs (Tremont), user-space-usable versions of these are available: `umonitor` / `umwait`. But in the kernel you can probably just use `monitor` / `mwait`.)
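For the user-space variant, a hedged sketch with the WAITPKG intrinsics (compile with -mwaitpkg; the flag and function names are made up, and wake-ups can be spurious, so the flag has to be re-checked):

```c
#include <stdint.h>
#include <immintrin.h>

static volatile uint32_t wake_flag;   /* shared "go" flag on its own cache line */

void wait_for_store_umwait(uint64_t tsc_deadline)
{
    while (wake_flag == 0) {
        _umonitor((void *)&wake_flag);   /* arm address monitoring on the flag's line */
        if (wake_flag != 0)              /* re-check after arming (race window) */
            break;
        _umwait(1, tsc_deadline);        /* 1 = C0.1: light sleep, lower wake-up jitter;
                                            returns early on interrupt or TSC deadline */
    }
}
```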
If `umonitor` / `umwait` are available, that means you have the WAITPKG CPU feature, which also includes `tpause`: like `pause`, but wait until a given TSC timestamp.
On modern x86 CPUs, the TSC is synchronized between all cores by hardware, so using the same wake-up time for multiple cores makes this trivial.
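A sketch of that with the `_tpause` intrinsic (again WAITPKG / -mwaitpkg; the function name is made up). Both cores call it with the same TSC deadline; `tpause` can return early, e.g. on an interrupt, hence the surrounding loop:

```c
#include <stdint.h>
#include <immintrin.h>
#include <x86intrin.h>

void wait_until_tsc(uint64_t deadline)
{
    while (__rdtsc() < deadline)
        _tpause(1, deadline);   /* 1 = C0.1: lighter sleep, less wake-up variability */
}
```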
Otherwise you could spin-wait on a `rdtsc` deadline and probably get within ~25 cycles at worst on Skylake.

`rdtsc` has a throughput of one per 25 cycles on Skylake (https://agner.org/optimize/), so you expect each thread to be on average 12.5 cycles late leaving the spin-wait loop, ±12.5. I'm assuming the branch-mispredict cost for both threads is the same. These are core clock cycles, not the reference cycles that `rdtsc` counts. RDTSC typically ticks close to the max non-turbo clock. See How to get the CPU cycle count in x86_64 from C++? for more about RDTSC from C.

See How much delay is generated by this assembly code in linux for an asm function that spins on `rdtsc` waiting for a deadline. You could write this in C easily enough (a minimal sketch follows).
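For example, assuming the `__rdtsc()` intrinsic from `<x86intrin.h>`:

```c
#include <stdint.h>
#include <x86intrin.h>

static inline void spin_until_tsc(uint64_t deadline)
{
    /* Busy-wait until the TSC reaches the agreed deadline; rdtsc's ~25-cycle
       throughput bounds how late we can exit the loop. */
    while (__rdtsc() < deadline)
        ;
}
```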
Staying in sync after initial start:
On a many-core Xeon where each core can change frequency independently, you'll need to fix the CPU frequency to something, probably max non-turbo would be a good choice. Otherwise with cores at different clock speeds, they'll obviously de-sync right away.
On a desktop you might want to do this anyway, in case pausing the clock to change CPU frequency throws things off.
Any difference in branch mispredicts, cache misses, or even different initial states of ROB/RS could lead to major desync.
More importantly, interrupts are huge and take a very long time compared to running 1 more instruction in an already-running task. And it can even lead to the scheduler doing a context switch to another thread. Or a CPU migration for the task, obviously costing a lot of cycles.
