If you're using a modern x86 CPU, the rdtscp instruction is probably useful: it gets a nanosecond-precise timestamp and a core ID in a single asm instruction, so both come from the same core. Linux sets the IA32_TSC_AUX MSR to a small integer matching the core numbering that it uses for affinity, e.g. run under taskset -c 3 ./a.out you get ID = 3 from the following test program:
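#include <x86intrin.h>
#include <stdio.h>
int main(){
    unsigned id;
    // One instruction: returns the 64-bit TSC and stores IA32_TSC_AUX into id
    unsigned long long tsc = __rdtscp(&id);
    printf("tsc = %llu ID = %u\n", tsc, id);   // unsigned conversions to match the types
}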
See also How to get the CPU cycle count in x86_64 from C++? for more details on the TSC in general, and the fact that most modern x86 systems have synchronized TSCs across cores. (Especially in single-socket multi-core CPUs, but well-designed multi-socket motherboards can also do it.) It also covers the fact that the TSC ticks at a fixed reference frequency rather than counting core clock cycles, so it's a proxy for wall-clock time.
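Since the TSC ticks at a fixed rate, you can turn TSC deltas into nanoseconds once you know that rate. Here's a rough sketch of one way to estimate it, by comparing against clock_gettime(CLOCK_MONOTONIC) over a short busy-wait. The helper names monotonic_ns and estimate_tsc_ticks_per_ns are made up for this example, and the result is only an approximation (the two clock reads aren't atomic with the TSC reads).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   // for clock_gettime under strict -std= modes
#endif
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>

// Hypothetical helper: CLOCK_MONOTONIC as a 64-bit nanosecond count.
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

// Hypothetical helper: busy-wait for at least interval_ns and return the
// observed ratio of TSC ticks to elapsed nanoseconds.
double estimate_tsc_ticks_per_ns(uint64_t interval_ns)
{
    unsigned id;                       // core ID, unused here
    uint64_t t0 = monotonic_ns();
    uint64_t c0 = __rdtscp(&id);
    uint64_t t1;
    do {
        t1 = monotonic_ns();
    } while (t1 - t0 < interval_ns);
    uint64_t c1 = __rdtscp(&id);
    return (double)(c1 - c0) / (double)(t1 - t0);
}
```

With a 10 ms interval, e.g. estimate_tsc_ticks_per_ns(10000000), I'd expect a value close to the reference frequency in GHz. This assumes synchronized TSCs, so a migration during the wait doesn't matter.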
So we can collect a core-ID with a timestamp as part of a single asm instruction, meaning that both definitely came from the same core. (An interrupt can't come in the middle of a single instruction.) This rules out some kinds of problems.
You might write a loop that spins on __rdtscp until it sees the core ID change, then record or print the last TSC value from the old core and the first TSC value from the new core. (And the delta, but you want the absolute TSC value to compare against a TSC from before starting a new thread or setting thread affinity.)
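// needs <stdint.h> and <x86intrin.h>
uint64_t spin_until_migration(/* some output args */)
{
    unsigned old_id, new_id;
    uint64_t old_tsc;
    uint64_t new_tsc = __rdtscp(&old_id);   // core ID we start on
    do {
        old_tsc = new_tsc;
        new_tsc = __rdtscp(&new_id);
    } while (old_id == new_id);   // keep spinning while still on the starting core
    // *arg1 = old_tsc;
    // ...
    return new_tsc - old_tsc;     // gap between last read on old core and first on new core
}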
(While spinning on the same core, the deltas will typically be around 32 core clock cycles on Skylake for example (https://uops.info/). If the CPU's current frequency is near its TSC reference frequency (often near its "rated" sticker frequency, e.g. 4008 MHz on my i7-6700k 4.0GHz with 4.2GHz turbo), that's also approximately 32 TSC reference cycles. Bigger deltas will happen for interrupt handlers or other things that temporarily stop user-space from running.)
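To get a feel for what "normal" same-core deltas look like on your machine (and how big the outliers from interrupts get), something like the following might help. It's an untuned sketch; the function name same_core_delta_range and its parameters are invented for this example.

```c
#include <stdint.h>
#include <x86intrin.h>

// Hypothetical helper: do `iterations` back-to-back __rdtscp reads and report
// the smallest and largest gap between consecutive reads taken on the same core.
void same_core_delta_range(unsigned long iterations,
                           uint64_t *min_delta, uint64_t *max_delta)
{
    unsigned prev_id, id;
    uint64_t prev = __rdtscp(&prev_id);
    *min_delta = UINT64_MAX;
    *max_delta = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        uint64_t now = __rdtscp(&id);
        if (id == prev_id) {            // ignore gaps that span a migration
            uint64_t d = now - prev;
            if (d < *min_delta) *min_delta = d;
            if (d > *max_delta) *max_delta = d;
        }
        prev = now;
        prev_id = id;
    }
}
```

The minimum should sit near the instruction's back-to-back cost in TSC ticks, and the maximum shows the longest stretch where user-space wasn't running on that core.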
Note that this code just measures the last time this thread got a timeslice on the old core. The scheduler decision to migrate the thread may come later, after the pinned thread has already been running for some time on this core.
You'll want to record TSC timestamps at various steps of your competing load, like before starting a new thread and after. Or before/after making a CPU-affinity system call that migrates an existing thread. I haven't thought through the full details.
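For the affinity-system-call case, a minimal sketch of taking timestamps on both sides of the call could look like this. It assumes Linux with glibc (sched_setaffinity with pid 0 applies to the calling thread); struct migration_sample and timed_migrate_to are names I made up for illustration, not anything standard.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               // for cpu_set_t, CPU_SET, sched_setaffinity
#endif
#include <sched.h>
#include <stdint.h>
#include <x86intrin.h>

// Hypothetical record of one migration attempt.
struct migration_sample {
    uint64_t tsc_before;   // TSC just before the affinity system call
    uint64_t tsc_after;    // TSC just after it returns
    unsigned id_before;    // IA32_TSC_AUX value (core ID on Linux) before
    unsigned id_after;     // same, after the call
};

// Restrict the calling thread to target_cpu and timestamp both sides.
// Returns 0 on success, -1 if sched_setaffinity failed (errno is set).
int timed_migrate_to(int target_cpu, struct migration_sample *out)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(target_cpu, &set);

    out->tsc_before = __rdtscp(&out->id_before);
    int rc = sched_setaffinity(0, sizeof(set), &set);
    out->tsc_after = __rdtscp(&out->id_after);
    return rc;
}
```

You could then compare tsc_before and tsc_after against the absolute TSC values recorded by a spin loop in another thread, since the TSCs are (hopefully) synchronized across cores.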