
I have the task of measuring the migration time of a thread in Linux, in programs written in C or C++.

I'm running a single thread on an idle core (with no affinity); then, to force the scheduler to load-balance and migrate that thread, I pin a second thread to the same core. We know that as soon as a new thread/process is created, the scheduler balances the threads, causing the first thread to migrate from that core to the least busy one (because the second thread is pinned there).
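For reference, a minimal sketch of the setup (simplified from my actual code; the spinning payloads and the one-second settle delay are just placeholders):

// Sketch of the experiment: t1 runs unpinned; t2 is then pinned to
// whatever core t1 settled on, forcing the scheduler to migrate t1.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

static volatile int t1_core = -1;

static void *unpinned_spin(void *arg) {
    (void)arg;
    t1_core = sched_getcpu();   // report which core we landed on
    for (;;) {}                 // busy-wait so this thread stays runnable
}

static void *pinned_spin(void *arg) {
    (void)arg;
    for (;;) {}                 // competing load on t1's core
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, unpinned_spin, NULL);  // no affinity
    sleep(1);                                        // let t1 settle on a core

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(t1_core, &set);
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof set, &set);
    pthread_create(&t2, &attr, pinned_spin, NULL);   // pinned: t1 must move

    pthread_join(t1, NULL);     // never returns; both threads spin forever
}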

I'd like to measure this thread migration time with a high-resolution timer. I tried perf sched, but it only keeps track of migration events. To clarify, by latency I mean the difference between the time the second thread is created on the core and the time the first thread has completely migrated to a new core.

How can I measure this latency?

Mohammad Siavashi
  • You're going to need some really slick timer precision to be able to measure one migration on a modern desktop PC. You'll probably have to measure thousands or millions of them and remove the time consumed by the thread creation and other effects in the loop. – user4581301 Mar 28 '22 at 21:32
  • 2
    @user4581301: Fortunately modern x86 has [`rdtscp`](https://www.felixcloutier.com/x86/rdtscp) which gets you a nanosecond-precise timestamp and a core ID. The TSC on modern x86 systems is synced across cores, and doesn't vary with CPU frequency so it works as a timesource. [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) – Peter Cordes Mar 28 '22 at 22:07
  • 1
    Slick timers, or rdtscp, are not going to be very reliable. If you want a reliable and systematic thing, you'd really need to add some code to the kernel to measure it there. After you've done that, you may want to expose it to work with perf, or at least with /proc/pid/stat. – BitWhistler Apr 03 '22 at 09:50

3 Answers


Kernelshark is your friend for this task. Take a look at The Graph Window, and especially the Task Plots (about halfway down). You have to have the right options compiled into your kernel (FTRACE for a start), and there's a certain amount of finesse required in controlling the data capture so that it encompasses the event you want, but then you can get very fine detail of what's going on in the whole machine.
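If your kernel has tracing enabled, capture is typically done with trace-cmd before viewing in Kernelshark; as a rough sketch (exact event names and options may vary by kernel), something like `trace-cmd record -e sched ./your_program` to record the scheduler events around your program's run, then `kernelshark trace.dat` to browse them.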

Basically, you can see the whole fate of every thread in the system: what CPU it was running on, when, why, etc. There's what seems to be a useful tutorial here.

This kind of tool is brilliant for seeing exactly what's happened in a system. Other things show up too, like CPU clock-speed / power-mode switches (everything can stop stock-still for no apparent reason), etc.

bazza

If you're using a modern x86 CPU, the rdtscp instruction is probably useful: it gets you a nanosecond-precise timestamp and a core ID as a single asm instruction, so both come from the same core. Linux sets the IA32_TSC_AUX MSR to a small integer matching the core numbering it uses for affinity, e.g. run under `taskset -c 3 ./a.out` you get `ID = 3` from the following test program:

#include <x86intrin.h>
#include <stdio.h>

int main(){
   unsigned id;
   unsigned long long tsc = __rdtscp(&id);   // also reads IA32_TSC_AUX into id
   printf("tsc = %llu    ID = %u\n", tsc, id);
}

See also How to get the CPU cycle count in x86_64 from C++? for more details on the TSC in general, and the fact that most modern x86 systems have synchronized TSCs across cores. (That's especially true in single-socket multi-core CPUs, but well-designed multi-socket motherboards can also achieve it.) Note also that the TSC counts at a fixed reference frequency, not core clock cycles, so it's a proxy for wall-clock time.

So we can collect a core-ID with a timestamp as part of a single asm instruction, meaning that both definitely came from the same core. (An interrupt can't come in the middle of a single instruction.) This rules out some kinds of problems.

You might write a loop that spins on __rdtscp until it sees the core ID change, then record or print the last TSC value from the old core and the first TSC value from the new core. (And the delta, but you want the absolute TSC value to compare against a TSC from before starting a new thread or setting thread affinity.)

#include <stdint.h>
#include <x86intrin.h>

// Spin until rdtscp reports a different core ID, i.e. until this thread
// has been migrated.  On return, *last_old_tsc is the final timestamp
// observed on the old core, *first_new_tsc the first one on the new core.
uint64_t spin_until_migration(uint64_t *last_old_tsc, uint64_t *first_new_tsc,
                              unsigned *old_core, unsigned *new_core)
{
   unsigned old_id, new_id;
   uint64_t old_tsc;
   uint64_t new_tsc = __rdtscp(&old_id);
   do {
       old_tsc = new_tsc;
       new_tsc = __rdtscp(&new_id);
   } while (old_id == new_id);    // keep spinning while still on the same core
   *last_old_tsc = old_tsc;
   *first_new_tsc = new_tsc;
   *old_core = old_id;
   *new_core = new_id;
   return new_tsc - old_tsc;
}

(While spinning on the same core, the deltas will typically be around 32 core clock cycles, e.g. on Skylake (https://uops.info/). If the CPU's current frequency is near its TSC reference frequency (often near its "rated" sticker frequency, e.g. 4008 MHz on my 4.0 GHz i7-6700k with 4.2 GHz turbo), that's also approximately 32 TSC reference cycles. Bigger deltas will happen for interrupt handlers or other things that temporarily stop user-space from running.)

Note that this code just measures the last time this thread got a timeslice on the old core. The scheduler's decision to migrate the thread may come later, after the pinned thread has already been running for some time on that core.

You'll want to record TSC timestamps at various steps of your competing load, like before starting a new thread and after. Or before/after making a CPU-affinity system call that migrates an existing thread. I haven't thought through the full details.
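For example, a very rough sketch of how the pieces might fit together (create_pinned_thread is a hypothetical helper that spawns a spinning thread pinned to the given core; error handling and warm-up omitted):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

// Hypothetical helper, not shown: spawn a spinning thread pinned to `core`.
extern void create_pinned_thread(int core);

// spin_until_migration() as defined above.
uint64_t spin_until_migration(uint64_t *, uint64_t *, unsigned *, unsigned *);

void measure_one_migration(void)
{
   unsigned my_core;
   uint64_t t_create = __rdtscp(&my_core);   // timestamp while still on the old core
   create_pinned_thread(my_core);            // competing, pinned load on our core

   uint64_t last_old, first_new;
   unsigned old_core, new_core;
   uint64_t gap = spin_until_migration(&last_old, &first_new,
                                       &old_core, &new_core);

   printf("kept running on core %u for %llu ref cycles after creation\n",
          old_core, (unsigned long long)(last_old - t_create));
   printf("gap across migration to core %u: %llu ref cycles\n",
          new_core, (unsigned long long)gap);
}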

Peter Cordes
  • 1
    In systems with multiple NUMA nodes, Linux initializes the IA32_TSC_AUX register to encode both the NUMA node (bits 23:12) and the Logical Processor ID (bits 11:0). – John D McCalpin Mar 31 '22 at 16:12
  • This would work if you call it frequently enough. "Enough" here means at least as frequently as the kernel migrates you. If you don't call it frequently enough, you can get migrated more than once between two calls, and obviously you would never know – BitWhistler Apr 03 '22 at 09:42
  • @BitWhistler: Call what? You mean the `__rdtscp` wrapper function? You can think of it more like an intrinsic for a single asm instruction that executes, not a real function you call. (Although it does have to assemble the 64-bit TSC result from two halves with a couple instructions.) `rdtscp` has a throughput of about one per 32 clock cycles if doing it in a loop like here, so yeah, about a million executions per scheduler timeslice with HZ=100 if you just spin on that like `spin_until_migration`, instead of checking occasionally between real work. Every 10 nanos or so should be fine :P – Peter Cordes Apr 03 '22 at 09:46
  • @PeterCordes, yes, of course, call rdtsc. It's just easier to say "call" rather than "execute an instruction" And you're also right it should be ok on a system with enough idle time. I did not read it all properly. I use isolated cpus so anything else pops the warning lights :) – BitWhistler Jul 05 '22 at 13:47
  • 1
    @BitWhistler: If you're looking for a short word that you can apply to a single instruction or intrinsic, "run" works well, as does "exec" or "execute". It's quite distracting to read "call" because that specifically implies a function, and is itself an x86 instruction name. "call rdtsc" is even worse because that's the name of a different instruction, not the function-like intrinsic `__rdtscp`. – Peter Cordes Jul 05 '22 at 13:58
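A tiny sketch of the IA32_TSC_AUX decoding John D McCalpin describes above (the bit layout is taken from his comment; treat the field widths as an assumption to verify on your kernel):

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
   unsigned aux;
   __rdtscp(&aux);
   unsigned node = (aux >> 12) & 0xfff;   // bits 23:12, per the comment above
   unsigned cpu  = aux & 0xfff;           // bits 11:0
   printf("NUMA node = %u, logical CPU = %u\n", node, cpu);
}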

Could you use a hardware performance counter? You didn't say exactly what your CPU is, and honestly I can't help you much with specifics, but most CPUs have a cycle counter in there somewhere. You would write code in your threads to cache the value of the counter, and later you can print it out to get cycle-accurate timing between two events. Obviously you can't print it right away or the printing will affect the timing.
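As a minimal sketch of that idea on x86 (using the __rdtsc intrinsic; the event hooks are placeholders for wherever your two events actually happen):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static uint64_t t_event1, t_event2;   // cached now, printed only later

void on_event1(void) { t_event1 = __rdtsc(); }   // cheap: just a store
void on_event2(void) { t_event2 = __rdtsc(); }

void report(void)   // call this after the measurement is over
{
   printf("cycles between events: %llu\n",
          (unsigned long long)(t_event2 - t_event1));
}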

https://en.wikipedia.org/wiki/Hardware_performance_counter

How to get the CPU cycle count in x86_64 from C++?

timmeh87