
In a busy loop, some code reads the current time twice in a row and measures the interval between the two readings. The interval is expected to be small, since the code belongs to a SCHED_FIFO process with the highest priority (99) and should not be preempted by other processes while running.

The rdtsc instruction is used to get the current time from userspace, and cpuid is used to serialize instruction execution.

The code is as follows:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdbool.h>
  #include <unistd.h>

  /* Full memory barrier; the disassembly below shows it compiles to mfence. */
  #define MEM_BAR __asm__ __volatile__ ("mfence" ::: "memory")

  /* Note: despite the name, this executes RDTSC, not RDTSCP. */
  #define RDTSCP(u64) do {                                      \
      unsigned int hi, lo;                                      \
      __asm__ __volatile__ ("RDTSC\n\t" : "=a" (lo), "=d" (hi)); \
      (u64) = ((uint64_t)hi << 32) | lo;                        \
  } while (0)

  unsigned int usecs = 20;  // sleep 20us between bursts of measurements
  uint64_t begin, end, temp;
  uint64_t min = UINT64_MAX, max = 0;
  bool flag = false;
  for (long long ii = 0; ii < OUTER_LOOP; ++ii) {
      for (long long jj = 0; jj < INNER_LOOP; ++jj) {
          MEM_BAR;
          __asm__ __volatile__ ("CPUID");
          RDTSCP(begin);   // first timestamp
          __asm__ __volatile__ ("CPUID");
          RDTSCP(end);     // second timestamp, taken immediately after
          __asm__ __volatile__ ("CPUID");
          MEM_BAR;
          temp = end - begin;
          if (temp < min) min = temp;  // minimum time interval changed
          if (temp > max) max = temp, flag = true;  // maximum time interval changed
      }

      if (flag) {
          printf("min: %lu\t\tmax: %lu \t(clocks)\n", min, max);
          fflush(stdout);
          flag = false;
      }
      usleep(usecs);
  }

Observing for several hours, the largest time interval on my PC (CPU i9-9900K, 3600 MHz) was 578112 clocks, which is about 160 us. This value is extraordinary IMHO.
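
For reference, a quick conversion from clocks to wall time at the nominal 3.6 GHz frequency:

  /* At 3.6 GHz there are 3600 TSC clocks per microsecond: */
  double interval_us = 578112.0 / 3600.0;   /* ≈ 160.6 us */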

My question is: why is there such a big time interval between these two back-to-back rdtsc instructions, and what causes it?

The compiler seems to be innocent; the generated assembly is as follows:

0x5555555554fb <main+594>       mfence
0x5555555554fe <main+597>       cpuid
0x555555555500 <main+599>       rdtsc
0x555555555502 <main+601>       mov    %eax,-0xe0(%rbp)
0x555555555508 <main+607>       mov    %edx,-0xdc(%rbp)
0x55555555550e <main+613>       mov    -0xdc(%rbp),%eax
0x555555555514 <main+619>       shl    $0x20,%rax
0x555555555518 <main+623>       mov    %rax,%rdx
0x55555555551b <main+626>       mov    -0xe0(%rbp),%eax
0x555555555521 <main+632>       or     %rdx,%rax
0x555555555524 <main+635>       mov    %rax,-0xa0(%rbp)
0x55555555552b <main+642>       cpuid
0x55555555552d <main+644>       rdtsc
0x55555555552f <main+646>       mov    %eax,-0xd8(%rbp)
0x555555555535 <main+652>       mov    %edx,-0xd4(%rbp)
0x55555555553b <main+658>       mov    -0xd4(%rbp),%eax
0x555555555541 <main+664>       shl    $0x20,%rax
0x555555555545 <main+668>       mov    %rax,%rdx
0x555555555548 <main+671>       mov    -0xd8(%rbp),%eax
0x55555555554e <main+677>       or     %rdx,%rax
0x555555555551 <main+680>       mov    %rax,-0x98(%rbp)
0x555555555558 <main+687>       cpuid
0x55555555555a <main+689>       mfence
foool
  • Your `RDTSCP()` macro uses RDTSC, not RDTSCP. (The actual RDTSCP instruction also writes ECX, so you need a clobber.) Also, `asm("cpuid")` is totally unsafe, writing E[ABCD]X without telling the compiler about it! Use `lfence` instead to serialize execution of `rdtsc` on Intel CPUs (and modern AMD with Spectre mitigation enabled, which strengthens `lfence` to be an OoO exec barrier like Intel guarantees.) It seems you compiled without optimization, so the compiler wasn't depending on keeping anything in registers across statements anyway, otherwise you'd be screwed. – Peter Cordes Nov 23 '21 at 04:22
  • None of that explains the measurement, though. The way this compiled, it should be valid. The TSC is synced across cores on the socket (and usually the whole system) unless you do something to desync it, and I think OSes normally avoid that. I assume your system is lightly loaded, so you wouldn't have other tasks keeping all logical cores busy for some time? – Peter Cordes Nov 23 '21 at 05:12
  • You didn't specify an OS; I guess Linux given the SCHED_FIFO. For that matter, motherboard could be relevant if an SMI entered SMM on this core for some reason; most mobos don't have any system-management-mode hooks after bootup, but you can check with `turbostat`: [Evaluating SMI (System Management Interrupt) latency on Linux-CentOS/Intel machine](https://stackoverflow.com/q/25399405) . You could also check for CPU migrations, although 160 us seems like a long time for that. – Peter Cordes Nov 23 '21 at 05:13
  • @PeterCordes The OS is Linux, and all other cores are almost idle. I know something about SMI/SMM, but I am not sure SMI is the only possible reason for such a long break. Or perhaps there are other types of interrupts that the OS also cannot detect. BTW, I measured the SMI latency in kernel mode, and it is no more than 10 us; refer to https://lwn.net/Articles/316622/ . – foool Nov 23 '21 at 07:36
  • Thanks for your explanation about compiler optimization on the code. I have already run into such a disaster, where the program ran unexpectedly, and was trying to find a hint. BTW, I set the core affinity for the code, so no CPU migration happens. – foool Nov 23 '21 at 08:45
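
Following up on the comments above: a serialized TSC read along the lines Peter Cordes suggests could look like the sketch below. It relies on lfence being an execution barrier (guaranteed on Intel, and on modern AMD with Spectre mitigations enabled) and, unlike bare `asm("cpuid")`, does not write registers behind the compiler's back:

  #include <stdint.h>

  /* Sketch only: lfence waits for all prior instructions to complete
   * locally before later instructions issue, so it can serialize rdtsc
   * without CPUID's undeclared writes to EAX/EBX/ECX/EDX. */
  static inline uint64_t rdtsc_serialized(void)
  {
      unsigned int hi, lo;
      __asm__ __volatile__ ("lfence\n\t"
                            "rdtsc\n\t"
                            "lfence"
                            : "=a" (lo), "=d" (hi) :: "memory");
      return ((uint64_t)hi << 32) | lo;
  }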

2 Answers


Though it has the highest scheduling priority, a userspace process may still be preempted by a kernel thread. You may:

  • Monitor the kernel scheduler with perf sched (https://man7.org/linux/man-pages/man1/perf-sched.1.html) to track kernel scheduling events and figure out which thread is using the core.
  • Reduce kernel noise on the test core to reduce performance variance (an example kernel command line follows this list):
    • Isolate the test core from the scheduler entirely with the boot option isolcpus.
    • Prevent the kernel from sending scheduling interrupts with the boot option nohz_full.
    • Move RCU callbacks to other cores with the boot option rcu_nocbs.
    • From my own experience: avoid using the first and the last core (0 and N).
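
For illustration, the three boot options might be combined like this (core 3 and the binary name are hypothetical; adjust to your topology):

  # Illustrative kernel command line (e.g. in /etc/default/grub):
  # isolate core 3, stop its scheduler tick, and offload its RCU callbacks
  GRUB_CMDLINE_LINUX="isolcpus=3 nohz_full=3 rcu_nocbs=3"
  # After updating grub and rebooting, pin the SCHED_FIFO test to that core:
  #   taskset -c 3 chrt -f 99 ./rdtsc_test
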
minhuw
  • Will a SCHED_FIFO userspace process be preempted by a kernel thread with the same priority? Any references? – foool Nov 23 '21 at 07:41
  • this article on the "tickless kernel" may help: https://lwn.net/Articles/549580/. The core is interrupted periodically to make scheduling decisions and to do some bookkeeping tasks such as RCU callbacks. Even if there is only one process running on a core, the kernel itself will use the core for a while to handle the interrupt. Sorry, "preempted by a kernel thread" may not be precise: since interrupts are handled by the kernel, and I referred to the interrupt handling as a "kernel thread", priority is not involved here. – minhuw Nov 23 '21 at 08:20
  • ... or just an IRQ happens to run on that core – abligh Nov 23 '21 at 08:31

In recent further research, I conducted several experiments and found that LOC (local APIC timer) interrupts are the reason for the unusually large intervals between consecutive time readings. Even a full tickless kernel triggers a LOC interrupt every couple of seconds for system statistics, so it seems such microsecond-scale time jitter is unavoidable on general-purpose Linux.
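
A quick way to watch these timer interrupts accumulate (assuming x86 Linux; LOC is the "Local timer interrupts" row) is:

  # The LOC count for the test core keeps growing even while that core
  # runs nothing but the pinned busy loop:
  watch -n1 'grep "LOC:" /proc/interrupts'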

What I did to verify this:

  1. Traced the underlying interrupts with the ftrace tool while querying the time in a busy loop, and found that LOC interrupts break in on the user program and cause the time jitter (a sketch of such a trace session follows this list).

  2. Reran the test on a fully isolated CPU core, with the help of the task-isolation kernel patch, which disables most interrupts, including LOC interrupts; the large intervals/jitter disappeared.
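
For step 1, the trace session might look roughly like this (a sketch; it assumes tracefs is mounted at /sys/kernel/debug/tracing, that the x86 irq_vectors tracepoints are available, and that the core number and binary name are hypothetical):

  cd /sys/kernel/debug/tracing
  echo 1 > events/irq_vectors/local_timer_entry/enable
  echo 1 > events/irq_vectors/local_timer_exit/enable
  echo 1 > tracing_on
  taskset -c 3 ./rdtsc_test     # run the busy-loop test pinned to its core
  grep local_timer trace        # LOC entries coincide with the observed jitter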

References:

A full task-isolation mode for the kernel

ftrace documentation

foool