
Problem


We are trying to implement a program that sends commands to a robot at a fixed cycle time, so it should run as a real-time application. We set up a PC with a PREEMPT_RT Linux kernel and launch our programs with `chrt -f 98` or `chrt -rr 99` to set the scheduling policy and priority. Loading the kernel and launching the program seem to work fine (see details below).
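For reference, a minimal sketch (just an illustration, assuming Linux/glibc) of requesting the same policy and priority from inside the process, equivalent to `chrt -f 98`:

#include <sched.h>
#include <cstdio>

int main() {
    // Ask for SCHED_FIFO with priority 98 for the calling process
    // (the same effect as launching the program with `chrt -f 98`).
    struct sched_param param{};
    param.sched_priority = 98;
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
        std::perror("sched_setscheduler");  // needs root or CAP_SYS_NICE
        return 1;
    }
    // ... real-time work ...
    return 0;
}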

We then measured the time (in CPU ticks) it takes to execute our program. We expected this time to be constant with very little variation. What we measured, though, were quite significant differences in computation time. Of course, we thought this could be undefined behavior in our rather complex program, so we created a very basic program and measured the time as well. The behavior was similarly bad.

Question


  • Why are we not measuring a (close to) constant computation time even for our basic program?
  • How can we solve this problem?

Environment Description


First of all, we installed an RT Linux Kernel on the PC using this tutorial. The main characteristics of the PC are:

| PC Characteristics | Details |
| --- | --- |
| CPU | Intel(R) Atom(TM) Processor E3950 @ 1.60GHz with 4 cores |
| Memory RAM | 8 GB |
| Operating System | Ubuntu 20.04.1 LTS |
| Kernel | Linux 5.9.1-rt20 SMP PREEMPT_RT |
| Architecture | x86-64 |

Tests


The first time we noticed this problem was when we were measuring the time it takes to execute this "complex" program with a single thread. We did a few tests, both with this program and with a simpler one, measuring:

  1. The CPU execution times
  2. The wall time (the real-world time)
  3. The difference (Wall time - CPU time) between them and the ratio (CPU time / Wall time).

We also did a latency test on the PC.

Latency Test

For this one, we followed this tutorial, and these are the results:

  • Latency Test Generic Kernel (figure)

  • Latency Test RT Kernel (figure)

The processes are shown in htop with a priority of RT.

Test Program - Complex

We called the function multiple times in the program and measured the time each call takes. The results of the two tests are:

Complex - Test 1 - Picture 1

From this we observed that:

  • The first execution (around 0.28 ms) always takes longer than the second one (around 0.18 ms), but most of the time it is not the longest iteration.
  • The mode is around 0.17 ms.
  • For those that take 17 ms, the difference is usually 0 and the ratio 1, although this is not exclusive to this time. For these, it seems like only one CPU is being used and it is saturated (there is no waiting time).
  • When the difference is not 0, it is usually negative. This, from what we have read here and here, is because more than 1 CPU is being used.

Test Program - Simple

We did the same test but this time with a simpler program:

#include <vector>
#include <iostream>
#include <time.h>

int main(int argc, char** argv) {
    const unsigned int iterations = 5000;
    double a = 5.5;
    double b = 5.5;
    double c = 4.5;
    std::vector<double> wallTime(iterations, 0);
    std::vector<double> cpuTime(iterations, 0);
    struct timespec beginWallTime, endWallTime, beginCPUTime, endCPUTime;

    std::cout << "Iteration | WallTime | cpuTime" << std::endl;

    for (unsigned int i = 0; i < iterations; i++) {
        // Start measuring time
        clock_gettime(CLOCK_REALTIME, &beginWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &beginCPUTime);

        // Function
        a = b + c + i;

        // Stop measuring time and calculate the elapsed time
        clock_gettime(CLOCK_REALTIME, &endWallTime);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endCPUTime);

        wallTime[i] = (endWallTime.tv_sec - beginWallTime.tv_sec) + (endWallTime.tv_nsec - beginWallTime.tv_nsec)*1e-9;
        cpuTime[i] = (endCPUTime.tv_sec - beginCPUTime.tv_sec) + (endCPUTime.tv_nsec - beginCPUTime.tv_nsec)*1e-9;

        std::cout << i << " | " << wallTime[i] << " | " << cpuTime[i] << std::endl;
    }
    return 0;
}

Simple - Test 1 - Picture 1
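For reference, a minimal sketch (not part of the test program above; it assumes the `wallTime` and `cpuTime` vectors that program fills) of how the difference and ratio from point 3 can be computed per iteration:

#include <iostream>
#include <vector>

// Illustrative helper (hypothetical name): print the difference
// (Wall time - CPU time) and the ratio (CPU time / Wall time) for each
// iteration, given the samples collected by the measurement loop.
void printDiffAndRatio(const std::vector<double>& wallTime,
                       const std::vector<double>& cpuTime) {
    std::cout << "Iteration | Wall - CPU | CPU / Wall" << std::endl;
    for (std::size_t i = 0; i < wallTime.size(); i++) {
        std::cout << i << " | "
                  << (wallTime[i] - cpuTime[i]) << " | "
                  << (cpuTime[i] / wallTime[i]) << std::endl;
    }
}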

Final Thoughts


We understand that:

  • If the ratio == number of CPUs used, they are saturated and there is no waiting time.
  • If the ratio < number of CPUs used, it means that there is some waiting time (theoretically we should only be using 1 CPU, although in practice we use more).

Of course, we can give more details.

Thanks a lot for your help!

AlejoDiaz49
  • In the simple case as `a` is not used has `a = b + c + i;` been removed by the optimizer ? Again in the simple case what is latency and jitter of calling `clock_gettime` 4 times ie how is the clock read such that it is not halfway through an update ? – Richard Critten Jun 10 '22 at 14:27
  • Even if by some miracle it isn't removed it will take no time in comparison to getting the time. – Goswin von Brederlow Jun 10 '22 at 20:01
  • @RichardCritten: Linux `clock_gettime` for `CLOCK_REALTIME` should IIRC be fully in user-space (using code + data from VDSO pages the kernel maps into user-space). It uses `rdtsc` to interpolate a fine-grained offset from the last timer tick. If it needs a 16-byte atomic read, it probably uses a seqlock, which normally doesn't have to retry but could if a timer interrupt arrives between reads of 64-bit halves. IDK about `CLOCK_PROCESS_CPUTIME_ID` ; there might be data exported for that, too. – Peter Cordes Jun 10 '22 at 21:10
  • Related: [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) re: `a = b+c+i` being way too trivial to be meaningful to benchmark. – Peter Cordes Jun 11 '22 at 04:39

1 Answer


Your function will almost certainly be optimized away, so you are really just measuring how long it takes to read the clocks. And as you can see, that doesn't take very long, apart from some exceptions discussed below.
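If you do want to time the arithmetic itself rather than the clock reads, here is a minimal sketch of one common trick (GCC/Clang-specific inline asm; the helper name is made up) to keep the statement from being removed:

// Force the compiler to treat 'value' as used and possibly modified, so the
// statement that produced it cannot be optimized away.
inline void doNotOptimize(double& value) {
    asm volatile("" : "+m"(value) : : "memory");
}

// Usage inside the measured section:
//     a = b + c + i;
//     doNotOptimize(a);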

The very first time you run the code (unless you have just compiled it), the pages need to be loaded from disk. If you are unlucky, the code spans pages and you include the loading of the next page in the measured time. That is quite unlikely given the code size.

On the first loop iteration, the code and any data need to be loaded into the caches, so that iteration takes longer to execute. The branch predictor might also need a few iterations to predict the loop correctly, so the second and third iterations might be slightly longer too.

For everything else I think you can blame scheduling:

  • an IRQ happens but nothing gets rescheduled
  • the process gets paused while another process runs
  • the process gets moved to another CPU thread leaving the caches hot
  • the process gets moved to another CPU core making L1 cache cold but leaving L2/L3 caches hot (if your L2 is shared)
  • the process gets moved to a CPU on another socket making L1/L2 caches cold but L3 cache hot (if L3 is shared)

You can do little about IRQs. Some you can pin to specific cores, but others are simply essential (like the timer interrupt for the scheduler itself). You kind of just have to live with that.

But you can pin your program to a specific CPU and pin everything else to all the other cores, basically reserving that core for the real-time code. I guess you would have to use cgroups for this, to keep everything else off the chosen core. You might still get some kernel threads running on the reserved core; there is nothing you can do about that. But it should eliminate most of the large execution times.
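A minimal sketch (assuming Linux/glibc; the core number is an arbitrary example) of pinning the calling process to one core with `sched_setaffinity`, so that everything else can then be kept off that core:

#include <sched.h>
#include <cstdio>

int main() {
    // Pin the calling process to core 3 (arbitrary choice). Everything else
    // can then be kept off that core, e.g. with cgroups/cpusets or the
    // isolcpus= kernel parameter.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... real-time loop now runs on core 3 only ...
    return 0;
}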

Goswin von Brederlow
  • There's also lazy dynamic linking on the first call to `clock_gettime`, but that happens *before* the first time measurement. So does loading glibc pages from disk (or soft page faulting to map them from the pagecache). But yeah, any slowness from the first exec of the `CLOCK_PROCESS_CPUTIME_ID` code path (probably in the VDSO, a kernel page already mapped) will be part of the wall-clock timed interval, since it happens after the first `CLOCK_REALTIME` measurement.) – Peter Cordes Jun 10 '22 at 21:14