Unexpected periodic behaviour of an ultra low latency hard real time multi-threaded x86 code

Question

I am running code in a loop for multiple iterations on a dedicated CPU with RT priority and want to observe its behaviour over a long time. I found a very strange periodic behaviour of the code.

Briefly, this is what the code does:

Arraythread
{
    while(1)
    {
        if(flag)
        {
            multiply matrix;
            record time;
            reset flag;
        }
    }
}

mainthread
{
    for(30 mins)
    {
        set flag;
        record time;
        busy-wait(500 μs);
    }
}

Here are the details about the machine I am using:

CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz
L1 cache: 32K d and 32K i
L2 cache: 1024K
L3 cache: 28160K
Kernel: 3.10.0-693.2.2.rt56.623.el7.x86_64 #1 SMP PREEMPT RT
OS: CentOS
Current active profile: latency-performance
I modified the global limit of Linux real time scheduling (sched_rt_runtime_us) from 95% to 100%
Both the above mentioned threads are bound on a single NUMA node each with priority 99
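
For reference, the throttling limit mentioned above can be read back from procfs. This is a minimal sketch, assuming the standard Linux files /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us (defaults 950000 and 1000000, i.e. 95%; a runtime of -1 disables throttling). The helper names rt_budget_percent, read_long and print_rt_budget are made up for illustration.

```c
#include <stdio.h>

/* Percentage of each scheduling period that RT tasks may consume.
   A negative runtime (-1) means RT throttling is disabled. */
double rt_budget_percent(long runtime_us, long period_us)
{
    if (runtime_us < 0)
        return 100.0;
    if (period_us <= 0)
        return -1.0; /* invalid input */
    return 100.0 * (double)runtime_us / (double)period_us;
}

/* Reads one integer from a procfs file; returns 0 on success. */
int read_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Prints the RT budget currently configured; returns 0 on success. */
int print_rt_budget(void)
{
    long runtime, period;
    if (read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) != 0 ||
        read_long("/proc/sys/kernel/sched_rt_period_us", &period) != 0)
        return -1;
    printf("RT budget: %.1f%%\n", rt_budget_percent(runtime, period));
    return 0;
}
```

Calling print_rt_budget() before the experiment starts is one way to confirm that the 100% setting is really in effect on the machine under test.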

More details about the code:

mainthread sets a flag every 500 μs. I used CLOCK_MONOTONIC_RAW with the clock_gettime function to read the time (let's say T0).
I put all the variables in a structure to reduce cache misses.
Arraythread runs a busy while loop and waits for the flag to be set.
Once the flag is set, it multiplies two big arrays.
Once the multiplication is done, it resets the flag and records the time (let's say T1).
I run this experiment for 30 mins (= 3600000 iterations)
I measure the time difference T1-T0 once the experiment is over.

Here is the clock:

The average period of the clock is ~500.5 microseconds. There are fluctuations, which are expected.

Here is the time taken by the array multiplication:

This is the full 30 minute view of the result.
There are four peaks in the results. The first peak is expected: on the very first iteration the data comes from main memory and the CPU was in a sleep state.
Apart from the first peak, there are three more peaks. The time difference between peak_3 and peak_2 is 11.99364 mins, while the time difference between peak_4 and peak_3 is 11.99358 mins (assuming a clock period of 500 μs).
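
The conversion from sample counts to elapsed time depends on the assumed period. The sketch below (function names are made up for illustration) shows the arithmetic: 3600000 samples at 500 μs give 30 minutes, and an interval computed with 500 μs can be rescaled to the ~500.5 μs average reported for the clock. If that average holds across the whole interval, 11.99364 mins becomes roughly 12.006 mins.

```c
/* Elapsed minutes spanned by `samples` iterations of `period_us` each. */
double samples_to_minutes(double samples, double period_us)
{
    return samples * period_us / 60e6; /* 60e6 microseconds per minute */
}

/* Rescales an interval that was computed with an assumed period
   to the period that was actually measured. */
double rescale_interval(double interval, double assumed_period_us,
                        double measured_period_us)
{
    return interval * measured_period_us / assumed_period_us;
}
```

For example, rescale_interval(11.99364, 500.0, 500.5) applies the 0.1% correction to the peak_3 to peak_2 spacing.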

If I zoom in further:

This image shows what happened over 5 minutes.

If I zoom in further:

This image shows what happened over ~1.25 mins.
Notice that the average multiplication time is around 113 μs and that there are peaks everywhere.

If I zoom in further:

This image shows what happened over 20 seconds.

If I zoom in further:

This image shows what happened over 3.5 seconds.
The time differences between the starting points of these peaks are 910 ms, 910 ms, and 902 ms (assuming two consecutive points are 500 μs apart).
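
Spacings like these can be extracted from the recorded latencies mechanically rather than read off the plot. The following is only a sketch: the threshold and the min_gap parameter (how many quiet samples must separate two bursts) are hypothetical tuning values, and the function names are made up for illustration.

```c
#include <stddef.h>

/* Stores the sample index at which each burst begins. A burst begins at a
   sample above threshold_us that follows at least min_gap samples at or
   below the threshold (the start of the array counts as quiet).
   Returns the number of onsets stored, at most max_onsets. */
size_t find_peak_onsets(const double *lat_us, size_t n, double threshold_us,
                        size_t min_gap, size_t *onsets, size_t max_onsets)
{
    size_t count = 0;
    size_t quiet = min_gap;
    for (size_t i = 0; i < n; i++) {
        if (lat_us[i] > threshold_us) {
            if (quiet >= min_gap && count < max_onsets)
                onsets[count++] = i;
            quiet = 0;
        } else if (quiet < min_gap) {
            quiet++;
        }
    }
    return count;
}

/* Time between two onsets in milliseconds, for a given sample period. */
double onset_spacing_ms(size_t first, size_t second, double period_us)
{
    return (double)(second - first) * period_us / 1000.0;
}
```

Two onsets that are 1820 samples apart, for instance, correspond to 910 ms at a 500 μs period.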

If I zoom in further:

This image shows what happened over 500 ms.
The average time here is ~112.6 μs, and all of the data falls within a 1 μs range.
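
The average and spread quoted for a window like this one can be computed from the recorded T1 - T0 values with a helper along these lines. It is a small sketch, and the type and function names are made up for illustration.

```c
#include <stddef.h>

typedef struct {
    double mean_us;
    double min_us;
    double max_us;
} lat_stats_t;

/* Mean, minimum and maximum of n latency samples (all zero if n == 0). */
lat_stats_t latency_stats(const double *lat_us, size_t n)
{
    lat_stats_t st = {0.0, 0.0, 0.0};
    if (n == 0)
        return st;
    double sum = 0.0;
    st.min_us = st.max_us = lat_us[0];
    for (size_t i = 0; i < n; i++) {
        sum += lat_us[i];
        if (lat_us[i] < st.min_us)
            st.min_us = lat_us[i];
        if (lat_us[i] > st.max_us)
            st.max_us = lat_us[i];
    }
    st.mean_us = sum / (double)n;
    return st;
}
```

The range of the window is then max_us - min_us, which is the quantity that stays under 1 μs in the quiet stretch.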

Here are my questions:

Given that the L3 cache is large enough to hold the complete executable, there is no file read/write, nothing else is running on the machine, and no context switches are happening, why do some of the executions take almost double (or sometimes more than double) the time? [See the peaks in the first result image.]
If we set aside those four peaks from the first image, how do I explain the periodic peaks in the results, which occur at an almost constant time difference? What is the CPU doing? These periodic peaks last a few milliseconds.
I expect the results to be near constant, as in the last image. Are there any OS/CPU settings I can apply so that the code runs like the last image indefinitely?

Here is the complete code: https://github.com/sghoslya/kite/blob/main/multiThreadProfCheckArray.c

Linux is **not** RTOS. WRT cache, are you using CAT (https://software.intel.com/content/www/us/en/develop/articles/software-enabling-for-cache-allocation-technology.html) Intel's technology which allows to pin cache as memory for certain process? — 0andriy, Jan 13 '21 at 22:39
