
I have one program, P1, with N (= 100) threads. Initially all threads are blocked on a semaphore except thread 0.

// Program - P1
// build: gcc p1.c -o p1 -lpthread -lrt
// (The declarations below are assumed; the original question omits them.)
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <time.h>
#include <sched.h>
#include <stdint.h>
#include <pthread.h>
#include <semaphore.h>

#define NUM_THREADS 100
#define SIG         SIGRTMIN        /* assumed signal number */
#define CLOCKID     CLOCK_REALTIME  /* assumed clock */
#define DELAY1      1700            /* timer interval in ns (varied in tests) */

static sem_t sem[NUM_THREADS];
static pthread_t tid[NUM_THREADS];
static timer_t timerid;
static struct sigevent sev;
static struct sigaction sa;
static struct itimerspec its;
static volatile int thread_no;
static int ret, err, i, data;

// Timer handler: wake the next thread in round-robin order.
static void handler(int sig, siginfo_t *si, void *uc)
{
    thread_no++;
    ret = sem_post(&sem[thread_no % NUM_THREADS]);
    if (ret)
    {
        printf("Error in sem_post\n");  /* NB: printf is not async-signal-safe */
    }
}

void *threadA(void *data_)
{
    int turn = (intptr_t)data_;

    // Pin this thread to CPU core 1 so every thread shares one core.
    cpu_set_t my_set;
    CPU_ZERO(&my_set);
    CPU_SET(1, &my_set);
    sched_setaffinity(0, sizeof(cpu_set_t), &my_set);

    while (1)
    {
        // Block until the timer handler posts this thread's semaphore.
        ret = sem_wait(&sem[turn]);
        if (ret)
        {
            printf("Error in sem_wait\n");
        }

        // does some work here

        // Arm the shared one-shot timer; its expiry wakes the next thread.
        its.it_value.tv_sec = 0;
        its.it_value.tv_nsec = DELAY1;
        its.it_interval.tv_sec = 0;
        its.it_interval.tv_nsec = 0;

        ret = timer_settime(timerid, 0, &its, NULL);
        if (ret < 0)
            perror("timer_settime");
    }
}

int main(int argc, char *argv[])
{
    // Install the handler invoked by the timer signal.
    sa.sa_flags = SA_RESTART | SA_SIGINFO;  /* SA_SIGINFO is required for sa_sigaction */
    sa.sa_sigaction = handler;
    sigemptyset(&sa.sa_mask);
    err = sigaction(SIG, &sa, NULL);
    if (err != 0) {
        printf("sigaction failed\n");
    }

    // One timer, shared by all threads, delivering SIG on expiry.
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIG;
    sev.sigev_value.sival_ptr = &timerid;
    ret = timer_create(CLOCKID, &sev, &timerid);
    if (ret < 0)
        perror("timer_create");

    // Thread 0 starts runnable; every other thread blocks on its semaphore.
    sem_init(&sem[0], 0, 1);
    for (i = 1; i < NUM_THREADS; ++i)
    {
        sem_init(&sem[i], 0, 0);
    }

    // Create the threads.
    data = 0;
    while (data < NUM_THREADS)
    {
        err = pthread_create(&tid[data], NULL, threadA, (void *)(intptr_t)data);
        if (err != 0)
            printf("\ncan't create thread: [%s]", strerror(err));
        data++;
    }

    // Keep main alive; returning here would terminate all the threads.
    for (i = 0; i < NUM_THREADS; ++i)
        pthread_join(tid[i], NULL);
}

I have created one timer using timer_create() inside program P1 (all threads use the same timer) and arm it with interval T. When the interval expires, the timer handler is called, which posts the semaphore of the next thread, i+1, to wake it.

Here is how my program works:

Step 1: Thread 0 does some work, sets the timer, and blocks (releasing the CPU).

Step 2: On timer expiration, the timer handler is called.

Step 3: Thread 1 is woken, does some work, sets the timer, and blocks (releasing the CPU).

Step 4: On timer expiration, the timer handler is called.

Step 5: Thread 2 is woken, does some work, sets the timer, and blocks (releasing the CPU).

...

Step 2n-1: Thread n-1 is woken, does some work, sets the timer, and blocks (releasing the CPU).

Step 2n: On timer expiration, the timer handler is called.

Step 2n+1: Thread 0 is woken, does some work, sets the timer, and blocks (releasing the CPU).

This works fine: all my threads are woken by the timer handler and run in round-robin sequence.

I have another program, P2, which has been running continuously for a long time (so its vruntime is very high), started even before P1, and P2 is pinned to the same CPU core as P1 (and all of its threads).

// Program - P2
#define _GNU_SOURCE
#include <sched.h>

int main(int argc, char *argv[])
{
    // Pin P2 to the same CPU core (core 1) as all of P1's threads.
    cpu_set_t my_set;
    CPU_ZERO(&my_set);
    CPU_SET(1, &my_set);
    sched_setaffinity(0, sizeof(cpu_set_t), &my_set);

    // Busy loop forever.
    while (1) {
        // does some task
    }
}

So, whenever none of P1's threads is runnable, P2 should be running.

So, my expectation is: at Step 1, thread 0 releases the CPU, so P2 should be scheduled; the moment the timer expires and the next thread wakes at Step 3, P2 should be preempted immediately under the CFS scheduling policy (since the woken thread's vruntime is very low compared to P2's). That is, I expect P2 to be scheduled between Step 1 and Step 3, between Step 3 and Step 5, and so on, whenever all threads are blocked, the CPU is free, and the next thread has not yet been woken.
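For reference, here is my mental model of how CFS makes this preemption decision: a simplified, self-contained sketch based on my reading of wakeup_preempt_entity() in kernel/sched/fair.c. It is not the verbatim kernel code; in particular the real wakeup granularity is scaled by the woken task's load weight.

/* Simplified model of CFS wakeup preemption (cf. wakeup_preempt_entity()
   in kernel/sched/fair.c). Not verbatim kernel code. */
#include <stdio.h>

typedef long long s64;
struct sched_entity { s64 vruntime; };

static s64 wakeup_gran(void)
{
    return 1000000;  /* sched_wakeup_granularity_ns: ~1 ms by default */
}

/* Returns 1 if the newly woken 'se' should preempt the running 'curr'. */
static int wakeup_preempt_entity(struct sched_entity *curr,
                                 struct sched_entity *se)
{
    s64 vdiff = curr->vruntime - se->vruntime;

    if (vdiff <= 0)
        return -1;              /* woken task is not behind: no preemption */
    if (vdiff > wakeup_gran())
        return 1;               /* far enough behind: preempt immediately */
    return 0;
}

int main(void)
{
    struct sched_entity p2    = { .vruntime = 500000000000LL }; /* long-running */
    struct sched_entity woken = { .vruntime = 1000000LL };      /* P1 thread */
    printf("preempt P2? %d\n", wakeup_preempt_entity(&p2, &woken)); /* -> 1 */
    return 0;
}

By this logic a freshly woken P1 thread, whose vruntime trails P2's by far more than the wakeup granularity, should preempt P2 as soon as it becomes runnable.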

To capture every process involved in a context switch, I modified the kernel (kernel/sched/core.c) and added the following trace statement inside the context_switch() function:

trace_printk("**$$,context_switch,%d,%llu,%llu,%d,%llu,%llu\n",
             (int)(prev->pid), prev->se.vruntime, prev->se.sum_exec_runtime,
             (int)(next->pid), next->se.vruntime, next->se.sum_exec_runtime);

This gives me, for every switch, the details of the process being scheduled in and the one being scheduled out. From this information, I calculated how much CPU time P2 gets on each run before being preempted.
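To turn the trace into per-run times for P2, I use something like the following (a minimal sketch; it assumes each trace line has already been reduced to "timestamp_ns prev_pid next_pid", and P2_PID is a placeholder for P2's actual pid):

/* Accumulate how long P2 runs between being switched in and switched out.
   Input format and P2_PID are assumptions, not from the actual trace. */
#include <stdio.h>

#define P2_PID 1234   /* hypothetical pid of P2 */

int main(void)
{
    unsigned long long ts, in_ts = 0;
    int prev, next;

    while (scanf("%llu %d %d", &ts, &prev, &next) == 3) {
        if (next == P2_PID)
            in_ts = ts;                              /* P2 switched in  */
        else if (prev == P2_PID && in_ts)
            printf("P2 ran %llu ns\n", ts - in_ts);  /* P2 switched out */
    }
    return 0;
}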

Here are some of my findings that I am unable to explain:

  1. With CPU frequency = 2.2 GHz, if I set the timer interval to 1700 ns or less, P2 does not get scheduled between two threads. Why is P2 not scheduled even though no other process/thread is running and the CPU is free for 1700 ns?

  2. With CPU frequency = 3.4 GHz, if I set the timer interval to 1000 ns or less, P2 does not get scheduled between two threads. Why is P2 not scheduled even though no other process is running and the CPU is free for 1000 ns?

So with different CPU frequencies and correspondingly different timer intervals, the other process P2 is never scheduled. Why?

Is there any relationship between the CPU frequency and the timer expiration time? Is it related to the context-switch time?
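One back-of-envelope check (my own arithmetic, from the two findings above):

1700 ns x 2.2 GHz ~ 3740 cycles
1000 ns x 3.4 GHz ~ 3400 cycles

Both thresholds work out to roughly the same number of CPU cycles, which might suggest the limiting factor is a fixed amount of work (timer interrupt, signal delivery, and two context switches) costed in cycles, rather than a fixed amount of wall-clock time.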

My last question: if a context switch is in progress and a new, higher-priority process becomes ready, will the current context switch be completed and the switched-to task execute for some time before the higher-priority process is scheduled, or will the in-progress context switch be stopped and the higher-priority process scheduled immediately?

I am using a Core i7 @ 3.40 GHz, Ubuntu 16.04, cgroups are disabled, and both P1 and P2 are run from the same terminal. cpupower is used to set the CPU frequency.

Thanks in advance.

bholanath
  • Instead of hacking the kernel you might just be able to use `perf sched record` to get scheduling information – Stephan Dollberg Oct 22 '18 at 20:11
  • Can you actually provide a compilable example of both P1 and P2? Then people might be able to help. – Stephan Dollberg Oct 22 '18 at 20:52
  • 1000ns or even 1700ns is an **extremely** small amount of time with respect to scheduling. On many cpus it's not even possible for a trivial syscall to complete in 1000ns, much less a context switch. – R.. GitHub STOP HELPING ICE Oct 22 '18 at 21:05
  • @R.., sounds logical. I am wondering how to verify it: I calculated the context-switch time using another program available on SO, but that timing is reported by a different timer (clock_gettime vs. the kernel's), so I was not able to cross-verify. – bholanath Oct 23 '18 at 05:51
  • @bholanath: Bind two threads to one cpu. In thread A wait for spinlock. In thread B, perform an atomic write to release the spinlock then enter a tight loop performing rdtsc and storing result in an atomic. as soon as thread A acquires the spinlock, read the atomic rdtsc count from thread B and compare it to a new rdtsc. That's your context switch time, from userspace to userspace. – R.. GitHub STOP HELPING ICE Oct 23 '18 at 14:17
