How does schedule()+switch_to() functions from linux kernel actually work?

Question

I'm trying to understand how the schedule process in linux kernel actually works. My question is not about the scheduling algorithm. Its about how the functions schedule() and switch_to() work.

I'll try to explain. I saw that:

When a process runs out of time-slice, the flag need_resched is set by scheduler_tick(). The kernel checks the flag, sees that it is set, and calls schedule() (pertinent to question 1) to switch to a new process. This flag is a message that schedule should be invoked as soon as possible because another process deserves to run. Upon returning to user-space or returning from an interrupt, the need_resched flag is checked. If it is set, the kernel invokes the scheduler before continuing.

Looking into the kernel source (linux-2.6.10 - version that the book "Linux Kernel Development, second edition" is based on), I also saw that some codes can call the schedule() function voluntarily, giving another process the right to run. I saw that the function switch_to() is the one that actually does the context switch. I looked into some architecture dependent codes, trying to understand what switch_to() was actually doing.

That behavior raised some questions that I could not find the answers for :

When switch_to() finishes, what is the current running process? The process that called schedule()? Or the next process, the one that was picked to run?
When schedule() gets called by an interrupt, the selected process to run starts to run when the interrupt handling finishes (after some kind of RTE) ? Or before that?
If the schedule() function can not be called from an interrupt, when is the flag- need_resched set?
When the timer interrupt handler is working, what stack is being used?

I don't know if I could make myself clear. If I couldn't, I hope I can do this after some answers (or questions). I already looked at several sources trying to understand that process. I have the book "Linux Kernel Development, sec ed", and I'm using it too. I know a bit about MIPs and H8300 architecture, if that help to explain.

bdonlan · Answer 1 · 2011-06-30T02:04:53.533

37

After calling switch_to(), the kernel stack is switched to that of the task named in next. Changing the address space, etc, is handled in eg context_switch().
schedule() cannot be called in atomic context, including from an interrupt (see the check in schedule_debug()). If a reschedule is needed, the TIF_NEED_RESCHED task flag is set, which is checked in the interrupt return path.
See 2.
I believe that, with the default 8K stacks, Interrupts are handled with whatever kernel stack is currently executing. If 4K stacks are used, I believe there's a separate interrupt stack (automatically loaded thanks to some x86 magic), but I'm not completely certain on that point.

To be a bit more detailed, here's a practical example:

An interrupt occurs. The CPU switches to an interrupt trampoline routine, which pushes the interrupt number onto the stack, then jmps to common_interrupt
common_interrupt calls do_IRQ, which disables preemption then handles the IRQ
At some point, a decision is made to switch tasks. This may be from the timer interrupt, or from a wakeup call. In either case, set_task_need_resched is invoked, setting the TIF_NEED_RESCHED task flag.
Eventually, the CPU returns from do_IRQ in the original interrupt, and proceeds to the IRQ exit path. If this IRQ was invoked from within the kernel, it checks whether TIF_NEED_RESCHED is set, and if so calls preempt_schedule_irq, which briefly enables interrupts while performing a schedule().
If the IRQ was invoked from userspace, we first check whether there's anything that needs doing prior to returning. If so, we go to retint_careful, which checks both for a pending reschedule (and directly invokes schedule() if needed) as well as checking for pending signals, then goes back for another round at retint_check until there's no more important flags set.
Finally, we restore GS and return from the interrupt handler.

As for switch_to(); what switch_to() (on x86-32) does is:

Save the current values of EIP (instruction pointer) and ESP (stack pointer) for when we return to this task at some point later.
Switch the value of current_task. At this point, current now points to the new task.
Switch to the new stack, then push the EIP saved by the task we're switching to onto the stack. Later, a return will be performed, using this EIP as the return address; this is how it jumps back to the old code that previously called switch_to()
Call __switch_to(). At this point, current points to the new task, and we're on the new task's stack, but various other CPU state hasn't been updated. __switch_to() handles switching the state of things like the FPU, segment descriptors, debug registers, etc.
Upon return from __switch_to(), the return address that switch_to() manually pushed onto the stack is returned to, placing execution back where it was prior to the switch_to() in the new task. Execution has now fully resumed on the switched-to task.

x86-64 is very similar, but has to do slightly more saving/restoration of state due to the different ABI.

edited Jun 30 '11 at 02:04

answered Jun 29 '11 at 22:17

bdonlan

224,562
31
268
324

Sorry, I still don't get it. Example: suppose we have a task 'A' running. 1 - A timer interrupt occurs. 2 - Timer interrupt handler starts. 3 - It's time to call schedule() and we do it. 4 - Task 'B' was chosen by schedule(), switch_to() has already finished, and task 'B' is the current task (now we are using the stack of task 'B' and we are still running interrupt code). 5 - Timer interrupt finishes, and resumes the execution of task 'B'. Is this example correct? If it's not, how does the process happen? – derf Jun 30 '11 at 00:55
Thank you for your time and patience, but I think I'm missing some point. You are providing me more information than I can handle at this moment (and I'm very greatful for that, of course). Could you explain in a more general way? What I really don't understand is: what does happen when switch_to() finishes? Does the chosen task (next) start to run at this point, before interrupt code returns? which stack is the current at this point? – derf Jun 30 '11 at 01:35
I think I get it. To be straight, we have no interrupts, let's ignore all the others details and work only with switch_to(). Suppose I have two tasks, A and B, both of them with 10 instructions to do. Task A is executing instruction 3 and task B is executing instruction 6. Task A is the current task. So, task A calls switch_to(), leaving task A at instruction 3, and resuming task B in instruction 6. Task B goes to 7, and calls switch_to(), leaving B at 7, and resuming A at 3. Task A goes to 4, calls switch_to() again, and the process keep going. Is that correct? – derf Jun 30 '11 at 02:25
More or less, except there are a few more layers of function calls around there, so in practice there are only a few points that switch_to returns to. – bdonlan Jun 30 '11 at 02:29
But, when switch_to() goes from task A to task B, task B actually starts to run after the switch? Cause, I understand if it does from one task jumping directly to another, but if switch_to() get called from an interrupt (from schedule()), for example, it seems to me, that this is going to leave the interrupt code before it finishes, and jump directly to the new task. If it was right, the new task would start in kernel mode, no matter what, cause the IRET would not get executed. There is something I'm missing that make this scenario make no sense. – derf Jun 30 '11 at 02:42
switch_to() itself never transitions from user<->kernel. When a switch happens on interrupt return, after it transitions back later it still goes through the rest of the interrupt return code, which performs the actual kernel->user transition – bdonlan Jun 30 '11 at 02:44
OK, I know that the switch_to() doesn't do the transition kernel<->user modes. Lets do that by steps, I'm pretty slow as you can see. LOL. Task A calls schedule(), task B is chosen as the next, and switch_to() get called. Does task B start to run after switch_to() returns? Is switch_to() that actually starts to run the task B code? – derf Jun 30 '11 at 02:55
It's all kernel code, so it's hard to call the code itself 'task A' or 'task B'. It simulates a function call into __switch_to from task B, so as a result, when it returns, it returns to where it needs to resume from in task B. – bdonlan Jun 30 '11 at 02:57
I understand that its all kernel code. What I was trying to do was isolate the steps. Lets say that we are talking about threads, instead of tasks. Can one kernel thread call schedule() to switch to another kernel thread? (assuming that we only have these kernel threads). – derf Jun 30 '11 at 03:05
Sure. It's not guaranteed to _actually_ switch just by calling `schedule()` (the other thread might be sleeping, or it might be on a different CPU or something). But it can ask the scheduler to switch to some other thread, and if conditions are right, it will indeed switch. – bdonlan Jun 30 '11 at 03:09
Can we talk via email? I think I'm doing something wrong here.. I think it was not suppose to be a chat, and I can not move this discussion to a chat. I can show you my email, if you agree, of course. – derf Jun 30 '11 at 03:13
@derf, get 20 reputation points, then click the 'automatically move this discussion to chat' thing :) Or if you prefer, go over to #c++ @ irc.rizon.net and look for bd_ – bdonlan Jun 30 '11 at 03:14
OK, I think Im almost there. So, lets keep aside for now that details about another CPU or sleep mode.. lets imagine that we have to running threads, at same CPU.. when one thread calls schedule(), the other thread get resumed in what point exactly? at the end of switch_to()? (sorry for been so repetitious, I need to see it clearly to garantee that we are on the same page, so I can explain my point) – derf Jun 30 '11 at 03:21
When you're focused this far in, it's hard to answer exactly what point the other thread is resumed. It's being resumed, piece by piece its state gets loaded in. When do you consider it resumed? When `current` is updated? When its stack is in use? When FP state has been restored? When its page table is in? When it becomes state R=Running? When the function that suspended it resumes execution? – bdonlan Jun 30 '11 at 03:24
When the function that suspended it resumes execution. At this point, all the other stuff are done, right? – derf Jun 30 '11 at 03:26
Kindasorta. switch_to() itself is inlined into context_switch(), so depending on what you call 'the function that suspended it', either after __switch_to() returns, or after context_switch() returns, or after schedule() returns... – bdonlan Jun 30 '11 at 03:30
OK, so I just jump from on position to another from context_switch(), right? something like that: thread A calls schedule() and at some point of it, thread A execution get suspended, switching to thread B.. later, thread B calls schedule() and at some point of it, thread B get suspended, switching to thread A, that get resumed at some point of schedule() inside of thread's A code (cause it is inlined).. schedule() returns from last call and thread A keep running.. is that correct? – derf Jun 30 '11 at 03:38
OK, I hope we are almost there :) What is the "more or less" in that case? – derf Jun 30 '11 at 03:51
Still using the last example.. I was thinking about two threads, calling schedule() voluntarily.. so, thread B called schedule(), getting suspended in some point of schedule(), and switching to some point of schedule() of thread A.. now, during the execution of thread A, an timer interrupt occurred, and schedule() was called from there (this will switch again back to thread B).. but if I call schedule() before the interrupt completely finish, it would not leave the interrupt uncompleted and resume to some point of schedule() inside of thread B? – derf Jun 30 '11 at 04:05
@derf, the interrupt handler would be called prior to the schedule. Part of the interrupt cleanup code will be deferred until it gets to run again. Anyway, this is getting a bit too long, if you need additional clarification, either hop on IRC, or get enough rep to use SO chat (one more upvote is enough), or ask another follow up question :) – bdonlan Jun 30 '11 at 04:12
OK, Im trying to get enough rep :) – derf Jun 30 '11 at 04:22
@bdonlan let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/995/discussion-between-derf-and-bdonlan) – derf Jun 30 '11 at 04:28

How does schedule()+switch_to() functions from linux kernel actually work?

1 Answers1

Linked