
Having just finished a book on computer architecture, I find myself still not entirely clear on where the scheduler is running.

What I'm looking to have clarified is where the scheduler is running: does it have its own core assigned to run it and nothing else, or is the "scheduler" in fact just a more abstract algorithm that is implemented in every thread being executed, e.g. upon preemption of a thread, a switchToFrom() routine is run?

I don't need specifics for Windows/Linux/Mac OS X, just the general picture.

  • There are three types of general schedulers: the Job scheduler, also known as the long-term scheduler; the Short-term scheduler, also known as the CPU scheduler; and the Medium-term scheduler. An operating systems book will show a nice state diagram of the states these schedulers move jobs between. The job scheduler puts things from the job queue into the ready queue; the CPU scheduler takes things from the ready queue to the running state. The algorithm is just like any other software: it must run on a CPU/core, and it is most likely part of the kernel somewhere. – Omid CompSCI Aug 14 '17 at 01:16
  • I guess adding on, if you find out where those job/ready queues etc. are located, then you also know where the scheduler code is located. Of course what makes that code run is a CPU; it doesn't necessarily have to be its own core. – Omid CompSCI Aug 14 '17 at 01:22
  • So all in all, the data associated with the scheduler is probably either in the CPU cache or RAM, I suppose. But whether the scheduler runs outside of the scheduling queue is what I'm looking for. Or can the scheduler be preempted and put on hold as well? Doesn't really make sense that it could. – Rasmus Hjorth Lüdeking Aug 14 '17 at 02:09
  • In other words my question is: when the kernel has to schedule a new task, does the kernel schedule itself to allocate the task into the scheduling queue, or does it get the CPU time without scheduling itself? – Rasmus Hjorth Lüdeking Aug 14 '17 at 02:17
  • Like you said, it doesn't make sense for the scheduler to be preempted. The jobs inside the queue can be preempted when running, for I/O, etc. No, the kernel does not have to schedule itself to allocate the task; it just gets CPU time without scheduling itself. And yes, the data is most likely in RAM; not sure if it is worth storing in the CPU cache. – Omid CompSCI Aug 14 '17 at 03:42
  • okay make an answer so I can accept it, thank you! – Rasmus Hjorth Lüdeking Aug 14 '17 at 11:49
  • Sure, will add right now, glad it can help! Upvote for a decent question! – Omid CompSCI Aug 17 '17 at 18:19

3 Answers


No, the scheduler does not run on its own core. In fact, multi-threading was common long before multi-core CPUs were.

The best way to see how scheduler code interacts with thread code is to start with a simple, cooperative, single-core example.

Suppose thread A is running and thread B is waiting on an event. Thread A posts that event, which causes thread B to become runnable. The event logic has to call the scheduler, and, for the purposes of this example, we assume that it decides to switch to thread B. At this point in time the call stack will look something like this:

thread_A_main()
post_event(...)
scheduler(...)
switch_threads(threadA, threadB) 

switch_threads will save the CPU state on the stack, save thread A's stack pointer, and load the CPU stack pointer with the value of thread B's stack pointer. It will then load the rest of the CPU state from the stack, where the stack is now stack B. At this point, the call stack has become

thread_B_main()
wait_on_event(...)
scheduler(...)
switch_threads(threadB, threadC)

In other words, thread B has now woken up in the state it was in when it previously yielded control to thread C. When switch_threads() returns, it returns control to thread B.

This kind of manipulation of the stack pointer usually requires some hand-coded assembler.
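To make the cooperative case concrete, here is a minimal sketch of my own (not taken from any particular OS), using POSIX swapcontext() to stand in for the hand-coded stack-pointer swap; thread_table, pick_next() and current are invented names:

#include <ucontext.h>

typedef struct {
    ucontext_t ctx;   /* saved CPU state, including the stack pointer */
    int runnable;     /* is this thread ready to run? */
} tcb_t;

#define NTHREADS 8
static tcb_t thread_table[NTHREADS];
static int current;

/* Trivial round-robin choice of the next runnable thread. */
static int pick_next(void)
{
    for (int i = 1; i <= NTHREADS; i++) {
        int candidate = (current + i) % NTHREADS;
        if (thread_table[candidate].runnable)
            return candidate;
    }
    return current;   /* nothing else is runnable; keep going */
}

/* Called from post_event()/wait_on_event() in thread context. */
void scheduler(void)
{
    int next = pick_next();
    if (next != current) {
        int prev = current;
        current = next;
        /* swapcontext() saves our registers and stack pointer into prev's
         * TCB and loads next's - this is the switch_threads() step above. */
        swapcontext(&thread_table[prev].ctx, &thread_table[next].ctx);
        /* When some thread later switches back to us, we resume here. */
    }
}

Setting up each thread's initial context (makecontext() plus a private stack) is left out, and a real kernel does the swap in assembler for speed and control, but the shape is the same.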

Add Interrupts

Thread B is running and a timer interrupt occurs. The call stack is now:

thread_B_main()
foo()   //something thread B was up to
interrupt_shell
timer_isr()

interrupt_shell is a special function. It is not called; it is preemptively invoked by the hardware. foo() did not call interrupt_shell, so when interrupt_shell returns control to foo(), it must restore the CPU state exactly. This is different from a normal function, which returns leaving the CPU state as the calling convention allows. Since interrupt_shell follows different rules from those set by the calling convention, it too must be written in assembler.

The main job of interrupt_shell is to identify the source of the interrupt and call the appropriate interrupt service routine (ISR) which in this case is timer_isr(), then control is returned to the running thread.
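In C-flavoured pseudocode (the real thing is assembler, and save_full_cpu_state(), irq_source() and isr_table are invented names used only for illustration), interrupt_shell looks roughly like this:

void interrupt_shell(void)
{
    /* Save every register, not just the caller-saved ones, because
     * foo() never agreed to any calling convention with us. */
    save_full_cpu_state();

    /* Work out which device raised the interrupt and run its ISR. */
    isr_table[irq_source()]();    /* e.g. timer_isr() */

    /* Put every register back and return to the interrupted code,
     * which never knows it was paused. */
    restore_full_cpu_state();
}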

Add preemptive thread switches

Suppose the timer_isr() decides that it's time for a time-slice, and that thread D is to be given some CPU time:

thread_B_main()
foo()   //something thread B was up to
interrupt_shell
timer_isr()
scheduler()

Now, scheduler() can't call switch_threads() at this point because we are in interrupt context. However, it can be called soon after, usually as the last thing interrupt_shell does. This leaves thread B's stack saved in this state:

thread_B_main()
foo()   //something thread B was up to
interrupt_shell
switch_threads(threadB, threadD)
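One common way to wire this up, extending the earlier sketch (names still invented), is for scheduler() to only record its decision and let interrupt_shell perform the switch on its way out:

static volatile int reschedule_pending;
static int next_thread;

void scheduler(void)                  /* called from inside timer_isr() */
{
    next_thread = pick_next();
    if (next_thread != current)
        reschedule_pending = 1;       /* don't switch stacks inside the ISR */
}

void interrupt_shell(void)
{
    save_full_cpu_state();
    isr_table[irq_source()]();        /* may end up calling scheduler() */
    if (reschedule_pending) {         /* last thing before returning */
        int prev = current;
        reschedule_pending = 0;
        current = next_thread;
        switch_threads(prev, next_thread);
    }
    restore_full_cpu_state();
}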

Add Deferred Service Routines

Some OSes do not allow you to do complex logic like scheduling from within ISRs. One solution is to use a deferred service routine (DSR), which runs at a higher priority than threads but lower than interrupts. These are used so that, while scheduler() still needs to be protected from being preempted by DSRs, ISRs can execute without a problem. This reduces the number of places where the kernel has to mask (switch off) interrupts to keep its logic consistent.
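A sketch of the DSR idea (again with invented names): the ISR only queues work, and the deferred routine runs later, outside interrupt context, where calling scheduler() is safe:

#define DSR_SLOTS 16

void timeslice_dsr(void);             /* hypothetical DSR that calls scheduler() */

static void (*dsr_queue[DSR_SLOTS])(void);
static volatile unsigned dsr_head, dsr_tail;

void timer_isr(void)                  /* interrupt context: keep it short */
{
    dsr_queue[dsr_tail++ % DSR_SLOTS] = timeslice_dsr;
}

void run_pending_dsrs(void)           /* run by interrupt_shell on its way out,
                                         with interrupts re-enabled */
{
    while (dsr_head != dsr_tail)
        dsr_queue[dsr_head++ % DSR_SLOTS]();
}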

I once ported some software from an OS that had DSRs to one that didn't. The simple solution was to create a "DSR thread" that ran at a higher priority than all other threads. The "DSR thread" simply replaces the DSR dispatcher that the other OS used.

Add traps

You may have observed that, in the examples I've given so far, we call the scheduler from both thread and interrupt contexts. There are two ways in and two ways out. It looks a bit weird, but it does work. However, moving forward, we may want to isolate our thread code from our kernel code, and we do this with traps. Here is the event posting redone with traps:

thread_A_main()
post_event(...)
user_space_scheduler(...)
trap()
interrupt_shell
kernel_space_scheduler(...)
switch_threads(threadA, threadB) 

A trap causes an interrupt or an interrupt-like event. On the ARM CPU they are known as "software interrupts" and this is a good description.

Now all calls to switch_threads() begin and end in interrupt context, which, incidentally, usually happens in a special CPU mode. This is a step towards privilege separation.
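You can see this pattern from user space on a real system: on Linux, for example, sched_yield() is essentially just a trap into the kernel's scheduler (the exact trap instruction and calling mechanism differ per OS and CPU):

#include <sched.h>

int main(void)
{
    /* The library wrapper executes the CPU's software-interrupt/syscall
     * instruction; the kernel's scheduler then runs in privileged mode
     * and may switch to another thread before control returns here. */
    sched_yield();
    return 0;
}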

As you can see, scheduling wasn't built in a day. You could go on:

  • Add a memory mapper
  • Add processes
  • Add multiple Cores
  • Add hyperthreading
  • Add virtualization

Happy reading!

Rodney

Each core is separately running the kernel, and cooperates with other cores by reading / writing shared memory. One of the shared data structures maintained by the kernel is the list of tasks that are ready to run, and are just waiting for a timeslice to run in.

The kernel's process / thread scheduler runs on the core that needs to figure out what to do next. It's a distributed algorithm with no single decision-making thread.

Scheduling doesn't work by figuring out what task should run on which other CPU. It works by figuring out what this CPU should do now, based on which tasks are ready to run. This happens whenever a thread uses up its timeslice, or makes a system call that blocks. In Linux, even the kernel itself is pre-emptible, so a high-priority task can be run even in the middle of a system call that takes a lot of CPU time to handle. (e.g. checking the permissions on all the parent directories in an open("/a/b/c/d/e/f/g/h/file", ...), if they're hot in VFS cache so it doesn't block, just uses a lot of CPU time).

I'm not sure if this is done by having the directory-walking loop in (a function called by) open() "manually" call schedule() to see if the current thread should be pre-empted or not. Or maybe it's just that tasks waking up will have set some kind of hardware timer to fire an interrupt, and the kernel in general is pre-emptible if compiled with CONFIG_PREEMPT.

There's an inter-processor interrupt mechanism to ask another core to schedule something on itself, so the above description is an over-simplification. (e.g. for Linux run_on to support RCU sync points, and TLB shootdowns when a thread on another core uses munmap). But it's true that there isn't one "master control program"; generally the kernel on each core decides what that core should be running. (By running the same schedule() function on a shared data-structure of tasks that are ready to run.)


The scheduler's decision-making is not always as simple as taking the task at the front of the queue: a good scheduler will try to avoid bouncing a thread from one core to another (because its data will be hot in the caches of the core it was last running on, if that was recent). So to avoid cache thrashing, a scheduler algorithm might choose not to run a ready task on the current core if it was just running on a different core, instead leaving it for that other core to get to later. That way a brief interrupt-handler or blocking system call wouldn't result in a CPU migration.

This is especially important in a NUMA system, where running on the "wrong" core will be slower long-term, even once the caches populate.
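As a user-space aside (Linux-specific, and only a sketch): if the cache-affinity effect described above matters for your workload, you can pin a thread to a core with sched_setaffinity() and take the migration decision away from the scheduler entirely:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                 /* allow this thread on core 0 only */
    if (sched_setaffinity(0, sizeof set, &set) != 0)  /* pid 0 = calling thread */
        perror("sched_setaffinity");
    printf("now restricted to core 0\n");
    return 0;
}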

Peter Cordes
  • Is the part about "[each CPU figures out] what this CPU should do now" really true? If one thread/process unblocks another one (e.g., exits a mutex it was waiting on), doesn't the active thread essentially kick off the newly available work on "another" CPU? Certainly it doesn't wait around for the next timer tick on an idle CPU to determine that there is ready work. I'm not sure how much of the scheduler runs in the active "kicking off thread" versus the "kicked off thread" (e.g., perhaps the former only wakes the latter and does little else), but certainly the initiating CPU is involved. – BeeOnRope Aug 18 '17 at 19:41
  • @BeeOnRope: Good question. IDK if it sends an IPI (Inter-Processor Interrupt) in this case, if it can see that the other core is doing something it should interrupt. Hmm, on x86 the kernel sleeps with `mwait`. Maybe it actually uses `monitor` to set up an address such that a write from another core will wake it up, as an alternative to an IPI. I'm almost certain that the decision-making for the current CPU is done *on* the current CPU. I think prompting a sleeping CPU to `schedule()` is more likely done in the mutex-unlock code, not in the scheduler itself (on Linux). – Peter Cordes Aug 18 '17 at 19:57
  • Well assume the other core(s) are all sleeping. Either with `mwait` or `hlt` or whatever. Certainly the `mutex-unlock` code which is at least partly user-mode code isn't going to tell another CPU to wake up: those things work with the "thread" (or process for cross-process constructs) abstraction, not the CPU one. They simply tell the kernel "unpark this thread" and then the kernel decides if that _thread_ should now start running on a _CPU_ (e.g., this is what `FUTEX_WAKE` would do kernel-side - I don't even find the [pre-futex calls](https://stackoverflow.com/q/45764378/149138)?). – BeeOnRope Aug 18 '17 at 20:07
  • So it is guaranteed that the "scheduler" at least partially runs on the waking thread in order to decide (a) if a sleeping core should be woken or if an active core should be interrupted and (b) to actually send the "signal" (could be an IPI or a write to a `mwait` address or firing the timer interrupt immediately). So the scheduler is definitely involved on a CPU other than the one which ultimately runs the task. The only question is if the woken CPU really does recalculate everything and then usually just run the thread it was woken for, or if it's more of a handoff. – BeeOnRope Aug 18 '17 at 20:13
  • @BeeOnRope: The end of https://stackoverflow.com/questions/23908711/linux-pthread-mutex-and-kernel-scheduler describes the unlock process briefly. In user-space, all you know is that other threads are waiting for the kernel to wake them when the lock becomes available, so you make a system call. I think that inside that system call, the kernel does whatever it takes to wake them and the cores they should run on. This code probably has to know about the design and data-structures of the scheduler, but it's not strictly part of the `schedule()` function or the scheduler per se. – Peter Cordes Aug 18 '17 at 20:14
  • @BeeOnRope: I think in computer science operating system theory, you'd consider that just an optimization on top of the usual distributed-scheduler algorithm where each CPU decides for itself what to run next. e.g. it might still decide to run something else after being woken because of the futex. (update, just saw your 2nd comment. Yeah, I'm not sure. Possibly the futex code overrides the usual scheduling and a woken CPU runs the waiting thread without running `schedule()`.) – Peter Cordes Aug 18 '17 at 20:16
  • Regarding _the kernel does whatever it takes to wake them and the cores they should run on_ - that is absolutely the scheduler. Unless you and I have a very different definition of the scheduler. Imagine a time before per-emptive scheduling - in that world it was **100%** of the scheduler! Yes, now you also have the part of the scheduler that runs in the timer interrupt on the CPU that will end up running the chosen task, but the parts that run when a thread is unblocked (or when `fork()` is called etc) are the same code operating on the same state. – BeeOnRope Aug 18 '17 at 20:19
  • I could argue that the scheduler running _on_ the CPU is just an optimization over cooperative multi-tasking :). Yes, it is seems possible to implement a scheduler that only ever makes decisions "for itself", but it would imply a 0.5 * timeslice-quantum latency to almost _every_ synchronous scheduling decision (other than ones where the decision to suspend the current thread and run another thread on this CPU, e.g., because you woke a higher-priority thread). That's not an obscure optimization: it's how schedulers work, even in (good) textbooks. – BeeOnRope Aug 18 '17 at 20:28
  • @BeeOnRope: Pre-emptive isn't different from cooperative in that respect. The same scheduler code runs (on the CPU that needs to decide what to do next) whether it's reached from a timer interrupt or from `yield()`, `fork()`, or a system call that blocks for I/O. Cooperative just means that it can't run from timer interrupts. Sleeping is what makes this more complicated: without that you'd just mark a task as unblocked. Idle CPUs would be spinning on `schedule()` instead of sleeping. – Peter Cordes Aug 18 '17 at 20:34
  • I should have said cooperative. In that case the scheduler pretty much _always_ runs on the currently active CPU, and often makes decisions on behalf of other CPUs. Sleeping isn't critical here: even if the other thread was just looping on `schedule()` the waking thread would have to invoke the scheduler code and modify the shared structures so that the spinning thread will see the runnable task. Let's just be clear: essentially _all_ modern schedulers are invoked both synchronously (as a result of any syscall that might change scheduler state) and asynchronously (e.g., timer tick). – BeeOnRope Aug 18 '17 at 20:40
  • Synchronous invocation involves the scheduler on the waking thread _and_ the woken thread (sometimes they are the same thread, as in `yield`, but very often and importantly they are not). The involvement of the waking thread is the full deal due to sleeping as you point out: it has to wake the _right_ thread, and to do that it has to invoke the full scheduler logic. – BeeOnRope Aug 18 '17 at 20:43
  • @BeeOnRope: You have a point about synchronously unblocking a higher-priority task. You don't want to just run `schedule()` to let it take over this CPU if there are other CPUs sitting idle. Probably the decision of what to run on the current CPU depends on what the other CPUs are doing. That's not the same thing as deciding what task another CPU should run and actually sending it a message to that effect, though. Still, I guess you should consider OS-supported mutexes as part of the scheduler. That's a good point. – Peter Cordes Aug 18 '17 at 20:47
  • Well higher priority has nothing to do with it really (in fact, that was an example I was putting in your column: a case where the waking and woken threads are in fact the same, just like `yield` - although it does depend on what else is running). I'm just talking about the much more usual case of a thread waking another (equal priority) thread. In that case the scheduler is synchronously invoked to decide _exactly_ what CPU to run the task on, and that CPU is woken. Does that CPU redundantly re-calculate what task it should run? Maybe. – BeeOnRope Aug 18 '17 at 20:54
  • @BeeOnRope: I'd been assuming that a CPU always runs `schedule()` to decide for itself, but you raise a good point. My mental model only works well with idle CPUs busy-waiting for tasks to run, so a thread that wakes another thread (e.g. unlocking a futex) doesn't have to figure out which CPU it should run on. The idle CPUs all attempt (in parallel) to run it, and one of them wins. Obviously it doesn't actually work like that, but I had assumed that waking a sleeping CPU was just an optimization on top of that model. You might be right that it isn't really. – Peter Cordes Aug 18 '17 at 21:02
  • @BeeOnRope: Does unlocking a futex when no other CPUs are idle actually pre-empt a low-priority userspace task on another CPU to run a just-woken thread that was waiting for the lock (on Linux)? If fast waking only happens for CPUs that were idle, then my model is not far wrong. – Peter Cordes Aug 18 '17 at 21:04
  • Well that's mostly semantics :). Whether the actual situation is A + optimization or B, really depends on a judgement call regarding what's an optimization and what's the core algorithm :). On many core systems you can be pretty sure that you don't want a thundering herd of cores all competing to schedule. – BeeOnRope Aug 18 '17 at 21:06
  • About the futex unlock case, it depends heavily on the scheduler used. On Windows at least, it often happens because unblocked threads get a brief "priority boost" which tends to allow them to be scheduled immediately. Often you are trading off interactivity (favors pre-emption) versus throughput (favors letting threads use their existing quantum). Since Linux has a variety of schedulers which dynamically "score" each thread's priority to run, I'm quite sure either scenario can occur, depending on the scores. – BeeOnRope Aug 18 '17 at 21:08
  • I only used the "idle CPUs" case because then you are pretty much guaranteed that any non-terrible scheduler will decide to schedule the thread on _some_ other CPU. If all CPUs are busy, nothing changes theoretically: the scheduler is still synchronously invoked, and may decide to schedule the woken thread on (a) this CPU (b) another CPU or (c) leave it unscheduled but in the runlist. So the question of where the scheduler runs isn't really affected - there is no `if (idle_cpus) ...` check up front, it just falls out of the algorithm, which will re-evaluate scheduling decisions synchronously. – BeeOnRope Aug 18 '17 at 21:12
  • @BeeOnRope: https://en.wikipedia.org/wiki/Completely_Fair_Scheduler (Linux) unfortunately doesn't say much about the multi-core implementation details. :( Good point what's an optimization and what's really a different algorithm. You're probably right about waking a thread on another core being fundamentally different than just marking it as runable in the task queue. – Peter Cordes Aug 18 '17 at 21:12
  • @BeeOnRope: You're still talking about taking one task and deciding where it should run, rather than taking a CPU and deciding which task should be running on it. The way scheduler algorithms are usually described, they use the latter model. e.g. that CFS wiki link describes the steps for "When the scheduler is invoked" in those terms. But of course it totally ignores optimizing for synchronously-woken/created threads (fork() or futex_wake). – Peter Cordes Aug 18 '17 at 21:21
  • It's really two ways of talking about the API to one underlying set of rules implemented by some algorithm and associated structures. Yes, that article and a lot of other discussion on frames the core API as `what to run on this CPU`, but the underlying model for schedulers can answer other questions too, like "where, if anywhere, should this task run". They are both just queries to be answered by the data structures maintained by the scheduler. In fact, the kind of "wake" actions extend beyond "wake this thread" to "wake all threads waiting on X", supported by `futex` for example. – BeeOnRope Aug 18 '17 at 21:35
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152301/discussion-between-beeonrope-and-peter-cordes). – BeeOnRope Aug 18 '17 at 21:41

There are three types of general schedulers:

Job scheduler, also known as the long-term scheduler.

Short-term scheduler, also known as the CPU scheduler.

Medium-term scheduler, mostly used to swap jobs in and out of memory so calls can be non-blocking. This is usually to avoid having too many I/O-bound jobs, or too few.

An operating systems book will show a nice state diagram of the states these schedulers move jobs between. The job scheduler puts things from the job queue into the ready queue; the CPU scheduler takes things from the ready queue to the running state. The scheduling algorithm is just like any other software: it must run on a CPU/core, and it is most likely part of the kernel somewhere.
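A toy sketch of that flow (all names invented; a real kernel is far more involved):

#define MAX_JOBS 32

typedef enum { NEW, READY, RUNNING } state_t;
typedef struct { int id; state_t state; } job_t;

static job_t job_table[MAX_JOBS];    /* every job admitted to the system */
static int   ready_queue[MAX_JOBS];  /* indices of jobs in the READY state */
static int   ready_count;

/* Long-term (job) scheduler: move a job from the job queue to the ready queue. */
void admit(int job)
{
    job_table[job].state = READY;
    ready_queue[ready_count++] = job;
}

/* Short-term (CPU) scheduler: pick a ready job and mark it running. */
int dispatch(void)
{
    if (ready_count == 0)
        return -1;                   /* nothing to run: the CPU idles */
    int job = ready_queue[--ready_count];
    job_table[job].state = RUNNING;
    return job;
}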

It doesn't make sense for the scheduler itself to be preempted. The jobs inside the queue can be preempted while running, for I/O, etc. No, the kernel does not have to schedule itself to allocate a task; it just gets CPU time without scheduling itself. And yes, the data is most likely in RAM; I'm not sure it is worth storing in the CPU cache.

Omid CompSCI