I have a question about how scheduling is done. I know that when a system has multiple CPUs, scheduling is usually done on a per-processor basis: each processor runs its own scheduler and accesses a ready list containing only the processes assigned to it.

So what would be the pros and cons when compared to an approach where there is a single ready list that all processors share?

For example, what issues are there when assigning processes to processors, and what issues might be caused if a process always lives on one processor? And regarding mutex locking of the scheduling data structures and the time spent waiting for those locks, are there any issues there?

General Grievance

1 Answer

Generally, there is one giant problem when it comes to multi-core CPU systems - cache coherency.

What does cache coherency mean?

Access to main memory is slow. Depending on the memory technology and frequency, it can take on the order of hundreds of CPU cycles to access data in RAM - that's a whole lot of time the CPU is doing no useful work. It'd be significantly better if we minimized this time as much as possible, but the hardware required to do this is expensive, and typically must be in very close proximity to the CPU itself (we're talking within a few millimeters of the core).

This is where the cache comes in. The cache keeps a small subset of main memory in close proximity to the core, allowing accesses to that memory to be orders of magnitude faster than accesses to main memory. For reads this is a simple process: if the data is in the cache, read it from the cache; otherwise, read it from main memory.

Writing is a bit more tricky. Writing to the cache is fast, but now main memory still holds the original value. We can update that memory, but that takes a while, sometimes even longer than reading depending on the memory type and board layout. How do we minimize this as well?

The most common way to do so is with a write-back cache, which holds the newly written data in the cache and flushes it back to main memory at some later point, for example when the CPU is idle or the cache line has to be evicted. Depending on the CPU architecture, this write-back could happen during idle periods, be interleaved with CPU instructions, or run on a timer (this is up to the designer of the CPU).
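
To make the mechanism concrete, here is a toy, direct-mapped write-back cache modeled in C. This is only a sketch of the idea, not how real hardware is built - the line count, line size, and memory size are made-up values for the example:

```c
/* Toy model of one direct-mapped, write-back cache (not real hardware). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_LINES 256
#define LINE_SIZE 64                     /* bytes per cache line (assumed) */

struct cache_line {
    bool     valid;
    bool     dirty;                      /* modified in cache, not yet in RAM */
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
};

static struct cache_line cache[NUM_LINES];
static uint8_t ram[1 << 20];             /* pretend main memory */

/* Write a dirty line back to "RAM" - the slow, deferred part. */
static void flush_line(struct cache_line *line, size_t index)
{
    if (line->valid && line->dirty) {
        size_t addr = (line->tag * NUM_LINES + index) * LINE_SIZE;
        memcpy(&ram[addr], line->data, LINE_SIZE);
        line->dirty = false;
    }
}

/* Read one byte; on a miss, fill the line from RAM first. */
static uint8_t cache_read(size_t addr)
{
    size_t index = (addr / LINE_SIZE) % NUM_LINES;
    uint64_t tag = addr / LINE_SIZE / NUM_LINES;
    struct cache_line *line = &cache[index];

    if (!line->valid || line->tag != tag) {              /* miss */
        flush_line(line, index);                         /* evict old data first */
        memcpy(line->data, &ram[addr & ~(size_t)(LINE_SIZE - 1)], LINE_SIZE);
        line->tag = tag;
        line->valid = true;
        line->dirty = false;
    }
    return line->data[addr % LINE_SIZE];                 /* fast hit path */
}

/* Write one byte; only the cache is updated, RAM is updated later. */
static void cache_write(size_t addr, uint8_t value)
{
    (void)cache_read(addr);                              /* make sure the line is present */
    size_t index = (addr / LINE_SIZE) % NUM_LINES;
    cache[index].data[addr % LINE_SIZE] = value;
    cache[index].dirty = true;                           /* RAM is now stale */
}

int main(void)
{
    cache_write(1000, 42);
    printf("cache sees %d, RAM still holds %d (write-back hasn't happened yet)\n",
           cache_read(1000), ram[1000]);
    return 0;
}
```

Running it prints that the cache sees the new value while the model's RAM still holds the old one - exactly the gap discussed below.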

Why is this a problem?

In a single-core system, there is only one path for reads and writes to take - they must go through the cache on their way to main memory - so the programs running on the CPU only see what they expect: if they read a value, modify it, then read it back, they see the modified value.

In a multi-core system, however, there are multiple paths for data to take on its way back to main memory, depending on which CPU issued the read or write. This presents a problem with write-back caching, since that "later time" creates a window in which one CPU might read memory that hasn't yet been updated.

Imagine a dual-core system. A job starts on CPU 0 and reads a memory block. Since the block isn't in CPU 0's cache, it's read from main memory. Later, the job writes to that memory. Since the cache is write-back, that write lands in CPU 0's cache and is flushed back to main memory later. If CPU 1 then attempts to read that same memory, it goes to main memory again, since the block isn't in CPU 1's cache. But the modification from CPU 0 hasn't left CPU 0's cache yet, so the data CPU 1 gets back is stale - the modification hasn't gone through yet. The program could now break in subtle, unpredictable, and potentially devastating ways.

Because of this, cache synchronization is done to alleviate the problem. Hardware mechanisms such as bus snooping and coherency directories exist to keep the caches of multiple CPUs consistent. All of these methods have one common cost - they take time away from actual, useful computation.
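
You can get a rough feel for this cost from user space by timing two threads that increment counters sharing one cache line against counters padded onto separate lines (false sharing). This is only an illustrative micro-benchmark - the 64-byte line size is an assumption, and the exact numbers vary wildly between machines:

```c
/* Build with something like: gcc -O2 -pthread coherency_cost.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL   /* iterations per thread */

/* Two counters packed into the same cache line... */
static struct { volatile unsigned long a, b; } same_line;

/* ...and two counters padded onto separate lines (64-byte lines assumed). */
static _Alignas(64) struct {
    volatile unsigned long v;
    char pad[64 - sizeof(unsigned long)];
} separate[2];

static void *bump_same_a(void *arg) { for (unsigned long i = 0; i < ITERS; i++) same_line.a++; return arg; }
static void *bump_same_b(void *arg) { for (unsigned long i = 0; i < ITERS; i++) same_line.b++; return arg; }
static void *bump_sep_0(void *arg)  { for (unsigned long i = 0; i < ITERS; i++) separate[0].v++; return arg; }
static void *bump_sep_1(void *arg)  { for (unsigned long i = 0; i < ITERS; i++) separate[1].v++; return arg; }

/* Run two threads to completion and return the elapsed wall-clock time. */
static double run_pair(void *(*f0)(void *), void *(*f1)(void *))
{
    struct timespec t0, t1;
    pthread_t th0, th1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&th0, NULL, f0, NULL);
    pthread_create(&th1, NULL, f1, NULL);
    pthread_join(th0, NULL);
    pthread_join(th1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    /* The "same cache line" case forces the coherency hardware to bounce
     * the line between the two cores on almost every increment. */
    printf("same cache line:      %.2f s\n", run_pair(bump_same_a, bump_same_b));
    printf("separate cache lines: %.2f s\n", run_pair(bump_sep_0, bump_sep_1));
    return 0;
}
```

On most multi-core machines the "same cache line" run is noticeably slower, and that extra time is coherency traffic rather than useful work.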

The best way to avoid this overhead is to keep each process on one processor as much as possible. If a process doesn't migrate between CPUs, you don't need to keep the caches synchronized for it, as the other CPUs won't be accessing its memory at the same time (unless the memory is shared between multiple processes, but we'll not go into that here).
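
As a concrete example of keeping work on one core, on Linux a process can pin itself to a single CPU with sched_setaffinity(2) - roughly the effect a scheduler tries to get implicitly when it keeps a process on the CPU it last ran on. A minimal sketch:

```c
/* Pin the calling process to CPU 0 (Linux-specific). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                     /* allow this process to run on CPU 0 only */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* pid 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }

    printf("now running only on CPU %d\n", sched_getcpu());
    /* ... do cache-friendly work here; the kernel will not migrate this process ... */
    return 0;
}
```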

Now we come to the issue of how to design our scheduler, and the three main problems there - avoiding process migration, maximizing CPU utilization, and scalability.

Single-queue Multiprocessor Scheduling (SQMS)

Single-queue multiprocessor schedulers are what you suggested - one queue containing all runnable processes, with each core pulling its next job from that queue. This is fairly simple to implement, but it has a couple of major drawbacks: it can cause a whole lot of process migration, and it does not scale well to larger systems with more cores.

Imagine a system with four cores and five jobs, each of which takes about the same amount of time to run, and each of which is rescheduled when it completes. On the first pass, CPU 0 takes job A, CPU 1 takes B, CPU 2 takes C, and CPU 3 takes D, while E is left on the queue. Say CPU 0 finishes job A, puts it on the back of the shared queue, and looks for another job to do. E is currently at the front of the queue, so CPU 0 takes E and carries on. Now CPU 1 finishes job B, puts B on the back of the queue, and looks for the next job. It sees A and starts running A. But since A was previously on CPU 0, CPU 1 now needs to sync its cache with CPU 0, costing time on both CPU 0 and CPU 1.

In addition, if two CPUs finish their jobs at the same time, they both need to write to the shared list, which has to be done sequentially or the list will get corrupted (just like in multi-threading). This means one of the two CPUs has to wait for the other to finish its update and sync its cache back to main memory, since the list lives in shared memory. The problem gets worse and worse the more CPUs you add, resulting in major problems on large servers (where there can be 16, 32, or more CPU cores), and being completely unusable on supercomputers (some of which have upwards of 1000 cores).
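
Here is a minimal sketch of the SQMS idea, with one mutex-protected run queue that every worker thread (standing in for a CPU) pulls from. The job structure, job count, and "CPU" count are made up for the example, and finished jobs are simply discarded instead of being rescheduled:

```c
/* Minimal single-queue (SQMS) sketch: every "CPU" (a thread here) pops its
 * next job from one shared, lock-protected run queue.
 * Build with something like: gcc -O2 -pthread sqms.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_CPUS 4
#define NUM_JOBS 20

struct job { int id; struct job *next; };

static struct job *head, *tail;                  /* the single shared run queue */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void enqueue(struct job *j)
{
    pthread_mutex_lock(&queue_lock);             /* every CPU contends on this lock */
    j->next = NULL;
    if (tail) tail->next = j; else head = j;
    tail = j;
    pthread_mutex_unlock(&queue_lock);
}

static struct job *dequeue(void)
{
    pthread_mutex_lock(&queue_lock);
    struct job *j = head;
    if (j) { head = j->next; if (!head) tail = NULL; }
    pthread_mutex_unlock(&queue_lock);
    return j;
}

static void *cpu_loop(void *arg)
{
    long cpu = (long)arg;
    struct job *j;
    while ((j = dequeue()) != NULL) {            /* whichever CPU asks first gets the job */
        printf("CPU %ld runs job %d\n", cpu, j->id);
        usleep(1000);                            /* pretend to run it for a while */
        free(j);
    }
    return NULL;
}

int main(void)
{
    pthread_t cpus[NUM_CPUS];

    for (int i = 0; i < NUM_JOBS; i++) {
        struct job *j = malloc(sizeof(*j));
        j->id = i;
        enqueue(j);
    }
    for (long i = 0; i < NUM_CPUS; i++)
        pthread_create(&cpus[i], NULL, cpu_loop, (void *)i);
    for (int i = 0; i < NUM_CPUS; i++)
        pthread_join(cpus[i], NULL);
    return 0;
}
```

Note that every enqueue and dequeue goes through the same lock - exactly the contention point described above.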

Multi-queue Multiprocessor Scheduling (MQMS)

Multi-queue multiprocessor schedulers have a single queue per CPU core, ensuring that all local core scheduling can be done without having to take a shared lock or synchronize the cache. This allows for systems with hundreds of cores to operate without interfering with one another at every scheduling interval, which can happen hundreds of times a second.

The main issues with MQMS are CPU utilization, where one or a few CPU cores end up doing the majority of the work, and scheduling fairness, where one process on the machine is scheduled more often than other processes of the same priority.

CPU utilization is the bigger issue - no CPU should ever sit idle while a job is waiting to run. However, if all CPUs are busy when a job arrives, it has to be queued on one of them; if a different CPU later becomes idle, that CPU should "steal" the queued job from the original CPU so that every CPU is doing real work. Doing so, however, requires locking both CPUs' queues and potentially syncing their caches, which can eat into whatever speedup we gain by stealing the job.
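
And here is a matching sketch of the MQMS idea under the same toy job model: one queue and one lock per CPU, with an idle CPU stealing from another queue. The stealing policy - just scanning the other queues in order - is an arbitrary choice for the example:

```c
/* Minimal multi-queue (MQMS) sketch: one run queue and one lock per "CPU",
 * with an idle CPU stealing work from another CPU's queue.
 * Build with something like: gcc -O2 -pthread mqms.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_CPUS 4
#define NUM_JOBS 20

struct job { int id; struct job *next; };

/* Per-CPU queues: the common case only touches local state and a local lock. */
static struct job *queue[NUM_CPUS];
static pthread_mutex_t lock[NUM_CPUS];

static void push(int cpu, struct job *j)
{
    pthread_mutex_lock(&lock[cpu]);
    j->next = queue[cpu];
    queue[cpu] = j;
    pthread_mutex_unlock(&lock[cpu]);
}

static struct job *pop(int cpu)
{
    pthread_mutex_lock(&lock[cpu]);
    struct job *j = queue[cpu];
    if (j) queue[cpu] = j->next;
    pthread_mutex_unlock(&lock[cpu]);
    return j;
}

static void *cpu_loop(void *arg)
{
    int cpu = (int)(long)arg;
    for (;;) {
        struct job *j = pop(cpu);                 /* fast path: local queue, local lock */
        for (int v = 0; !j && v < NUM_CPUS; v++)  /* idle: try to steal from another CPU */
            if (v != cpu)
                j = pop(v);                       /* this is where cross-CPU cost appears */
        if (!j)
            return NULL;                          /* nothing left anywhere */
        printf("CPU %d runs job %d\n", cpu, j->id);
        usleep(1000);                             /* pretend to run it for a while */
        free(j);
    }
}

int main(void)
{
    pthread_t cpus[NUM_CPUS];

    for (int i = 0; i < NUM_CPUS; i++)
        pthread_mutex_init(&lock[i], NULL);

    /* Deliberately unbalanced: every job starts on CPU 0's queue, so the
     * other CPUs can only stay busy by stealing. */
    for (int i = 0; i < NUM_JOBS; i++) {
        struct job *j = malloc(sizeof(*j));
        j->id = i;
        push(0, j);
    }
    for (long i = 0; i < NUM_CPUS; i++)
        pthread_create(&cpus[i], NULL, cpu_loop, (void *)i);
    for (int i = 0; i < NUM_CPUS; i++)
        pthread_join(cpus[i], NULL);
    return 0;
}
```

In the common case a CPU only touches its own queue and lock; the cross-CPU cost shows up only on the stealing path.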

In conclusion

Both methods exist in the wild - Linux actually has three different mainstream scheduler algorithms, one of which is an SQMS. Which scheduler is best really depends on how it is implemented, the hardware you plan to run it on, and the kinds of jobs you intend to run. If you know you only have two or four cores to run jobs on, SQMS is likely perfectly adequate. If you're running a supercomputer where overhead is a major concern, then an MQMS is probably the way to go. For a desktop user - just trust your OS, whether that's a Linux distro, macOS, or Windows. Generally, the people who wrote the operating system you're using have done their homework on which scheduler is the best option for its typical use case.

This whitepaper describes the differences between the two types of scheduling algorithms in more detail.

Alex
  • Multi-core systems have coherent cache between their CPU cores. All mainstream systems use some variant of MESI to *maintain* cache coherency by getting exclusive ownership of a cache line before committing store data into their private cache. Not just write and then sync later. [Does a memory barrier ensure that the cache coherence has been completed?](https://stackoverflow.com/q/42746793) CPUs have a store-buffer internally for stores waiting to commit, but that's always draining itself ASAP. / [Can I force cache coherency on a multicore x86 CPU?](https://stackoverflow.com/a/558888) – Peter Cordes Oct 12 '22 at 20:07
  • See also [What happens when different CPU cores write to the same RAM address without synchronization?](https://stackoverflow.com/q/48817022) . No manual flushing is needed to eventually see stores from other cores, even on weakly-ordered systems. But if you need ordering, then memory barriers can delay your later loads/stores until your own earlier stores have become globally visible. So that's an additional expense, on top of waiting for a cache line that was dirty in another core's private L1d. – Peter Cordes Oct 12 '22 at 20:16
  • True, but ultimately irrelevant - whether the CPU uses MESI, a traditional write-back cache, or some other mechanism for cache synchronization, the problem still remains - the CPU must take some time away from user code to run the cache coherency logic. This time may be smaller on modern PCs, the logic may be entirely implemented by the hardware - but it's still there, and that's all that matters. For the purposes of this discussion, we need things to be ordered, as we're the kernel and need to keep all our thread structures consistent, no matter what CPU executes them. – Alex Oct 12 '22 at 20:18
  • Data gets between cores usually via a shared L3 cache, or cache-to-cache transfers. Not usually writing back to DRAM and getting re-read; that would be a lot slower than necessary so modern CPUs avoid it. The need for this is normally detected by a coherency directory, e.g. in Intel desktop CPUs by L3 tags which also record which core (might) have a modified copy of the data, so it can be asked if another core wants to read or write. – Peter Cordes Oct 12 '22 at 20:18
  • Yes, contention for access to shared memory takes time, and even big out-of-order CPU cores can't hide the full latency of cache miss loads that have to get data from another core. Agreed that these errors in explaining how shared memory works in SMP systems don't invalidate the parts of your answer that explain the advantages of different scheduling strategies, but would be nice to fix. There's enough misinformation and misconceptions about CPU caches out there; e.g. https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ tries to debunk some. – Peter Cordes Oct 12 '22 at 20:21
  • Then write an answer if it's really that important to you. But ultimately, it's a nitpick more than anything else. The question was asking about scheduling, and a full explanation about how modern SMP systems do cache coherency isn't really relevant to the discussion - the fact that it's there at all is the problem, and the reason why we have scheduling algorithms in the first place. Otherwise, we could just treat everything like a set of work queues in priority order and just run whatever was first. – Alex Oct 12 '22 at 20:26
  • (If we're talking about supercomputers with thousands of cores, that can be one of the rare cases of having non-coherent shared memory. But that wouldn't be a single system image; you'd probably have a job scheduler like GridEngine that just decides what tasks to start, but scheduling context switches or not every few milliseconds usually only happens within each cache-coherent subset of cores. A huge non-coherent cluster isn't a single system image that you'd run a single instance of Linux across.) – Peter Cordes Oct 12 '22 at 20:26
  • I've already written answers about cache coherency, and linked one of them in comments earlier. They wouldn't be answers to this question, though, so I'm not going to post them here. It's your answer that's problematic, spreading misinformation about how CPUs work in the first half, before getting to the good part where you talk about different scheduling algorithms requiring different amounts of inter-core communication. – Peter Cordes Oct 12 '22 at 20:29
  • Your answer would be equally good if not better if you just said that inter-core communication is expensive because of the required cache coherence. As you say, an explanation of the details isn't relevant to this question. – Peter Cordes Oct 12 '22 at 20:30
  • If you think the answer is incomplete, or otherwise inaccurate, feel free to submit your own. I'm not interested in editing a two year old answer to a question to be hyper-accurate to modern systems, as historical systems that did have the aforementioned issues were the whole reason advanced scheduling algorithms needed to be created in the first place. Modern hardware fixed issues that existed in the past, but those past issues are why these algorithms and their intricacies exist to begin with. Even if the information is no longer fully accurate, there's good reason to know the history. – Alex Oct 12 '22 at 20:39