
I've read in multiple places that Linux's default scheduler is hyperthreading-aware on multi-core machines, meaning that if you have a machine with 2 real cores (4 HT), it won't schedule two busy threads onto logical cores in such a way that they both run on the same physical core (which would lead to a 2x performance cost in many cases).

But when I run stress -c 2 (which spawns two threads that run at 100% CPU) on my Intel i5-2520M, it often schedules (and keeps) the two threads on HT cores 1 and 2, which map to the same physical core, even when the system is otherwise idle.

This also happens with real programs (I'm using stress here because it makes it easy to reproduce); when that happens, my program understandably takes twice as long to run. Setting the affinity manually with taskset fixes that for my program, but I'd expect an HT-aware scheduler to do that correctly by itself.
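
For reference, the workaround I use looks roughly like this; it's only a sketch that assumes the "paired" mapping on my i5 (logical CPUs 0 and 1 on one physical core, 2 and 3 on the other), and the PID in the second command is just a placeholder:

taskset -c 0,2 stress -c 2     # allow only one logical CPU per physical core
taskset -cp 0,2 12345          # or change the affinity of an already running process (placeholder PID)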

You can find the HT->physical core assignment with egrep "processor|physical id|core id" /proc/cpuinfo | sed 's/^processor/\nprocessor/g'.
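
On my i5 the mapping is "paired", i.e. the output has roughly the following shape (abridged and illustrative; the point is that processors 0 and 1 share core id 0, while 2 and 3 share core id 1):

processor       : 0
physical id     : 0
core id         : 0

processor       : 1
physical id     : 0
core id         : 0

processor       : 2
physical id     : 0
core id         : 1

processor       : 3
physical id     : 0
core id         : 1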

So my question is: Why does the scheduler put my threads onto the same physical core here?


Notes:

  • This question is very similar to this other question, the answers to which say that Linux has quite a sophisticated, HT-aware thread scheduler. As described above, I cannot observe this behaviour (check for yourself with stress -c; see the sketch after this list), and would like to know why.
  • I know that I can set processor affinity manually for my programs, e.g. with the taskset tool or with the sched_setaffinity function. This is not what I'm looking for; I would expect the scheduler to know by itself that mapping two busy threads onto one physical core and leaving the other physical core completely empty is not a good idea.
  • I'm aware that there are situations in which you would prefer threads to be scheduled onto the same physical core and leave the other core free, but it seems nonsensical that the scheduler would do that in roughly 1/4 of the cases. It seems to me that the HT cores it picks are completely random, or maybe the HT cores that had the least activity at the time of scheduling, but that wouldn't be very hyperthreading-aware, given how clearly programs with the characteristics of stress benefit from running on separate physical cores.
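
For the "check for yourself" part above, this is roughly what I do to see where the workers end up (stress's --timeout flag and ps's psr field are standard, but the CPU numbers you get will of course vary):

stress -c 2 --timeout 30 &    # two busy workers for 30 seconds
sleep 5                       # give the scheduler a moment to settle
ps -C stress -o pid=,psr=     # psr = logical CPU each stress process last ran on (includes the idle parent)
# seeing psr 0 and 1 for the two workers means both are on the same physical core here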
nh2
  • Which distro and version are you referring to? – Mark Setchell Apr 02 '15 at 20:53
  • Try running stress in two processes with one thread each. I haven't looked into the specifics of the Linux scheduler (which may have even changed since the last time I was researching it). It's possible that the kernel prefers to schedule threads in the same process on the same physical processor for reasons like cache-locality. – joshperry Apr 02 '15 at 20:56
  • Ubuntu 14.04, Linux 3.13.0. – nh2 Apr 02 '15 at 20:57
  • @joshperry "It's possible that the kernel prefers to schedule threads in the same process on the same physical processor" <- it does both, sometimes it schedules them as I expect sometimes, the other way, it seems random and not be biased to either. – nh2 Apr 02 '15 at 20:58
  • Linux tracks HT threads (and last-level caches and NUMA nodes) via scheduler domains. Please show what `awk '/^domain/ { print $1, $2; } /^cpu/ { print $1; }' /proc/schedstat` prints, it will show CPU masks for scheduler domains. – myaut Apr 02 '15 at 22:25
  • Good info @myaut! Very interesting reading. https://lwn.net/Articles/80911/ – joshperry Apr 02 '15 at 22:36
  • Still doesn't answer why the OP has problems with the scheduler. @nh2: Since we can't reproduce your behaviour, it seems like a problem with your system (not sure if it is a bug). I think that monitoring `/proc/schedstat` will help: can you collect it, run `stress` for a couple of minutes, and collect a new snapshot of `/proc/schedstat`? It may help to reveal whether balancing fails. – myaut Apr 02 '15 at 23:12
  • @nh2: I wrote a Python script that monitors scheduler statistics: [schedstat.py](https://gist.github.com/myaut/11a656ce7801518c99ce). When I run `stress` I observe that the balancer starts on idle CPUs and tries to steal `stress` threads (`.domain0.CPU_IDLE.lb_count` grows). – myaut Apr 03 '15 at 08:02
  • @myaut Here's the output of the CPU masks: https://gist.github.com/nh2/b396d83b942458d3691a. On my i7 and Xeon machines, the scheduler does well, but on the i5 I see the problem. – nh2 Apr 03 '15 at 10:10
  • @myaut For `schedstat.py`, it produces quite a lot of output. Can you elaborate a bit more on what exactly I should be looking for? Thanks for the effort, by the way. – nh2 Apr 03 '15 at 10:11
  • @nh2: From CPU masks I can see that processor 1 is related to core id 0 while processor 2 is related to core id 1. Can you clarify, what you meant by saying "the two threads onto HT cores 1 and 2, which map to the same physical core."? – myaut Apr 03 '15 at 10:24
  • @myaut: I was just numbering them starting from 1. Counting from 0 as you do, I meant *the two threads onto HT cores 0 and 1, which map to the same physical core*. – nh2 Apr 04 '15 at 01:33
  • @nh2: your schedstat shows that Linux correctly recognized the cores, as it created two domains for them (one with mask 0x3 = 0011 = CPUs 0,1 and one with mask 0xc = 1100 = CPUs 2,3). Speaking of _schedstat_, we look for large numbers in the `lb_*` parameters. Anyway, I don't think this question is for SO anymore; maybe you should collect `mpstat -P ALL` and present your findings on [LKML](http://en.wikipedia.org/wiki/Linux_kernel_mailing_list). – myaut Apr 06 '15 at 12:26
  • I see the exact same behavior as the OP on Ubuntu 14.04 and 15.10, but I didn't see this behavior with Ubuntu 12.04. This is on a system with an E3-1240 v3. – Greg Glockner Apr 13 '16 at 17:29
  • A paper was made public about an unpleasant bug in the scheduler code: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf The problem seems to be related. – myaut Apr 15 '16 at 11:25
  • Are you sure lcores 0 and 1 (0-indexed) are on the same physical core? I ask because on all 3 systems I've tried (i7, Xeon, and Xeon), they're not. On my 6-core i7 system, 0 and 6, 1 and 7, ..., 5 and 11 resp. are on the same physical core. Here's a one-liner to test: `cat /proc/cpuinfo | grep -e "processor" -e "core id" -e "physical id"` Edit: Oh, I see you said your i5 has the paired mapping. – sudo Feb 04 '17 at 01:38

3 Answers


I think it's time to summarize some knowledge from comments.

The Linux scheduler is aware of HyperThreading -- information about the topology is read from the ACPI SRAT/SLIT tables, which are provided by the BIOS/UEFI -- and Linux then builds its scheduler domains from that.

Domains form a hierarchy -- e.g. on a 2-CPU server you will get three layers of domains: all-CPUs, per-CPU-package, and per-CPU-core. You can check this in /proc/schedstat:

$ awk '/^domain/ { print $1, $2; } /^cpu/ { print $1; }' /proc/schedstat
cpu0
domain0 0000,00001001     <-- all cpus from core 0
domain1 0000,00555555     <-- all cpus from package 0
domain2 0000,00ffffff     <-- all cpus in the system
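
The topology these domains are built from can also be cross-checked in sysfs; here is an illustrative run for a 2-core/4-thread CPU with the "paired" layout (your sibling lists may be formatted as "0,2" etc. instead):

$ grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0-1
/sys/devices/system/cpu/cpu1/topology/thread_siblings_list:0-1
/sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2-3
/sys/devices/system/cpu/cpu3/topology/thread_siblings_list:2-3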

Part of the CFS scheduler is the load balancer -- the beast that is supposed to steal tasks from your busy core and move them to another core. Here is its description from the kernel documentation:

While doing that, it checks to see if the current domain has exhausted its rebalance interval. If so, it runs load_balance() on that domain. It then checks the parent sched_domain (if it exists), and the parent of the parent and so forth.

Initially, load_balance() finds the busiest group in the current sched domain. If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in that group. If it manages to find such a runqueue, it locks both our initial CPU's runqueue and the newly found busiest one and starts moving tasks from it to our runqueue. The exact number of tasks amounts to an imbalance previously computed while iterating over this sched domain's groups.

From: https://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt

You can monitor the activity of the load balancer by comparing the numbers in /proc/schedstat. I wrote a script for doing that: schedstat.py
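
If you don't want to use the script, a minimal manual check is to snapshot the file around a run and diff the snapshots (I assume your stress supports --timeout; -t works as well):

cp /proc/schedstat schedstat.before
stress -c 2 --timeout 120
cp /proc/schedstat schedstat.after
diff schedstat.before schedstat.after    # growing lb_*/alb_* counters indicate balancer activity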

The alb_pushed counter shows that the load balancer successfully moved a task out:

Sun Apr 12 14:15:52 2015              cpu0    cpu1    ...    cpu6    cpu7    cpu8    cpu9    cpu10   ...
.domain1.alb_count                                    ...      1       1                       1  
.domain1.alb_pushed                                   ...      1       1                       1  
.domain2.alb_count                              1     ...                                         
.domain2.alb_pushed                             1     ...

However, the logic of the load balancer is complex, so it is hard to determine what reasons can stop it from doing its work well and how they are related to the schedstat counters. Neither I nor @thatotherguy can reproduce your issue.

I see two possibilities for that behavior:

  • You have some aggressive power-saving policy that tries to keep one core idle to reduce the power consumption of the CPU (see the sketch after this list).
  • You really encountered a bug in the scheduling subsystem, in which case you should go to LKML and carefully share your findings (including mpstat and schedstat data).
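
For the first possibility, a quick sketch of what to check (the sched_*_power_savings knobs only exist on older kernels, hence the error suppression; on 3.13 they may simply be absent):

# frequency governor for every logical CPU
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# explicit power-aware scheduling knobs on older kernels;
# if present, non-zero values bias the balancer towards packing tasks
cat /sys/devices/system/cpu/sched_mc_power_savings  2>/dev/null
cat /sys/devices/system/cpu/sched_smt_power_savings 2>/dev/null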
myaut

I'm unable to reproduce this on 3.13.0-48 with my Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz.

I have 6 cores with hyperthreading, where logical core N maps to physical core N mod 6.

Here's a typical output of top with stress -c 4 in two columns, so that each row is one physical core (I left out a few cores because my system is not idle):

%Cpu0  :100.0 us,   %Cpu6  :  0.0 us, 
%Cpu1  :100.0 us,   %Cpu7  :  0.0 us, 
%Cpu2  :  5.9 us,   %Cpu8  :  2.0 us, 
%Cpu3  :100.0 us,   %Cpu9  :  5.7 us, 
%Cpu4  :  3.9 us,   %Cpu10 :  3.8 us, 
%Cpu5  :  0.0 us,   %Cpu11 :100.0 us, 

Here it is after killing and restarting stress:

%Cpu0  :100.0 us,   %Cpu6  :  2.6 us, 
%Cpu1  :100.0 us,   %Cpu7  :  0.0 us, 
%Cpu2  :  0.0 us,   %Cpu8  :  0.0 us, 
%Cpu3  :  2.6 us,   %Cpu9  :  0.0 us, 
%Cpu4  :  0.0 us,   %Cpu10 :100.0 us, 
%Cpu5  :  2.6 us,   %Cpu11 :100.0 us, 

I did this several times, and did not see any instances where 4 threads across 12 logical cores would schedule on the same physical core.

With -c 6 I tend to get results like this, where Linux appears to be helpfully scheduling other processes on their own physical cores. Even so, they're distributed way better than chance:

%Cpu0  : 18.2 us,   %Cpu6  :  4.5 us, 
%Cpu1  :  0.0 us,   %Cpu7  :100.0 us, 
%Cpu2  :100.0 us,   %Cpu8  :100.0 us, 
%Cpu3  :100.0 us,   %Cpu9  :  0.0 us, 
%Cpu4  :100.0 us,   %Cpu10 :  0.0 us, 
%Cpu5  :100.0 us,   %Cpu11 :  0.0 us, 
that other guy
  • I just tested it on an Intel i7-2600 and on an Intel Xeon E5-1620, and indeed I get the good behaviour that you describe. But on my Intel i5-2520M, I get the bad scheduling behaviour. One thing I noticed as different is that on the i7 and the Xeon I have an `N mod 4` mapping (12341234), just as you describe, but on the i5 I have a "paired" mapping (1122). Might that be the difference? – nh2 Apr 03 '15 at 10:15

Given your experience with two additional processors that seemed to work correctly (the i7-2600 and the Xeon E5-1620): this could be a long shot, but how about a CPU microcode update? It could include something that fixes the problem if it's internal CPU behaviour.

Intel CPU Microcode Downloads: http://intel.ly/1aku6ak

Also see here: https://wiki.archlinux.org/index.php/Microcode
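
To see which microcode revision is currently loaded (the field is printed once per logical CPU, hence the sort -u):

grep microcode /proc/cpuinfo | sort -u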

n3rd_dude
  • This has nothing to do with the OP's problem; the information about the core/processor relationship is contained in the ACPI SRAT/SLIT tables, which are provided by the BIOS/UEFI. – myaut Apr 12 '15 at 07:40