
When talking about multi-threading, it often seems like threads are treated as equal: just like the main thread, but running alongside it.

On some newer processors, however, such as the Apple "M" series and the upcoming Intel Alder Lake series, not all threads are equally performant: these chips feature separate high-performance cores and slower, high-efficiency cores.

That's not to say such asymmetry is entirely new (hyper-threading already meant logical cores weren't equal), but this seems to have much larger performance implications.

Is there a way in C++ to query a std::thread's properties and control which cores it will run on?

janekb04
  • With the std alone it is not possible. You can ask for the native handle using `native_handle` and, depending on the type returned, use the corresponding library to query this information (given that this library exposes it). But there is neither a generic nor a platform-independent solution for that. – t.niese Jul 19 '21 at 17:14
  • Threads aren't bound to a chip. The OS moves the threads back and forth as needed. – Mooing Duck Jul 19 '21 at 17:14
  • @MooingDuck For the M1 and macOS it is AFAIK possible to ask the OS to run the thread preferably on a high-efficiency core. And you could theoretically (depending on the OS and CPU) lock a process/thread to a single CPU, which is often done on servers for virtual hosts. – t.niese Jul 19 '21 at 17:16
  • You are going to need to use an OS API to dedicate a thread to a specific core. There are no guarantees that threads will run on different cores or execute exclusively on a core. Threads can be run on a single core (in a multi-core system), just like other tasks. – Thomas Matthews Jul 19 '21 at 17:33
  • One existing hardware development that could have similar issues is NUMA. The way NUMA-aware code has to handle thread allocation to different processors (to best take advantage of the different memory access speeds) could be insightful. – 1201ProgramAlarm Jul 19 '21 at 19:16
  • You know, the kernel was invented to handle multitasking. If you think that you can handle tasks better than the kernel, you are wrong. You can use some OS-level APIs to control your threads; on Linux they are the `pthread` APIs. – Yves Jul 20 '21 at 03:12
  • Does this answer your question? [How to bind a process to only physical cores in a cross system way?](https://stackoverflow.com/questions/61483639/how-to-bind-a-process-to-only-physical-cores-in-a-cross-system-way) – user2284570 Jul 20 '21 at 23:05
  • An alternative is to use [`cpuset`](https://man7.org/linux/man-pages/man7/cpuset.7.html) to bind a process to a set of CPUs. This does not support multi-threading, but presumably your processes are cheap enough that moving from thread to process is fine. – Akshat Mahajan Jul 21 '21 at 16:09
  • Maybe you should declare the intent of your threads to the OS and let it decide which CPU to run them on. If every random app starts requesting the fastest CPU, it may cause battery drain. – Maxim Egorushkin Aug 07 '21 at 18:49

6 Answers


How to distinguish between high- and low-performance cores/threads in C++?

Please understand that "thread" is an abstraction of the hardware's capabilities and that something beyond your control (the OS, the kernel's scheduler) is responsible for creating and managing this abstraction. "Importance" and performance hints are part of that abstraction (typically presented in the form of a thread priority).

Any attempt to break the "thread" abstraction (e.g. determining whether a core is a low-performance or high-performance core) is misguided. For example, the OS could move your thread to a low-performance core immediately after you find out that you were running on a high-performance core, leading you to assume that you're on a high-performance core when you are not.

Even pinning your thread to a specific core (in the hope that it'll always be using a high-performance core) can and will backfire, causing you to get less work done because you've prevented yourself from using a "faster than nothing" low-performance core while the high-performance cores are busy doing other work.

The biggest problem is that C++ creates a worse abstraction (std::thread) on top of the "likely better" abstraction provided by the OS. Specifically, there's no way to set, modify or obtain the thread priority using std::thread; so you're left without any control over the "performance hints" that are necessary (for the OS, scheduler) to make good "load vs. performance vs. power management" decisions.
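
For example, here is a minimal sketch (POSIX only, assuming a pthreads-based std::thread such as libstdc++ or libc++) of the only escape hatch the standard offers: native_handle() hands back the underlying pthread_t, which the platform's own priority API can act on.

```cpp
// Minimal sketch, POSIX only; error handling omitted for brevity.
#include <thread>
#include <pthread.h>
#include <sched.h>

int main() {
    std::thread worker([] { /* ... CPU-bound work ... */ });

    // std::thread has no priority API, but the native handle does.
    sched_param param{};
    param.sched_priority = 0;  // SCHED_OTHER only permits priority 0 on Linux
    pthread_setschedparam(worker.native_handle(), SCHED_OTHER, &param);

    worker.join();
}
```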

When talking about multi-threading, it often seems like threads are treated as equal

Often people think we're still using time-sharing systems from the 1960s. Stop listening to these fools. Modern systems do not allow CPU time to be wasted on unimportant work while more important work waits. Effective use of thread priorities is a fundamental performance requirement. Everything else ("load vs. performance vs. power management" decisions) is, by necessity, beyond your control (on the other side of the "thread" abstraction you're using).

Brendan
  • Yup, TL;DR: you usually don't need to distinguish. The OS will already migrate your compute-intensive threads to high-performance cores if they don't start there, provided enough of them are free. You might want to *verify* that the OS is in fact doing this by using low-level APIs, especially if your workload is bursty (and thus maybe not simple for the scheduler), or if, like x264, you typically start more threads than there are logical cores, to keep cores busy when one thread temporarily runs out of work to do. – Peter Cordes Jul 20 '21 at 03:20
  • Hmm, I guess the decision of how many threads to start could depend on the mix of cores, and how much it affects your parallelization strategy to have some of the threads run slower. That's not a thread->core question, though, just enumerating cores by type, and finding their relative performance for your workload. – Peter Cordes Jul 20 '21 at 03:22
  • If the high-performance cores require more nanojoules of energy for each unit of work completed, migrating CPU-intensive jobs to the higher-performance cores may allow the jobs to be done sooner, but at the cost of increased battery consumption. Which course of action is better for the user may depend upon how long it will be before the user has access to a charging facility, something the user may know, and might be able to tell an application, but the OS probably wouldn't know. – supercat Jul 20 '21 at 05:26
  • Given how complex the task and system are, I'd define a successful optimization as losing less than 20% performance. – Eric Duminil Jul 20 '21 at 05:32
  • @supercat: It's more complex than that (e.g. tasks dealing with user interfaces need latency not throughput, idle CPU/s take longer to wake up, load from other processes isn't known by any one process, some work can be "pre-done" in the background to make performance sensitive work faster, sometimes "power management" means temperature or fan noise management, hyper-threading creates additional "slow core by itself vs. share a fast core" compromises, ...). I'd also assume user would rather tell OS how long before charging (than each app or service), & OS can predict from past usage patterns. – Brendan Jul 20 '21 at 05:59
  • With M1 chips, which have low and high performance cores and no hyperthreading, it's usually best to use exactly as many threads as cores, and start another thread when one finishes. The OS moves threads between cores to make sure each one can do the same amount of work. – gnasher729 Jul 20 '21 at 10:43
  • "you're left without any control over the "performance hints" that are necessary" - not quite true. It wasn't standardized because of how much hints vary per system, but you totally can use system-specific calls with the help of `std::thread::native_handle()`. – val - disappointed in SE Jul 20 '21 at 14:52
  • _Stop listening to these fools._ => Could we avoid hyperbole here? The OS has _generic heuristics_ which are generally good, but in special cases they are insufficient or inconvenient. Typical examples are use cases where latency matters, in which case reserving exclusive use of a certain number of cores and pinning threads to those cores works much better than "hoping" that the OS will let you reach your target. And yes, this requires reaching beyond C++; the standard offers nothing there. – Matthieu M. Jul 20 '21 at 17:52
  • +1 If time is that critical, you probably need an RTOS. Trying to outsmart the OS scheduler is a fool's errand. – J... Jul 20 '21 at 18:55
  • "Wasting time on unimportant work while more important work waits" is certainly possible in modern systems, if the programmer isn't careful... google "priority inversion" for details. – Jeremy Friesner Jul 20 '21 at 21:56
  • @gnasher729: "Use as many threads as there are CPUs" is ideal for embarrassingly parallel work. Almost nothing is embarrassingly parallel though (especially on smartphones and desktop systems). Often it's better/easier to divide work along logical boundaries (e.g. maybe a high priority thread for user interface; a "higher or lower depending on if the tab is active" priority thread for each browser tab, a few low priority threads for pre-fetching & checking for updates/changes, ...). If most threads spend most of their time waiting the number of CPUs doesn't matter much. – Brendan Jul 20 '21 at 23:32
  • @valisstillwithMonica: Yes. It's like "here's a nice `std::thread` abstraction you can use to write nice clean portable code and avoid the platform specific tar-pit" and then you take a look and realize it's just a superficial layer of paint hiding a `std::thread::native_handle()` diving board to make it slightly more annoying to dive head first into the platform specific tar-pit. – Brendan Jul 20 '21 at 23:41
  • "Modern systems do not allow CPU time to be wasted on unimportant work while more important work waits" is an overstatement of the progress made. It's all too easy, and common, to end up with CPU time awfully wasted waiting for some incoming data (key press, character on a serial port), or the next second. But it still typically requires programming skills to make something real-time and CPU-savvy, unless there's a framework that did this hard work. That _should_ be done, and _can_ be done with the right techniques. – fgrieu Jul 21 '21 at 07:51
  • "Almost nothing is embarrassingly parallel though." I would disagree. I think these cases just have less visibility because they are solved trivially, but I encounter embarrassingly parallel cases all the time, particularly with data parallelism. I do work in bioinformatics and so the most obvious and pervasive one is running analyses per chromosome, but I've seen it in plenty of non-genomics problems as well. And we do still work on shared systems (And yes, individual nodes can be shared)... they didn't disappear after the 60s for scientists just because they did for personal computing. – ttbek Jul 21 '21 at 20:37
  • @ttbek: I'm writing this comment using Chrome (an eclectic mixture of threads) while the OS (Windows) has about 60 processes waiting for things to happen. It could be over 200 threads with none embarrassingly parallel. Soon I'll start Eclipse IDE (another eclectic mixture of threads) and build my project (make, mingw). Eventually I'll take a break and play a game. None of the games I've ever had are embarrassingly parallel (excluding work done by GPU and not threads). It's hard to guess how many years it's been since any CPUs in any of my computers have done anything embarrassingly parallel. – Brendan Jul 22 '21 at 02:22
  • @ttbek: I'd guess almost all normal people/users are like this (the entirety of smartphone, laptop, workstation and most server); and it's getting worse (things that used to use "embarrassingly parallel threads" are shifting/have shifted to GPGPU instead). It's like "embarrassingly parallel threads" almost doesn't exist outside of some rare/specialized niches (e.g. HPC where you have multiple computers/nodes working on the same problem), which just happens to include your rare/specialized niche. – Brendan Jul 22 '21 at 02:33
  • @Brendan I'd say this is mostly a distinction between consumers and server systems. – Voo Jul 22 '21 at 15:11
  • @Voo: Hrrm - what kind of server (HTTP, mysql, lots of little virtual machines doing different things, ...)? – Brendan Jul 22 '21 at 18:20
  • Naming some things that use some different threads does not imply embarrassingly parallel cases don't exist or are even rare. Counting your browser tabs isn't that compelling either, as all the render threads are doing the same thing on different data. Just because you occasionally have a different thread because you have a video in one tab or something like that... just because you have task parallelism doesn't mean there aren't also a lot of embarrassingly parallel tasks. I don't deny that more and more of this is going/has gone to GPU, but I don't know why you exclude those... – ttbek Aug 14 '21 at 14:14
  • Because those all started on CPU and almost every time there is something new, it starts that way as well. To start writing these things with GPU code would be some seriously premature optimization. Typically you write your first thread, then you make it parallel in the easiest way to get the easy performance gains, and then you really do think if it is something you really think is worth moving to CUDA or OpenCL, etc... because GPU code is much less generic. Even in 2011 hardware accelerated (GPU) browsing was frequently broken (on Linux) and CPU rendering 1080p YouTube wasn't fun. – ttbek Aug 14 '21 at 14:21
  • @ttbek: "Everything embarrassingly parallel all the time" (and not just one thing embarrassingly parallel for a short amount of time) is an irrelevant niche that only really exists in HPC/super-computers and crypto-currency mining. The likely scenario (especially with these big.Little architectures) is that the scheduler has to deal with 100+ processes while one part of one process temporarily does something embarrassingly parallel; where the combined load the scheduler sees (from all processes) is still an irregular and constantly changing mixture of loads while it happens. – Brendan Aug 15 '21 at 07:17
  • "irregular and constantly changing mixture of loads" -- has nothing to do with whether or not they are independent and easily separable tasks. You seem to be thinking that I'm arguing for treating all tasks the same, I am not, and we don't in HPC either. In fact, Android uses all the same mechanisms for hinting to the scheduler that we use in HPC. There is the priority, which you mostly don't touch as it is controlled from kernel space, then the nice value in userspace that modifies the userspace priority, and cgroups. Different schedulers prioritize different goals: CFS, GTS, HPS, EAS... – ttbek Aug 18 '21 at 13:21

Is there a way in C++ to query a std::thread's properties and control which cores it will run on?

No. There is no standard API for this in C++.

Platform-specific APIs do have the ability to specify a specific logical core (or a set of such cores) for a software thread. For example, GNU has pthread_setaffinity_np.
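
A minimal sketch (Linux/glibc, where g++ predefines _GNU_SOURCE) of pinning a std::thread to logical core 1 via its native handle; the core number is an assumption you'd have to work out for your own machine:

```cpp
// Minimal sketch, Linux/glibc only.
#include <thread>
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    std::thread worker([] { /* ... do work ... */ });

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);  // logical core 1; adjust for your machine's topology

    // pthread_setaffinity_np returns an errno-style code rather than setting errno.
    int rc = pthread_setaffinity_np(worker.native_handle(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);

    worker.join();
}
```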

Note that this lets you specify "core 1" for your thread, but it doesn't necessarily get you the "performance" core unless you know which core that is. To figure that out, you may need to go below the OS level and into CPU-specific assembly programming. In the case of Intel, to my understanding, you would use the Enhanced Hardware Feedback Interface.

eerorika
  • [This function](https://man7.org/linux/man-pages/man3/pthread_getcpuclockid.3.html) looks useful for querying the clock speed for a given thread id. – Patrick Roberts Jul 19 '21 at 17:17
  • @PatrickRoberts Given that CPUs normally lower clocks when idle and boost when busy, there may be a need to stress the core while calling the function to find which core will boost highest. – eerorika Jul 19 '21 at 17:21
  • @PatrickRoberts: Also keep in mind that it's not just clocks, it's narrow pipeline width (and in-order exec on some ARM big.LITTLE chips, or very limited OoO exec window size on Intel Gracemont vs. Golden Cove) that make the high-efficiency cores slower than the high-performance cores. But yes, you could maybe microbenchmark, or query clocks after a few ms of warm-up, with a thread pinned to a core, to detect which cores are which. As long as you don't assume the clock ratios are a perf ratio, that's fine, just an indicator of slow vs. fast cores. – Peter Cordes Jul 19 '21 at 20:22
  • http://instlatx64.atw.hu/ has listings for upcoming Alder Lake CPUs, but no links yet to actual CPUID output, so IDK what CPUID will say about family / model, whether that will be uniform across cores or not. They do link https://review.coreboot.org/c/coreboot/+/49629/3/src/soc/intel/common/block/include/intelblocks/mp_init.h#49 which has a single number as the CPUID code for `CPUID_ALDERLAKE_M_A0`, but coreboot might only be running CPUID on the boot core, which might mean it doesn't need to know / recognize the CPUID for the other cores. – Peter Cordes Jul 19 '21 at 20:26
  • @PeterCordes If you can pin your thread, why not just pin the thread directly to the big or little core? I don't know about iOS, but there are Linux syscalls for working out which core is which... – Aron Jul 20 '21 at 04:23
  • @Aron: Why not? Because then your thread won't run at all if all the big cores are busy, but there are idle little cores. That might or might not be desirable. See Brendan's answer. But yes, good point that thread affinity can take a *set* of cores, rather than a single one, so you could pin some thread to "any big core". However, that requires you to have already worked out which core number is which, and *that's* the real question because there's no portable API for that either. (If there are any, they're even less portable than POSIX (pthreads), and may require tables of CPU details) – Peter Cordes Jul 20 '21 at 04:45
  • @PeterCordes because every architecture has HMP multi-core... – Aron Jul 20 '21 at 04:54
  • You shouldn't need to go below OS level. The OS should have a way to tell you this. Just below C++ standard level. – user253751 Jul 20 '21 at 11:18
  • @Aron `there are linux syscalls for working out which core is which` Which syscall? I'll add it to the answer. – eerorika Jul 20 '21 at 12:05

No, the C++ standard library has no direct way to query the sub-type of a CPU, or to state that you want a thread to run on a specific CPU.

But std::thread (and jthread) does have .native_handle(), which on most platforms will let you do this.

If you know the threading library implementation of your std::thread, you can use native_handle() to get at the underlying primitives, then use the underlying threading library to do this kind of low-level work.

This will be completely non-portable, of course.
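
As an illustration, a minimal sketch for MSVC's standard library, where native_handle() yields the Win32 HANDLE that the platform API expects:

```cpp
// Minimal sketch, Windows/MSVC only.
#include <thread>
#include <windows.h>

int main() {
    std::thread worker([] { /* ... background work ... */ });

    // Hint the scheduler that this thread's work is not urgent.
    SetThreadPriority(worker.native_handle(), THREAD_PRIORITY_BELOW_NORMAL);

    worker.join();
}
```

On libstdc++/libc++ the same native_handle() call yields a pthread_t instead, so each platform needs its own branch.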

Yakk - Adam Nevraumont

iPhones, iPads, and newer Macs have high- and low-performance cores for a reason. The low-performance cores allow some reasonable amount of work to be done while using the smallest possible amount of energy, making the battery of the device last longer. These additional cores are not there just for fun; if you try to get around them, you can end up with a much worse experience for the user.

If you use the C++ standard library for running multiple threads, the operating system will detect what you are doing and act accordingly. If your task only takes 10 ms on a high-performance core, it will be moved to a low-performance core; that's fast enough and saves battery life. If you have multiple threads using 100% of the CPU time, the high-performance cores will be used automatically (and the low-performance cores as well). If your battery runs low, the device can switch to all low-performance cores, which will get more work done with the battery charge you have.

You should really think about what you want to do. You should put the needs of the user ahead of your perceived needs. Apart from that, Apple recommends assigning OS-specific priorities to your threads, which improves behaviour if you do it right. Giving a thread the highest priority so you can get better benchmark results is usually not "doing it right".

gnasher729

You can't select the core that a thread will be physically scheduled to run on using std::thread. See here for more. I'd suggest using a framework like OpenMP or MPI, or you will have to dig into the native macOS APIs to select the core for your thread to execute on.
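
For instance, OpenMP (4.0 and later) handles affinity through environment variables rather than code; a minimal sketch, assuming you compile with -fopenmp, where the place APIs let you verify the binding took effect:

```cpp
// Minimal sketch; run as e.g.:  OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // omp_get_place_num() reports the "place" (core set) this thread is
        // bound to, or -1 if no binding is in effect.
        std::printf("thread %d on place %d\n",
                    omp_get_thread_num(), omp_get_place_num());
    }
}
```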

lmeninato

macOS provides a notion of "Quality of Service" for tasks, task queues and run loops, and threads. If you use libdispatch/GCD then the queue priorities map to the QoS as well. This article describes the QoS system in detail.

Using the macOS pthreads interface, you can set a thread's QoS before creating it, query a thread's QoS, or temporarily override a thread's QoS level (the override is not visible in the query function, though) using the non-portable functions in pthread/qos.h.
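
A minimal sketch of the self-QoS variant (macOS only; the class constant is one of several defined in pthread/qos.h):

```cpp
// Minimal sketch, macOS only.
#include <pthread.h>
#include <pthread/qos.h>

void run_background_work() {
    // QOS_CLASS_UTILITY marks long-running work the user isn't actively
    // waiting on; the scheduler may prefer an efficiency core for it.
    pthread_set_qos_class_self_np(QOS_CLASS_UTILITY, /*relative_priority=*/0);

    // ... do the work on this thread ...
}
```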

This system by no means offers guarantees about how your threads will be scheduled, but it can be used to give the scheduler a hint.

I'm not aware of any way to get a similar interface on other systems, but that doesn't mean one doesn't exist. I imagine such interfaces will become more widely discussed as these hybrid CPUs become more common.

EDIT: Intel provides information here about how to query this for their hybrid processors on Windows, and for the current CPU using cpuid; I haven't had a chance to play with this yet, though.
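
From Intel's documentation, a minimal sketch of the cpuid route (x86-64 with GCC/Clang's <cpuid.h>; the bit positions are from Intel's manuals, so treat this as untested, as the answer says):

```cpp
// Minimal sketch, x86-64 GCC/Clang only.
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;

    // CPUID.(EAX=07H,ECX=0):EDX bit 15 reports a hybrid part.
    if (!__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx) || !(edx & (1u << 15))) {
        std::puts("not a hybrid CPU (or leaf unsupported)");
        return 0;
    }

    // CPUID.(EAX=1AH,ECX=0):EAX bits 31:24 give the core type of the core this
    // thread is *currently* on: 0x20 = Atom (efficiency), 0x40 = Core
    // (performance). Pin the thread first, or the answer may be stale.
    __get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx);
    unsigned core_type = eax >> 24;
    std::printf("core type: %s\n",
                core_type == 0x20 ? "efficiency"  :
                core_type == 0x40 ? "performance" : "unknown");
}
```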

nfries88