
There was a discussion at work related to hyperthreaded Xeon machines. My (superficial) understanding of how Hyperthreading works is that the CPU is physically multiplexing the instructions coming from two "threads". That is, the execution units are shared, but there are two different architectural sets (register sets, instruction queues, maybe even branch-predictors etc) -- one for each thread. The execution units and their buffers / queues are always ready to receive new instructions / data, so there is no advantage from this angle in disabling one of the threads instead of keeping both.

My colleague was implying that by shutting down hyperthreading we could achieve a speedup, as the CPU running the single thread no longer has to "look" to see if the other thread also has some work to do. My understanding is that all this circuitry is already hardwired to multiplex incoming data/instructions from both threads, and that disabling hyperthreading just shuts off one of the threads, preventing it from receiving any instructions / data, but that nothing else actually differs. Is this a good mental model of how hyperthreading works?

I do understand that there are a multitude of factors at play, such as the memory working sets, the problem of shared caches, etc., that may influence how well a 2-thread hyperthreaded CPU behaves vs. the same CPU with hyperthreading disabled, but my question is more about whether disabling hyperthreading somehow makes the whole flow of data / instructions through the pipeline faster or not. Can there be problems of contention when trying to fill up the buffers at the head of the backend, for instance?

My colleague's explanation also somehow included hypervisors, but I fail to see the relation between the two; they seem to be orthogonal concepts.

Thanks!

    Some details [in this answer](https://stackoverflow.com/q/35748305/555045), many resources are shared. Looking to see whether the other thread has something to do isn't really literally a thing, both streams of instructions are piled on a heap and looked at. – harold Sep 10 '19 at 13:52
  • HT exploits imperfect usage of the CPU resources; to run code perfectly suited for your µ-architecture it's better to disable HT, as you can then have the whole backend for that code. However, it is not always possible to use all the resources of the CPU, and in general executing with both threads will slow down each program far less than scheduling them in and out. The CPU fetches instructions in the frontend, and you either end up with more instructions than can be executed in the backend (due to dependencies) or with fewer (due to stalls). In both cases HT would help. HVs are not related. – Margaret Bloom Sep 10 '19 at 15:45

2 Answers


Right, hyperthreading works by multiplexing the instruction streams of each thread in the frontend stages and the retirement stage of the pipeline. In the RS and MOB units, uops from different threads can be dispatched to the execution units or cache pipes in the same cycle; these two regions of the pipeline are mostly oblivious to hyperthreading. Also, if one thread is stalled at any stage of the pipeline at a particular cycle, the full bandwidth of that stage at that cycle can be utilized by the other hyperthread(s). The resources (i.e., buffer entries) dedicated to one thread due to partitioning are made available to the other thread(s) if that thread goes into the C1 or a deeper sleep state, or if hyperthreading is disabled.

Each thread has its own architectural state, as described in Section 8.7.1 of the Intel manual titled "State of the Logical Processors." Most architectural registers are replicated for each thread. This is achieved by replicating the RAT structure in the pipeline. Memory is also part of the architectural state, but Intel processors are all shared-memory processors, which means that memory is shared between all cores of the system.

From the phrase "shutting down hyperthreading we could achieve a speedup," it's not clear to me what performance metric and reference system are being used to make the comparison. Suppose you want to compare the wall-clock execution time of two tasks in the following two configurations:

  • The two tasks are running on the same physical core with hyperthreading enabled and under the assumption that there are no context switches.
  • The two tasks are running on different physical cores with hyperthreading disabled and under the assumption that there are no context switches.

Usually, the second configuration would yield lower execution time, but the possible interactions between the tasks on an HT-capable core are too complicated to know for sure. For example, you mentioned that the two tasks may conflict on the private data caches, but there is also an opportunity for sharing data. In addition, what's happening on the rest of the system can impact the speedup you may get from disabling hyperthreading.

You may need to backtrack a little and determine whether this comparison is needed in the first place. If the total number of tasks that are in the runnable state is not larger than the total number of physical cores, would your hypervisor schedule the vCPUs on different physical cores, or choose to pack them more tightly on a smaller number of physical cores and put the other cores in a sleep state? The Linux kernel, for example, usually prefers to schedule one thread on each physical core before utilizing the other logical core of each core. If the number of tasks is larger than the number of physical cores, you need to do a different comparison, where hyperthreading may give you the advantage of avoiding context switching. This is the main situation where hyperthreading can improve overall performance. You can even achieve higher speedups by determining which pairs of tasks are good "siblings" and changing their affinities so that each friendly pair is scheduled on the same physical core. You'll have to do this optimization manually because most OSes and hypervisors can't do it automatically (though there are research proposals on this).
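On Linux you can discover the sibling pairs through sysfs and pin tasks with `taskset`; a minimal sketch (the CPU numbers and task names are hypothetical, check the topology on your own machine):

```
# Which logical CPUs share a physical core with CPU 0?
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# prints e.g. "0,4"

# Pin a friendly pair of tasks to the two siblings of one physical core
taskset -c 0 ./task_a &
taskset -c 4 ./task_b &
wait
```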

Another great advantage of hyperthreading is that it can yield better performance-per-energy, which is a better metric to use when energy consumption is as important as performance. For example, if there are only two runnable tasks, you may be able to achieve higher performance-per-energy by running the two tasks on the logical cores of the same physical core rather than on different physical cores, even if there is an abundance of physical cores.
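If you want to measure this yourself, one rough approach is to read package power through Intel's RAPL counters with `turbostat` (a sketch, assuming root access, an Intel CPU with RAPL, the hypothetical 0/4 sibling pair from above, and a made-up `run_both_tasks.sh` wrapper that launches the two tasks):

```
# Package watts while the two tasks share one physical core's siblings
sudo turbostat --quiet --show PkgWatt taskset -c 0,4 ./run_both_tasks.sh

# ...and while they run on two different physical cores
sudo turbostat --quiet --show PkgWatt taskset -c 0,1 ./run_both_tasks.sh
```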

The general recommendation is to keep hyperthreading enabled, unless you have strong empirical evidence or a security-related reason that would justify disabling it.

Hadi Brais

When one logical core is in a low-power sleep state, the physical core switches into single-thread mode and un-partitions resources that are statically partitioned when running in HT mode. (Including the ROB, store buffer, iTLB or dTLB on some CPUs, and the IDQ on CPUs where it isn't replicated. Different generations of Intel CPU replicate some of these structures instead of statically partitioning them for hyperthreaded mode. Resources that are competitively shared, like back-end execution units and L1d cache, can already be used more heavily by one thread when the other is mostly stalled but not in a sleep state.)

There's a hardware performance counter for this state: under Linux you can use `perf stat -e cpu_clk_thread_unhalted.one_thread_active ./my_program`. On my 4GHz Skylake, that event ticks at about 24 MHz when a logical core has a physical core all to itself.
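To interpret that, it helps to know which logical CPUs are HT siblings of the same physical core; `lscpu` can print the mapping:

```
lscpu --extended=CPU,CORE
# logical CPUs that show the same CORE value are HT siblings
# (e.g. CPU 0 and CPU 4 both mapping to CORE 0)
```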

There's nothing special about disabling HT in the BIOS, OS, or hypervisor.

But doing that means timer interrupts, task scheduling, or whatever else will never wake up the sibling core of a core that your code is running on. That can happen if you leave HT enabled, but the perf impact is very small.
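If you do want sibling cores kept asleep without a trip to the BIOS, Linux exposes runtime knobs in sysfs (a sketch; the global `smt/control` file only exists on reasonably recent kernels, and the per-CPU number is an example you'd take from your own topology):

```
# Globally disable SMT at runtime (recent kernels)
echo off | sudo tee /sys/devices/system/cpu/smt/control

# Or offline a single sibling logical CPU
echo 0 | sudo tee /sys/devices/system/cpu/cpu4/online
# and bring it back later
echo 1 | sudo tee /sys/devices/system/cpu/cpu4/online
```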

If anything you do on the machine does benefit from hyperthreading, it might make sense to leave HT enabled. (e.g. compiling with `make -j`: compilers tend to bottleneck on latency, cache misses, and branch mispredicts instead of memory bandwidth, front-end or back-end throughput, or cache footprint.)
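An easy empirical check is to time the build both ways; a sketch assuming a hypothetical 4-core / 8-thread machine where CPUs 0-3 are one logical core per physical core (verify against your topology first):

```
make clean && time make -j8                  # all 8 logical cores
make clean && time taskset -c 0-3 make -j4   # one thread per physical core
```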


> as the CPU running the single thread no longer has to "look" to see if the other thread also has some work to do.

That's not the actual mechanism for the perf cost. If both logical threads have instructions ready to run, they alternate cycles in the front-end, issuing groups of 4 uops.

If one logical thread is stalled (e.g. its half of the ROB is full, I-cache miss, or recovering from a branch mispredict), the other logical thread gets all the front-end cycles. This doesn't require switching to "one_thread_active" mode; I think this happens with cycle granularity.

See also https://agner.org/optimize/ for a more in-depth look at how x86 CPUs do superscalar out-of-order execution, and which resources are statically partitioned vs. competitively shared. (And some useful commentary about when HT is useful vs. neutral or harmful for parallel workloads that can scale efficiently with number of threads, like a matmul or something).

Peter Cordes