
So what I am aware of is that Simultaneous Multithreading (Intel's Hyper-Threading, for example) enables a single CPU core to efficiently manage several threads at once. Most explanations I find say that it's like having more than one core at your disposal. But what I'm wondering is whether this is what is actually going on at a low (machine) level. Or is it more that, to the OS, it just looks like the work is being done on 2 cores, while in reality Simultaneous Multithreading just makes the core much more efficient at going back and forth between two (or more) different threads, giving the illusion of having more than one core?

Peter Cordes
Kaptain
  • Unfortunately, you've posted on the wrong site for this question. Stack Overflow is purely for programming questions. You should consider deleting this and reposting on [su] or [cs.se], assuming the question isn't already covered on those sites. – David Buck Sep 05 '21 at 16:47
  • 1
    This is also probably a duplicate of many of the [hyperthreading tagged questions](https://stackoverflow.com/questions/tagged/hyperthreading). [Peter Cordes' answer](https://stackoverflow.com/a/47447569/2467198) may be one of the better answers on the subject on SO. I have posted an answer here, which I believe provides a somewhat different perspective. –  Sep 07 '21 at 14:10

1 Answer


Simultaneous multithreading is defined in "Simultaneous Multithreading: Maximizing On-Chip Parallelism" (Dean M. Tullsen et al., 1995, PDF) as "a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle" ("issue" here means initiation of execution; an alternative use of the term means entering an instruction scheduler). "Simultaneous" refers to issuing instructions from different threads in the same cycle, distinguishing SMT from fine-grained multithreading, which rapidly switches between threads in execution (e.g., choosing each cycle which thread's instructions to execute), and from switch-on-event multithreading (which is more similar to OS-level context switches).

SMT implementations often interleave instruction fetch, decode, and commit, making those pipeline stages look more like the ones in a fine-grained multithreaded or non-multithreaded core. SMT exploits the fact that an out-of-order superscalar already chooses dynamically among arbitrary instructions (within a window), recognizing that execution resources are typically not fully used. (In-order SMT provides relatively greater benefits since in-order execution lacks the latency hiding of out-of-order execution, but the complexity of pipeline control is increased.)
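To make the "simultaneous" part concrete, here is a toy Python sketch (my own illustration, not from the paper; the function names and parameters are invented). It counts how many cycles two threads need to issue their instructions under fine-grained multithreading, where only one thread may use the issue slots in a given cycle, versus SMT, where slots left unused by one thread can be filled by another in the same cycle:

```python
from collections import deque

def fine_grained_cycles(threads, issue_width):
    """One thread owns all issue slots each cycle (round-robin);
    slots it cannot fill are wasted."""
    queues = [deque(t) for t in threads]
    cycles = 0
    turn = 0
    while any(queues):
        q = queues[turn % len(queues)]
        for _ in range(issue_width):
            if q:
                q.popleft()
        turn += 1
        cycles += 1
    return cycles

def smt_cycles(threads, issue_width):
    """Issue slots may be filled from any thread in the same cycle."""
    queues = [deque(t) for t in threads]
    cycles = 0
    while any(queues):
        slots = issue_width
        for q in queues:
            while slots and q:
                q.popleft()
                slots -= 1
        cycles += 1
    return cycles

# Two threads with 2 instructions each on a 4-wide issue stage:
# fine-grained wastes half the slots each cycle, SMT fills them.
threads = [["a1", "a2"], ["b1", "b2"]]
print(fine_grained_cycles(threads, 4))  # 2
print(smt_cycles(threads, 4))           # 1
```

This is of course far simpler than a real pipeline (no dependencies, latencies, or structural hazards), but it captures the defining property: under SMT, instructions from different threads occupy the same cycle's issue slots.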

A barrel processor (pure round-robin, fine-grained thread scheduling with nops issued for non-ready threads) with shared caches would look more like separate cores running at 1/thread_count of the clock frequency (with shared caches), since such a design lacks dynamic contention for execution resources. It is also arguable that having instructions from multiple threads in the processor pipeline at the same time represents parallel instruction processing; distinct threads can have instructions being processed (in different pipeline stages) at the same time. Even with switch-on-event multithreading, a cache miss can be processed in parallel with the execution of another thread, i.e., multiple instructions from another thread can be processed during the "processing" of a load instruction.
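The barrel-processor behavior can be sketched the same way (again a hypothetical toy model, not tied to any real design): strict round-robin, one thread per cycle, with the slot wasted as a nop whenever the scheduled thread is stalled, never handed to another thread:

```python
def barrel_run(threads, stall_cycles):
    """threads: per-thread instruction counts.
    stall_cycles: set of (thread, cycle) pairs during which that
    thread is not ready (e.g., waiting on a cache miss)."""
    remaining = list(threads)
    trace = []
    cycle = 0
    while any(remaining):
        t = cycle % len(threads)
        if remaining[t] and (t, cycle) not in stall_cycles:
            remaining[t] -= 1
            trace.append(f"T{t}")
        else:
            trace.append("nop")  # slot wasted; no dynamic reallocation
        cycle += 1
    return trace

# Thread 1 stalls in cycle 1; its slot becomes a nop rather than
# being given to thread 0.
print(barrel_run([2, 2], stall_cycles={(1, 1)}))
# ['T0', 'nop', 'T0', 'T1', 'nop', 'T1']
```

Each thread only ever sees every Nth cycle, which is why, from the software's point of view, a barrel processor resembles N slower cores with shared caches.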

The distinction from OS-level context switching can be even fuzzier if the ISA provides instructions that are not interrupt-atomic. For example, on x86 a timer interrupt can lead an OS to perform a context switch while a string instruction is in progress. In some sense, during the entire time slice of the other thread, the string instruction might be considered still to be "executing", since its operation has not completed. With hardware prefetching, some degree of forward progress by the earlier thread might, in theory, continue past the point when another thread starts running, so even a requirement of simultaneous activity in multiple threads might be satisfied. (If processing of long x86 string instructions were handed off to an accelerator, the instruction might run fully in parallel with another thread running on the core that initiated the instruction.)
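The reason an x86 string instruction like `rep movsb` can be interrupted mid-way is that its progress is architectural: RCX/RSI/RDI are updated as the copy advances, so after a context switch the instruction simply resumes from the saved registers. A minimal Python model of that idea (illustrative only, not an x86 emulator; `max_bytes` here stands in for an interrupt arriving):

```python
def rep_movsb(mem, state, max_bytes=None):
    """Copy state['rcx'] bytes from state['rsi'] to state['rdi'].
    If max_bytes is given, stop early (as if interrupted); the
    register state then fully describes the partial progress."""
    done = 0
    while state["rcx"] > 0:
        if max_bytes is not None and done == max_bytes:
            return state  # "interrupted": resumable architectural state
        mem[state["rdi"]] = mem[state["rsi"]]
        state["rsi"] += 1
        state["rdi"] += 1
        state["rcx"] -= 1
        done += 1
    return state

mem = list(b"hello world.....")
state = {"rcx": 5, "rsi": 0, "rdi": 11}
state = rep_movsb(mem, state, max_bytes=2)  # interrupt after 2 bytes
# ... the OS could run another thread here; `state` holds the progress ...
state = rep_movsb(mem, state)               # resume and finish
print(bytes(mem).decode())                  # hello worldhello
```

Because no progress is lost at the interrupt, the OS never has to either complete or roll back the instruction, which is exactly what makes the "is it still executing?" question during the other thread's time slice a matter of interpretation.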