
I don't really understand why processors with doubled logical processors are much more expensive than processors with single logical processors. As far as I can tell, there is no difference between running code on 6 or 12 threads on a 6-core/12-thread CPU.

As requested in the comments, here is a C# example emulating heavy load on each thread:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Program
{
    static void Main(string[] args)
    {
        if (IntPtr.Size != 8)
            throw new Exception("use only x64 code, 2020 is coming...");

        // 6 for physical cores, 12 for logical cores
        const int limit_threads = 12;
        const int limit_actions = 256;
        const int limit_loop = 1000 * 1000 * 10;
        const double power = 1.0 / 17.0;

        long result = 0;
        var action = new Action(() =>
        {
            long value = 0;
            for (int i = 0; i < limit_loop; i++)
                value += (long)Math.Pow(i, power);

            Interlocked.Add(ref result, value);
        });

        var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();
        var sw = Stopwatch.StartNew();

        Parallel.Invoke(new ParallelOptions()
        {
            MaxDegreeOfParallelism = limit_threads
        }, actions);

        Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}s\nresult={result}\nlimit_threads={limit_threads}\nlimit_actions={limit_actions}\nlimit_loop={limit_loop}");
    }
}

Results for 6 threads (AMD Ryzen 2600):

done in 13,7074543s
result=5086445312
limit_threads=6
limit_actions=256
limit_loop=10000000

Results for 12 threads (AMD Ryzen 2600):

done in 11,3992756s
result=5086445312
limit_threads=12
limit_actions=256
limit_loop=10000000

That is only about a 17% performance boost (13.7 s vs 11.4 s) from using all logical cores instead of only the physical ones, which is almost nothing. What do you say now?

Can someone provide sample code which runs noticeably faster when using processor multithreading (AMD SMT or Intel HT) compared to using only physical cores?

  • I don't get your point. Assume you have 10 bottles of beer to move from the kitchen to the dining room. Compare: you carry them alone, two at a time, versus: 5 other people are there, and each one takes 2 bottles. That is what happens when you have more than one core. You get **real** parallelism, compared to "fake" time-slot-based scheduling. – GhostCat Nov 28 '19 at 08:29
  • You ask two questions and neither is a good SO question. For the first question, they're more expensive because they have more transistors. They have more transistors because they use them to implement more features. For the second one, just use any loop and try it on two different processors, or on the same processor with HT enabled and disabled. Or go to each company's site and check the code examples. – Panagiotis Kanavos Nov 28 '19 at 08:30
  • @GhostCatsaysReinstateMonica yep, it should be so, but it is not. Let's say you have an AMD Ryzen 2600 and two programs with an identical algorithm but a different number of threads: 6 and 12. They will finish the job at the same time; check for yourself. –  Nov 28 '19 at 08:31
  • To understand more, you'll have to look into how CPUs work. They are **VERY** complex devices - one CPU actually has a pipeline of components that execute different parts of a command. Different circuits implement different operations and some operations require the combination of multiple circuits. With HT, more than two commands can be in the pipeline at the same time. – Panagiotis Kanavos Nov 28 '19 at 08:33
  • Then ask specifically about that. Your question is basically a (way too broad) requirements dump, with "write code for me". That is not a good starter. Turn it around: when you have code that behaves in unexpected ways, then add a [mcve] and ask about that. – GhostCat Nov 28 '19 at 08:33
  • @PanagiotisKanavos that is the point: according to my tests there is no difference with HT enabled or disabled. Moreover, the Intel Math Kernel Library also uses only the physical core count, not the logical core count. –  Nov 28 '19 at 08:35
  • `They will finish job at the same time, check yourself` check *what*? What code? Because I do have code that works better with HT - loop optimization is performed by the compiler and CPU. You can write a loop so badly that neither the compiler nor the CPU can optimize it – Panagiotis Kanavos Nov 28 '19 at 08:35
  • What tests? On which CPUs, what machine, what patch? You're asking people to guess why *your* code isn't running fast enough? Without even showing the code? – Panagiotis Kanavos Nov 28 '19 at 08:36
  • @PanagiotisKanavos any machine, any CPU with HT, any code. Is it clear now? While you are increasing the thread count up to the physical core count you will get a performance boost, but not above it. –  Nov 28 '19 at 08:38
  • I can prove that HT works *very* easily. Create a big matrix in R and use `svd` on a quad machine with the official distro. You get X time, using only one core. Now use Revolution R's (now Microsoft's) which uses Intel's Math library. You get **7 times** better performance, using *all* logical cores. On a quad-core. – Panagiotis Kanavos Nov 28 '19 at 08:39
  • @PanagiotisKanavos I will get 7 times better performance using all physical cores. Intel MKL uses ONLY PHYSICAL CORES, not logical ones. –  Nov 28 '19 at 08:42
  • @PanagiotisKanavos See updated question with code example –  Nov 28 '19 at 09:51
  • Read the answers. You didn't even mention which language you use yet, but I can tell you right now that 1) .NET's optimizations don't go that deep. 2) That `Parallel.Invoke` is wrong, and that Interlocked adds a synchronization point. `Parallel` is for *data* parallelism. If you really wanted to measure performance you'd have to use a **LOT** of data (millions of numbers) with `Parallel.ForEach`. Otherwise the partitioning and parallelization overhead will cost more than the benefit. The effects of *memory transfers* don't appear when there's so little data that it can fit in the CPU's cache. – Panagiotis Kanavos Nov 28 '19 at 10:00
  • And finally, your code doesn't control HT at all. It just creates extra threads, and yet, there was a 16% improvement that may or may not be due to transient reasons. 11 sec is too short a duration and using Stopwatch like that will also measure system delays. Use BenchmarkDotNet to get meaningful results. – Panagiotis Kanavos Nov 28 '19 at 10:05
  • To actually measure the effect of HT run the same benchmark before and after disabling HT in your machine's BIOS. Don't force the number of threads, just let .NET do its job. By default it creates as many partitions and workers as there are cores. – Panagiotis Kanavos Nov 28 '19 at 10:07
  • @PanagiotisKanavos I'm not planning to tweak my code to slow it down, only to speed it up. So show me any example in any language with a huge HT performance boost which would be impossible to match using only physical cores. –  Nov 28 '19 at 10:10
  • @AlekDepler given the flawed logic, you already proved that HT boosts performance (16%), although I believe that's an artifact of bad benchmarking and completely inappropriate code. HT wasn't turned off at all, it was used in both cases. I'm telling you how to boost performance, not how to slow it down. – Panagiotis Kanavos Nov 28 '19 at 10:12
  • @AlekDepler now, `Parallel.ForXYZ` methods have overloads that collect partial results and allow a local `finally` operation. That should be what increments any totals (see the sketch after these comments). Again though, without any data you can't measure data parallelism performance. – Panagiotis Kanavos Nov 28 '19 at 10:14
  • @AlekDepler read [Parallel's Remarks section](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel?view=netframework-4.8#remarks) `The Parallel class provides library-based data parallel replacements for common operations such as for loops, for each loops, and execution of a set of statements.` – Panagiotis Kanavos Nov 28 '19 at 10:16
  • Not to mention you haven't accounted for the Windows scheduler. By increasing the number of threads you have changed the quantum allocation to your process. You are so far from the metal that almost any action you take could cause unintended performance differences. For example, depending on your PC's RAM state, the JITer can produce code that crosses memory boundaries, which in some cases has been known to cause up to 20% perf degradation. – Aron Nov 28 '19 at 10:31
  • I also noticed that your workload has a very low queue depth, where it is possible to interleave the core between the two threads. Real-world workloads are typically more complex, meaning that keeping the CPU's full pipeline busy will result in superscalar performance, which is impossible with the Interlocked.Add on a tight loop. – Aron Nov 28 '19 at 10:41
  • @AlekDepler HT wasn't designed to give performance boosts for running a single program; it was designed to give performance boosts to a modern multitasking/multiuser environment. A well-designed cooperatively threaded OS would make HT completely redundant. – Aron Nov 28 '19 at 10:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203262/discussion-between-aron-and-panagiotis-kanavos). – Aron Nov 28 '19 at 16:40
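
To make the comments above concrete, here is a minimal sketch (not code from the thread; the class and method names are illustrative) of the data-parallel pattern Panagiotis Kanavos describes: `Parallel.ForEach` over a range partitioner with a per-worker subtotal, so the shared total is touched only once per worker instead of once per action:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class DataParallelSketch
{
    // Same workload as the question, expressed as data parallelism:
    // the range partitioner splits [0, limit_loop) into chunks, each
    // worker accumulates a thread-local sum, and Interlocked.Add runs
    // once per worker in the localFinally delegate.
    static long Run(int maxDegree)
    {
        const int limit_loop = 1000 * 1000 * 10;
        const double power = 1.0 / 17.0;
        long result = 0;

        Parallel.ForEach(
            Partitioner.Create(0, limit_loop),
            new ParallelOptions { MaxDegreeOfParallelism = maxDegree },
            () => 0L,                                     // per-worker init
            (range, _, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    local += (long)Math.Pow(i, power);
                return local;
            },
            local => Interlocked.Add(ref result, local)); // one sync per worker

        return result;
    }

    static void Main() => Console.WriteLine(Run(Environment.ProcessorCount));
}

On its own this still does not isolate the effect of SMT/HT, but it removes the `Parallel.Invoke` misuse called out above.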
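
And since the `Stopwatch` timing drew objections, here is a minimal BenchmarkDotNet harness sketch (assuming the BenchmarkDotNet NuGet package and a Release build; `PowSumBenchmark` is an illustrative name). `[Params]` reruns the same benchmark with 6 and 12 workers so the two configurations can be compared with proper warm-up and statistics:

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class PowSumBenchmark
{
    // 6 = physical core count, 12 = logical core count on a Ryzen 2600.
    [Params(6, 12)]
    public int Threads;

    [Benchmark]
    public long PowSum()
    {
        const int limit_actions = 256;
        const int limit_loop = 1000 * 1000 * 10;
        const double power = 1.0 / 17.0;
        long result = 0;

        var action = new Action(() =>
        {
            long value = 0;
            for (int i = 0; i < limit_loop; i++)
                value += (long)Math.Pow(i, power);
            Interlocked.Add(ref result, value);
        });
        var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();

        Parallel.Invoke(new ParallelOptions { MaxDegreeOfParallelism = Threads }, actions);
        return result; // returning the value keeps it from being optimized away
    }
}

public static class BenchEntryPoint
{
    public static void Main() => BenchmarkRunner.Run<PowSumBenchmark>();
}

Note that each invocation takes several seconds, so a full run will be slow; to actually isolate SMT/HT, repeat the run with HT disabled in the BIOS, as suggested above.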

2 Answers


I think that varying the price of processors depending on the availability of SMT/HT technology is just a matter of marketing strategy.
The hardware is probably the same in every case, but the feature is disabled by the manufacturer on some of them to offer cheaper models.

This technology relies on the fact that some micro-operations in a single instruction have to wait for something to be executed; so instead of just waiting, the same core uses its circuits to make some progress on the micro-operations from another thread.
From a coarse point of view, we can perceive the execution of two (or more on certain models) sequences of micro-operations from two different threads executed on a single piece of hardware (except for some duplicated parts, like the registers...).

The efficiency of this technology depends on the problem.
After various tests, I noticed that if the problem is compute-bound, i.e. the limiting factor is the time needed to compute (add, multiply...), but not memory-bound (the data are already available, no need to wait for memory), then this technology does not provide any benefit.
This is due to the fact that there is no gap to fill in the two sequences of micro-operations, thus the interleaved execution of two threads is not better than two independent serial executions.
In the exact opposite case, when the problem is memory-bound but not compute-bound, there is again no benefit because both threads have to wait for the data coming from memory.
I only noticed an improvement in performance when the problem is mixed between data access and computation; in this case, when one thread is waiting for data, the same core can make some progress in the computations of the other thread, and vice versa.

Edit
Below is an example to illustrate these situations; I obtain the following results (quite stable when run many times; dual Xeon E5-2697 v2, Linux 5.3.13).

In this memory-bound situation, HT does not help.

$ ./prog_ht mem
24 threads running memory_task()
result: 1e+17
duration: 13.0383 seconds
$ ./prog_ht mem ht
48 threads (ht) running memory_task()
result: 1e+17
duration: 13.1096 seconds

In this compute-bound situation, HT helps (almost a 30% gain).
(I don't know exactly what is involved in the hardware when computing cos, but there must be some latencies which are not due to memory access.)

$ ./prog_ht
24 threads running compute_task()
result: -260.782
duration: 9.76226 seconds
$ ./prog_ht ht
48 threads (ht) running compute_task()
result: -260.782
duration: 7.58181 seconds

In this mixed situation, HT helps much more (around a 70% gain).

$ ./prog_ht mix
24 threads running mixed_task()
result: -260.782
duration: 60.1602 seconds
$ ./prog_ht mix ht
48 threads (ht) running mixed_task()
result: -260.782
duration: 35.121 seconds

Here is the source code (in C++; I'm not comfortable with C#).

/*
  g++ -std=c++17 -o prog_ht prog_ht.cpp \
      -pedantic -Wall -Wextra -Wconversion \
      -Wno-missing-braces -Wno-sign-conversion \
      -O3 -ffast-math -march=native -fomit-frame-pointer -DNDEBUG \
      -pthread
*/

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <thread>
#include <chrono>
#include <cstdint>
#include <random>
#include <cmath>

#include <pthread.h>

bool // success
bind_current_thread_to_cpu(int cpu_id)
{
  /* !!!!!!!!!!!!!! WARNING !!!!!!!!!!!!!
  I checked the numbering of the CPUs according to the packages and cores
  on my computer/system (dual Xeon E5-2697 v2, Linux 5.3.13)
     0 to 11 --> different cores of package 1
    12 to 23 --> different cores of package 2
    24 to 35 --> different cores of package 1
    36 to 47 --> different cores of package 2
  Thus using cpu_id from 0 to 23 does not bind more than one thread
  to each single core (no HT).
  Of course using cpu_id from 0 to 47 binds two threads to each single
  core (HT is used).
  This numbering is absolutely NOT guaranteed on any other computer/system,
  thus the relation between thread numbers and cpu_id should be adapted
  accordingly.
  */
  cpu_set_t cpu_set;
  CPU_ZERO(&cpu_set);
  CPU_SET(cpu_id, &cpu_set);
  return !pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set);
}

inline
double // seconds since 1970/01/01 00:00:00 UTC
system_time()
{
  const auto now=std::chrono::system_clock::now().time_since_epoch();
  return 1e-6*double(std::chrono::duration_cast
                     <std::chrono::microseconds>(now).count());
}

constexpr auto count=std::int64_t{20'000'000};
constexpr auto repeat=500;

void
compute_task(int thread_id,
             int thread_count,
             const int *indices,
             const double *inputs,
             double *results)
{
  (void)indices; // not used here
  (void)inputs; // not used here
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      result+=std::cos(double(i));
    }
  }
  results[thread_id]+=result;
}

void
mixed_task(int thread_id,
           int thread_count,
           const int *indices,
           const double *inputs,
           double *results)
{
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      const auto index=indices[i];
      result+=std::cos(inputs[index]);
    }
  }
  results[thread_id]+=result;
}

void
memory_task(int thread_id,
            int thread_count,
            const int *indices,
            const double *inputs,
            double *results)
{
  bind_current_thread_to_cpu(thread_id);
  const auto work_begin=count*thread_id/thread_count;
  const auto work_end=std::min(count, count*(thread_id+1)/thread_count);
  auto result=0.0;
  for(auto r=0; r<repeat; ++r)
  {
    for(auto i=work_begin; i<work_end; ++i)
    {
      const auto index=indices[i];
      result+=inputs[index];
    }
  }
  results[thread_id]+=result;
}

int
main(int argc,
     char **argv)
{
  //~~~~ analyse command line arguments ~~~~
  const auto args=std::vector<std::string>{argv, argv+argc};
  const auto has_arg=
    [&](const auto &a)
    {
      return std::find(cbegin(args)+1, cend(args), a)!=cend(args);
    };
  const auto use_ht=has_arg("ht");
  const auto thread_count=int(std::thread::hardware_concurrency())
                          /(use_ht ? 1 : 2);
  const auto use_mix=has_arg("mix");
  const auto use_mem=has_arg("mem");
  const auto task=use_mem ? memory_task
                          : use_mix ? mixed_task
                                    : compute_task;
  const auto task_name=use_mem ? "memory_task"
                               : use_mix ? "mixed_task"
                                         : "compute_task";

  //~~~~ prepare input/output data ~~~~
  auto results=std::vector<double>(thread_count);
  auto indices=std::vector<int>(count);
  auto inputs=std::vector<double>(count);
  std::generate(begin(indices), end(indices),
    [i=0]() mutable { return i++; });
  std::copy(cbegin(indices), cend(indices), begin(inputs));
  std::shuffle(begin(indices), end(indices), // fight the prefetcher!
               std::default_random_engine{std::random_device{}()});

  //~~~~ launch threads ~~~~
  std::cout << thread_count << " threads"<< (use_ht ? " (ht)" : "")
            << " running " << task_name << "()\n";
  auto threads=std::vector<std::thread>(thread_count);
  const auto t0=system_time();
  for(auto i=0; i<thread_count; ++i)
  {
    threads[i]=std::thread{task, i, thread_count,
                           data(indices), data(inputs), data(results)};
  }

  //~~~~ wait for threads ~~~~
  auto result=0.0;
  for(auto i=0; i<thread_count; ++i)
  {
    threads[i].join();
    result+=results[i];
  }
  const auto duration=system_time()-t0;
  std::cout << "result: " << result << '\n';
  std::cout << "duration: " << duration << " seconds\n";
  return 0;
}
prog-fh
  • @AlekDepler I just added an example to compare performances with/without HT in different situations. – prog-fh Nov 28 '19 at 15:28
  • Hmm, that is interesting... It seems you are right: mixed mode leads to a huge performance boost. Moreover, this boost is variable and depends on the specific algorithm. I think that the std::cos() implementation also does some kind of mixing of data in memory, and because of that also leads to a 30% speedup when HT is enabled. –  Nov 28 '19 at 22:01
  • @AlekDepler I think that the floating-point machinery inside a core is much more complex than we imagine and there are latencies that are not due to memory access in this case. These latencies are good opportunities for HT. – prog-fh Nov 28 '19 at 22:16

TLDR: SMT/HT is a technology that exists to offset the cost of massive multithreading as opposed to speeding up your computation with more cores.

You have misunderstood what SMT/HT does.

"As far as I noticed there is no difference with running code on 6 or 12 threads for 6cores-12threads CPU".

If this is true, then SMT/HT is working.

To understand why, you need to understand modern OS kernels and Kernel Threads. Today's Operating Systems use what is called Preemptive Threading.

The OS kernel divides up each core's time into slices called "quanta" and, using interrupts, schedules the various processes in a complicated round-robin fashion.

The part we want to look at is the interrupt. When a CPU core is scheduled to switch to running another thread, we call this a "context switch". Context switches are expensive, slow processes, as the entire state and flow of the highly pipelined CPU must be stopped, saved and swapped out for another state (along with caches, registers, lookup tables etc.). According to this answer, context-switch times are measured in microseconds (thousands of clock cycles), and they will only get worse as CPUs become more complicated.

The point of SMT/HT is to cheat, by having each CPU core able to store two states at the same time (imagine having two monitors instead of one; you still only use one at a time, but you are more productive because you don't need to rearrange your windows each time you switch tasks). So SMT/HT processors can context switch much faster than non-SMT/HT processors.

So back to your example. If you turned off SMT on your Ryzen 2600 and then ran the same workload with 12 threads, you would find that it performs significantly slower than with 6 threads.

Also, note that more threads do not make things faster.
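
As a rough way to see where scaling stops on a given machine, one can sweep the degree of parallelism past the logical core count and watch the wall time. This is only a sketch reusing the question's workload (not a rigorous benchmark; see the BenchmarkDotNet comments above), and the exact numbers will vary with the scheduler and the BIOS SMT setting:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThreadSweep
{
    static void Main()
    {
        const int limit_actions = 256;
        const int limit_loop = 1000 * 1000 * 10;
        const double power = 1.0 / 17.0;

        // On a 6-core/12-thread part the curve should flatten around 12
        // and, per the argument above, may even regress beyond it.
        foreach (var threads in new[] { 1, 2, 4, 6, 12, 24, 48 })
        {
            long result = 0;
            var action = new Action(() =>
            {
                long value = 0;
                for (int i = 0; i < limit_loop; i++)
                    value += (long)Math.Pow(i, power);
                Interlocked.Add(ref result, value);
            });
            var actions = Enumerable.Range(0, limit_actions).Select(x => action).ToArray();

            var sw = Stopwatch.StartNew();
            Parallel.Invoke(new ParallelOptions { MaxDegreeOfParallelism = threads }, actions);
            Console.WriteLine($"{threads,2} threads: {sw.Elapsed.TotalSeconds:F1}s (result={result})");
        }
    }
}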

Aron