
I am tuning my GEMM code and comparing it with Eigen and MKL. I have a system with four physical cores. Until now I have used the default number of OpenMP threads (eight on my system). I assumed this would be at least as good as four threads. However, I discovered today that if I run Eigen and my own GEMM code on a large dense matrix (1000x1000), I get better performance using four threads instead of eight. The efficiency jumped from 45% to 65%. I think this can also be seen in this plot https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba

The difference is quite substantial. However, the performance is much less stable: it jumps around quite a bit from iteration to iteration, both with Eigen and with my own GEMM code. I'm surprised that hyperthreading makes the performance so much worse. I guess this is not really a question; it's an unexpected observation which I'm hoping to get feedback on.

I see that not using hyperthreading is also suggested here:
How to speed up Eigen library's matrix product?

I do have a question regarding measuring maximum performance. What I do now is run CPUz and look at the frequency while my GEMM code is running, and then use that number in my calculations (4.3 GHz on one overclocked system I use). Can I trust this number for all threads? How do I know the frequency per thread to determine the maximum? How do I properly account for turbo boost?


3 Answers


The purpose of hyperthreading is to improve CPU utilization for code exhibiting high latency. Hyperthreading masks this latency by interleaving two threads at once, thus exposing more instruction-level parallelism.

However, a well-written matrix product kernel already exhibits excellent instruction-level parallelism and thus exploits nearly 100% of the CPU's resources. There is therefore no room for a second "hyper" thread, and the overhead of managing it can only decrease the overall performance.

ggael
  • When I run my Mandelbrot set drawing code using AVX and all cores, hyper-threading gives a big boost, much larger than I expected. But when I run my GEMM code it gives a much larger decrease in performance than I expected. I thought it could only make things slightly worse, not enough to worry about. Turns out that assumption was wrong. I'm still not sure why the results are so unstable, though. Using OpenMP, I get 45% efficiency consistently with hyper-threading (eight threads) and between 35% and 65% without it (four threads). –  Apr 19 '13 at 15:36

Unless I've missed something (always possible), your CPU has one clock shared by all its components, so if you measure its rate at 4.3 GHz (or whatever) then that's the rate of all the components for which it makes sense to figure out a rate. Imagine the chaos if this were not so, with some cores running at one rate and others at another; the shared components (e.g. memory access) would become unmanageable.

As to hyperthreading actually worsening the performance of your matrix multiplication, I'm not surprised. After all, hyperthreading is a poor man's parallelisation technique, duplicating instruction pipelines but not functional units. Once you've got your code screaming along, pushing your n*10^6 contiguous memory locations through the FPUs, a context switch in response to a pipeline stall isn't going to help much. At best the other pipeline will scream along for a while before another context switch robs you of useful clock cycles; at worst, all the careful arrangement of data in the memory hierarchy will be horribly mangled at each switch.

Hyperthreading is designed not for parallel numeric computational speed but for improving the performance of a much more general workload; we use general-purpose CPUs in high-performance computing not because we want hyperthreading but because all the specialist parallel numeric CPUs have gone the way of all flesh.

High Performance Mark
  • Thanks, yes, the clock should be the same for all cores; I misread some tables. I'm learning about turbo boost and multipliers now. Part of the problem was that I was using rdtsc to estimate the frequency in my code and comparing to CPUz, and they don't agree. But I don't think rdtsc accounts for turbo boost, so that explains it. My main question is whether I should use the value reported by CPUz (while running my code) in my calculation to estimate the max GFLOPs/s? –  Apr 19 '13 at 15:27

As a provider of multithreaded concurrency services, I have explored how hyperthreading affects performance under a variety of conditions. I have found that with software that limits its own high-utilization threads to no more than the number of physical processors available, the presence or absence of HT makes very little difference. Software that attempts to use more threads than that for heavy computational work is likely unaware that it is doing so, relying merely on the total processor count (which doubles under HT), and predictably runs more slowly. Perhaps the largest benefit that enabling HT provides is that you can max out all physical processors without bringing the rest of the system to a crawl; without HT, software often has to leave one CPU free to keep the host system running normally. Hyperthreads are just more switchable threads; they are not additional processors.

Chris Cochran