
Is it possible to query the number of execution units/ports per core and similar information on Intel CPUs?

I have an assembly program and noticed that its performance differs quite a bit between CPUs. For example, on a Core i5 4570, some functions consistently take 25% more cycles to complete than on a Core i7 4970HQ. Both are Haswell-based, from the same generation. No memory movement is involved in the part of the program being benchmarked, so I suspect the difference comes from details such as the number of execution units, the number of ports, etc. The benchmark measures single-core CPU cycles, so frequency, HT, etc. should not come into play.

Am I right to assume this explains the performance difference? If so, where can I find such information for specific CPUs, and is it possible to query it dynamically? If it is, I could dispatch dynamically based on that information, distribute uops more evenly across ports, and use similar techniques to optimize the program for multiple CPUs.

Peter Cordes
Yan Zhou
  • Why don't you provide a link to your benchmark, or post the full code so we can try it out ourselves? Otherwise we are just taking stabs in the dark. – BeeOnRope Nov 15 '16 at 03:40

1 Answer


Did you time reference cycles (RDTSC) instead of core clock cycles (with perf counters)? That would explain your observations.

Turbo makes a big difference, and the ratio between max turbo and max sustained / rated clock speed (i.e. reference cycle tick rate) is different on different CPUs. e.g. see my answer on this related question

The lower the CPU's TDP, the bigger the ratio between sustained and peak clock speed. The Haswell Wikipedia article has tables:

  • 84W desktop i5 4570: sustained 3.2GHz = RDTSC frequency, max turbo 3.6GHz (the speed the core was probably actually running for most of your benchmark, if it had time to go up from low-power idle speed).

  • 47W laptop i7-4960HQ: 2.6GHz sustained = RDTSC frequency vs. 3.8GHz max turbo.
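To get a feel for how much of the gap this could explain, here is a rough back-of-the-envelope sketch in Python. It assumes both cores really ran at max turbo for the whole benchmark (which may not be true), and the function name is only for illustration.

```python
def tsc_ticks_per_core_cycle(rated_ghz, actual_ghz):
    """TSC (reference) ticks that elapse during one core clock cycle."""
    return rated_ghz / actual_ghz

# Clock speeds from the list above. Running at max turbo the whole
# time is an assumption, not something that was measured.
i5_4570 = tsc_ticks_per_core_cycle(rated_ghz=3.2, actual_ghz=3.6)
i7_4960hq = tsc_ticks_per_core_cycle(rated_ghz=2.6, actual_ghz=3.8)

print(f"i5 4570:   {i5_4570:.3f} TSC ticks per core cycle")
print(f"i7-4960HQ: {i7_4960hq:.3f} TSC ticks per core cycle")
print(f"i5 / i7 ratio: {i5_4570 / i7_4960hq:.2f}")
```

Under those assumptions, identical code taking an identical number of core clock cycles would show roughly 30% more RDTSC ticks on the i5 than on the i7, which is in the same ballpark as the 25% difference in the question.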

Time your code with performance counters, and look at the "core clock cycles" count. (And lots of other neat stuff).
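As a minimal sketch of what that could look like on Linux, the Python below wraps `perf stat` in its CSV mode (`-x,`) and pulls out the `cycles` (core clock) and `ref-cycles` (reference) counts. It assumes `perf` is installed and that you're allowed to use the counters; the helper names `parse_perf_csv` and `count_cycles` are made up for this example.

```python
import subprocess

def parse_perf_csv(text):
    """Parse `perf stat -x,` output into {event_name: count}."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        # CSV layout starts with: counter value, unit, event name, ...
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank lines and "<not counted>" / "<not supported>"
        # perf may append a modifier such as ":u" to the event name; strip it.
        event = fields[2].split(":")[0]
        counts[event] = int(fields[0])
    return counts

def count_cycles(cmd):
    """Run cmd under perf stat; return core and reference cycle counts."""
    perf = ["perf", "stat", "-x,", "-e", "cycles,ref-cycles", "--"] + list(cmd)
    # perf stat writes its counts to stderr. Anything the benchmarked
    # program prints to stderr is mixed in and skipped by the parser.
    result = subprocess.run(perf, stdout=subprocess.DEVNULL,
                            stderr=subprocess.PIPE, text=True)
    return parse_perf_csv(result.stderr)

# Example (needs perf and a binary to run):
#   counts = count_cycles(["./a.out"])
#   print(counts["cycles"], counts["ref-cycles"])
```

If turbo is the explanation, `cycles` should be about the same on both machines for the same code, while `ref-cycles` will differ.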


Every Haswell core is identical from Core-M 5Watt CPUs to high-power quad core to 18-core Xeon (which actually has a per-core power-budget more like a laptop CPU); it's only the L3 caches, number of cores (and interconnect), and support or not for HT and/or Turbo that differ. Basically everything outside the cores themselves can be different, including the GPU. They don't disable execution ports, and even the L1/L2 caches are identical. I think disabling execution ports would require significant redesigns in the out-of-order scheduler and stuff like that.

More importantly, almost every port has at least one execution unit that isn't found on any other port: p0 has the divider, p1 has the integer multiply unit, p5 has the shuffle unit, and p6 is the only port that can execute predicted-taken branches. The exception is p2 and p3, which are identical load ports (and can also handle store-address uops)...

See Agner Fog's microarch pdf for more about Haswell internals, and also David Kanter's writeup with diagrams of the different blocks.

(However, it's not strictly true that the entire core is identical: Haswell Pentium/Celeron CPUs don't support AVX/AVX2, or BMI/BMI2. I think they do that by disabling decode of VEX prefixes in the decoders. This is still the case for Skylake Pentiums/Celerons, so thanks Intel for delaying the time when we can assume support for new instruction sets. Presumably they do this so CPUs with defects in only the upper or lower half of their vector execution units can still be sold as Celeron or Pentium, just like CPUs with a defect in some of their L3 can be sold as i5 instead of i7.)
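So if you do want runtime dispatch, instruction-set support is the thing worth checking, not port counts. In the assembly program itself you would use the CPUID instruction for that. As a rough illustration only, this Python sketch reads the flags that x86 Linux has already decoded into `/proc/cpuinfo`. `pick_code_path` and the path names it returns are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Exact match, so lines like "vmx flags" are ignored.
        if key.strip() == "flags":
            return set(value.split())
    return set()

def pick_code_path(flags):
    """Hypothetical dispatcher: choose a code path by ISA support."""
    if {"avx2", "bmi2"} <= flags:
        return "avx2"
    if "avx" in flags:
        return "avx"
    return "sse2"  # baseline on x86-64

def detect_code_path(path="/proc/cpuinfo"):
    """Read the flags of the first listed CPU (x86 Linux only)."""
    with open(path) as f:
        return pick_code_path(cpu_flags(f.read()))
```

On a Haswell i5/i7 this would be expected to pick the `"avx2"` path, and on a Haswell Pentium/Celeron the `"sse2"` one, given the missing AVX support described above.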

Peter Cordes
  • Thanks for the detailed answer. It should be 4960HQ, it was a typo. This is the model number reported by CPUID. I timed with RDTSC. More precisely, on entry, the cycle count is read with CPUID as a barrier followed by RDTSC. On exit, the count is read with RDTSCP followed by CPUID as a barrier. – Yan Zhou Nov 14 '16 at 03:45
  • @YanZhou: `rdtsc` hasn't measured in cycles for several CPU generations – Ben Voigt Nov 14 '16 at 03:56
  • After some trials, I think the easiest way to get an accurate cycle count when optimizing code is to simply disable turbo boost and possibly other non-deterministic CPU behaviors. I tried PMC and similar methods, but sometimes they just don't give reliable results. Besides, I did not find any way to start counters without privileged access. – Yan Zhou Nov 15 '16 at 03:52
  • @YanZhou: RDTSC does work if you disable turbo and all that, but usually that's overkill. On Linux, I get good repeatability with `perf stat ./a.out`. Linux includes an API for enabling perf counters (and timing mostly just a single process even across context switches). If you do it right, you can get extremely accurate and repeatable results. (e.g. http://stackoverflow.com/questions/39311872/is-performance-reduced-when-executing-loops-whose-uop-count-is-not-a-multiple-of). – Peter Cordes Nov 15 '16 at 04:11
  • A few people have written Linux kernel modules to allow user-space processes to control perf counters, like [libpfc](https://github.com/obilaniu/libpfc). – Peter Cordes Nov 15 '16 at 04:12
  • @YanZhou: Of course, the most useful thing about using perf counters is that you can look at counters other than clock cycles, to figure out whether you're hitting a frontend bottleneck or a bottleneck on a specific execution port, or what. And if applicable, to find branch-mispredict hotspots, and cache misses. – Peter Cordes Nov 15 '16 at 04:30