
I'm looking at a chart that shows that, in reality, increasing the core count on a CPU usually raises the CPI for most instructions, and also increases the total number of instructions the program executes. Why does this happen?

From my understanding, CPI should only increase when the clock frequency increases, so the CPI increase here doesn't make much sense to me.

Pol

1 Answer


What chart? What factors are they holding constant while increasing core count? Perhaps total transistor budget, so each core has to be simpler to have more cores?

Making a single core larger has diminishing returns, but building more cores has linear returns for embarrassingly parallel problems; hence Xeon Phi having lots of simple cores, and GPUs being very simple pipelines.

But CPUs that also care about single-thread performance / latency (instead of just throughput) will push into those diminishing returns and build wider cores. Many problems that we run on CPUs are not trivial to parallelize, so lots of weak cores is worse than fewer faster cores. For a given problem size, the more threads you have, the more of its total time each thread spends communicating with other threads (and maybe waiting for data from them).
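One way to quantify those diminishing returns is Amdahl's law. This is a toy sketch with made-up numbers (not a model of any particular CPU or chart): a fraction `p` of the runtime parallelizes perfectly, and the rest stays serial (synchronization, inter-thread communication, etc.).

```python
# Toy Amdahl's-law calculation (illustrative numbers only): speedup with
# n cores when a fraction p of the runtime parallelizes perfectly and the
# remaining (1 - p) stays serial.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallel, 64 weak cores cap out far below 64x:
print(round(amdahl_speedup(0.95, 64), 1))  # 15.4
```

That serial/communication fraction is exactly why fewer, faster cores can beat many weak ones on such workloads.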


If you do keep each core identical when adding more cores, their CPI generally stays the same when running the same code. For example, SPECint_rate scales nearly linearly with the number of cores for current Intel/AMD CPUs (which do scale up by adding more of the same cores).
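To make that concrete, here's the "iron law" of performance (time = instruction count × CPI / frequency) applied per core, with invented numbers for illustration: if CPI stays constant and the instructions split evenly across identical cores, the speedup is exactly linear.

```python
# "Iron law" sketch: execution time = instructions * CPI / clock frequency.
# All numbers below are made up for illustration.
def exec_time(insns, cpi, freq_hz):
    return insns * cpi / freq_hz

total_insns = 1e9
cpi = 1.2        # constant CPI: same cores, same code
freq = 3e9       # 3 GHz

t1 = exec_time(total_insns, cpi, freq)      # one core does all the work
t4 = exec_time(total_insns / 4, cpi, freq)  # ideal split over 4 identical cores
print(round(t1 / t4, 6))  # 4.0: linear speedup
```

So a chart showing CPI *rising* with core count must be modeling something beyond just duplicating identical cores.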

So that must not be what your chart is talking about. You'll need to clarify the question if you want a more specific answer.

You don't get perfectly linear scaling because cores do compete with each other for memory bandwidth, and space in the shared last-level cache. (Although most current designs scale up the size of last-level cache with the number of cores. e.g. AMD Zen has clusters of 4 cores sharing 8MiB of L3 that's private to those cores. Intel uses a large shared L3 that has a slice of L3 with each core, so the L3 per core is about the same.)
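A crude way to picture that bandwidth competition (a toy step-function model with invented numbers, not measurements of any real chip): each core demands a fixed amount of memory bandwidth, and the chip supplies a fixed total.

```python
# Toy shared-bandwidth model (invented numbers, for illustration only):
# scaling is linear until the cores' combined demand saturates the memory
# system; past that point a bandwidth-bound workload's speedup goes flat.
def effective_speedup(n_cores, per_core_gbs, total_gbs):
    if n_cores * per_core_gbs <= total_gbs:
        return float(n_cores)        # not yet bandwidth-bound: linear scaling
    return total_gbs / per_core_gbs  # bandwidth-bound: speedup plateaus

for n in (1, 2, 4, 8, 16):
    print(n, effective_speedup(n, 10.0, 50.0))  # plateaus at 5.0 from n=8 on
```

Real chips degrade more gradually than this step function (queuing at the memory controllers, DRAM page conflicts, and so on), but the plateau is the same qualitative effect.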

But more cores also means a more complex interconnect to wire them all together and to the memory controllers. Intel many-core Xeon CPUs notably have worse single-thread bandwidth than quad-core "client" chips of the same microarchitecture, even though the cores are the same in both. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

Peter Cordes
  • It's not a specific CPU model. Everything is the same, except the core count. It shows the ideal case first (2x cores -> each core does 1/2 of the instructions, 3x cores -> each core does 1/3 of the instructions, etc.), with the CPIs remaining the same. Afterwards it shows the 'realistic' case, where the CPIs change with the higher core count and the total number of instructions the program runs increases. So could it be that the program isn't optimized for parallel operations, or is there a general rule that leads to higher core counts causing higher CPIs and instruction counts? – Pol May 07 '20 at 11:45
  • @pol: It sounds like they're talking about imperfect scaling with number of cores, i.e. not trivially parallelizable problems. I updated my answer. – Peter Cordes May 07 '20 at 12:03
  • @pol: Anyway, if you want a more specific answer, ask a more specific question. e.g. link the chart you're talking about. (And BTW, *everything* can't be the same if you're including factors like interconnect hops (ring bus or mesh size), and considerations like die area or transistor count. It sounds like you're saying they're holding the design of each core constant, and adding more of them, like the difference between a Skylake i3 dual core and a Skylake i7 quad core, and letting transistor count and ring bus hops increase. – Peter Cordes May 07 '20 at 12:09
  • It's not in English, otherwise I'd have posted it. But thanks for your answer! – Pol May 07 '20 at 12:12
  • @pol: You could still link it in your question, if it's public material on the web (not part of a textbook or something). It might be better than nothing, especially if there's any text for Google translate to work on. Or at least put some of the detail from your first comment into your question. – Peter Cordes May 07 '20 at 12:37
  • 1
    (Also, just to add to it, the 'competition' between the cores is what I remember my professor talking about to explain this) – Pol May 07 '20 at 12:38
  • @pol: Note that that completely depends on the workload. Most designs give each core its own private L1d and L1i caches, and these days usually a medium-sized L2 cache as well. A purely ALU workload where you expect no cache misses for instruction or data cache can scale perfectly, e.g. testing the [Collatz conjecture](//stackoverflow.com/questions/40354978/c-code-for-testing-the-collatz-conjecture-faster-than-hand-written-assembly), or [Prime95](https://www.mersenne.org/download/) for numbers small enough to fit in private caches. – Peter Cordes May 07 '20 at 12:43