
I don't fully understand what a CPU clock speed such as 3.4 GHz means. I know it means 3.4 billion clock cycles per second.

So if the machine uses single-clock-cycle instructions, it can execute about 3.4 billion instructions per second.

But in a pipelined design, each instruction basically needs more cycles, although each cycle is shorter than the clock cycle of a single-cycle implementation.

But even though the pipeline has more throughput, the CPU can still only do 3.4 billion cycles per second. So it can only execute 3.4 billion / 5 instructions per second (if one instruction needs 5 cycles), which is less than the single-cycle implementation (3.4 > 3.4/5). What am I missing?

Does a CPU clock such as 3.4 GHz just refer to the pipelined cycle, not to the cycle of a single-cycle implementation?

allen

3 Answers

5

Pipelining

Pipelining doesn't involve cycles shorter than a single clock cycle--each pipeline stage still takes a full clock cycle. Here's how pipelining works:

[image: an automobile production line, the analogy used below]

We have a complicated task to do. We take that task and break it down into a number of stages, each of which is relatively simple to carry out. We study the amount of work in each stage to make sure each stage takes about the same amount of time as any other.

With a processor, we do roughly the same thing--but in this case, it's not "install these fourteen bolts", it's things like fetching and decoding instructions, reading operands, executing (often a couple of stages here), and writing back results.

Like the automotive production line, we provide each stage of the pipeline with a specialized set of tools for doing exactly (and only) what is needed at that stage. When we finish doing one stage of processing on a car/instruction, it moves along to the next stage, and this stage gets the next car/instruction to process.

In an ideal situation, the process works (roughly) like this:

[image: diagram of cars/instructions moving through the pipeline stages in parallel]

It took Ford about 12 hours to build one Model N car (the predecessor to the Model T). Thanks primarily to pipelining the production line, it took only about 2 and a half hours to build a Model T. More importantly, even though a Model T took 2.5 hours start to finish, that time was broken down into no fewer than 84 discrete steps, so when everything ran smoothly the production line as a whole could produce another car (about) every two minutes.
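
To make the arithmetic concrete, here is a small back-of-the-envelope check in Python, using only the figures above (2.5 hours, 84 steps) and assuming the steps are roughly balanced in length:

```python
# Back-of-the-envelope check of the Model T numbers above, assuming
# the 84 steps take roughly equal time.

total_minutes = 2.5 * 60      # one car takes 2.5 hours start to finish (latency)
num_stages = 84               # discrete steps on the production line

minutes_per_stage = total_minutes / num_stages
print(f"{minutes_per_stage:.1f} minutes per stage")          # ~1.8 minutes

# Once the line is full, a finished car rolls off every stage-time --
# roughly every two minutes -- even though each individual car still
# spends 2.5 hours on the line.
print(f"about {60 / minutes_per_stage:.0f} cars per hour")   # ~34
```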

That didn't always happen though. If one stage ran short of parts, the stages after it had to wait. If the pause lasted very long, it would back things up so the preceding stages had to wait too.

The same can happen in a processor pipeline. For example, when a branch happens, the processor may have to wait a while before the next instruction can be fetched. If an instruction needs an operand from memory, that can lead to a pause (a "pipeline bubble") as well.

To prevent pauses in his pipeline, Henry Ford hired people to study the stages, figure out how many of each kind of part would need to be on hand for each stage, and so on. I don't know for sure, but I think it's a fair guess that there were probably a few people designated to watch the supply of parts at different stations, and send somebody running to let a warehouse manager know if (for whatever reason) the supply of parts for a particular stage looked like it was running short so they'd need more soon.

Processors do a little of the same thing--they have things like branch predictors and prefetchers that attempt to figure out ahead of time what will be needed by the stream of instructions being executed, and try to ensure that everything is on hand when it's needed (with caches, for example, to temporarily store things that seem likely to be needed soon).

So, like the Model T, it takes some relatively long amount of time for each instruction to execute start to finish, but we get another product finished at much shorter intervals--ideally once a clock (but see my other answer--modern designs often execute more than one instruction per clock).
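
As a rough illustration of that latency-versus-throughput point, here is a minimal sketch of an idealized pipeline with no stalls. The 5-cycle depth comes from the question; the stage names are the classic textbook ones and are only an assumption, not a description of any particular CPU:

```python
# Minimal sketch of an idealized 5-stage pipeline with no stalls.
# Stage names are the classic textbook ones (fetch, decode, execute,
# memory, write-back); they're illustrative, not any specific CPU.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
instructions = [f"i{n}" for n in range(8)]
n_instr = len(instructions)

pipeline = [None] * len(STAGES)   # pipeline[s] = instruction currently in stage s
completed = 0

for cycle in range(1, n_instr + len(STAGES) + 1):
    if pipeline[-1] is not None:          # instruction leaving the last stage is done
        completed += 1
    pipeline = [None] + pipeline[:-1]     # everything moves forward one stage
    if instructions:
        pipeline[0] = instructions.pop(0) # fetch the next instruction
    print(f"cycle {cycle:2}: {pipeline}  completed={completed}")

# Each instruction is in flight for 5 cycles (latency), but once the
# pipeline is full, one instruction completes every cycle (throughput).
```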

Jerry Coffin
  • See also https://en.wikipedia.org/wiki/Instruction_pipelining#Illustrated_example. I would have just linked the right Wikipedia article with a brief explanation of the OP's misunderstanding for computer architecture 101 questions like this. The OP should consider himself lucky that you and Gabriel went to so much effort for this question! – Peter Cordes Jul 16 '16 at 08:46
3

A typical modern CPU can execute a number of unrelated instructions (those that don't depend on the same resources) concurrently.

To do that, it typically ends up with a basic structure vaguely like this:

[image: block diagram of decoders feeding a reorder buffer/scoreboard, then execution ports and units, then retirement units]

So, we have an instruction stream coming in on the left. We have three decoders, each of which can decode one instruction each clock cycle (though there may be limitations--for example, complex instructions may all have to pass through one particular decoder, while the other two handle only simple instructions).

From there, the instructions pass into a reorder buffer, which keeps a "scoreboard" of which resources are used by each instruction, and which resources are affected by that instruction (where a "resource" would typically be something like a CPU register or a flag in the flags register).

The circuitry then compares those scoreboards to determine dependencies. For example, if one instruction writes to register 0, and a later one reads from register 0, then those instructions must execute serially. At each clock, it tries to find the N oldest instructions that have no outstanding dependencies and can be sent for execution.
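
Here is a toy sketch of that dependency check. The instruction window, register numbers, and names are invented for illustration, and only read-after-write dependencies are modeled (real hardware also handles the false dependencies discussed in the comments below, via register renaming):

```python
# Toy model of the dependency check described above: each entry records
# which register it writes and which registers it reads, and each cycle
# we pick up to N of the oldest entries whose sources aren't produced by
# an older, unfinished instruction.

from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dest: int        # register this instruction writes
    srcs: tuple      # registers this instruction reads
    done: bool = False

# A small window of in-flight instructions, ordered oldest -> youngest.
window = [
    Instr("add", dest=0, srcs=(1, 2)),
    Instr("mul", dest=3, srcs=(0, 4)),   # reads r0 -> depends on "add"
    Instr("sub", dest=5, srcs=(6, 7)),   # independent
    Instr("xor", dest=8, srcs=(3, 5)),   # depends on "mul" and "sub"
]

def pick_ready(window, issue_width=3):
    """Return up to issue_width of the oldest instructions whose source
    registers aren't written by an older instruction that hasn't finished."""
    picked = []
    pending_writes = set()    # registers whose values aren't available yet
    for instr in window:
        if instr.done:
            continue
        if len(picked) < issue_width and not (set(instr.srcs) & pending_writes):
            picked.append(instr)
        pending_writes.add(instr.dest)
    return picked

print([i.name for i in pick_ready(window)])  # ['add', 'sub'] -- 'mul' must wait for 'add'
```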

There are then a number of independent execution units. Each of these is basically a "pure" function--it takes some inputs, carries out a specified transformation on them, and produces an output. This makes it easy to replicate them as needed, and have as many running in parallel as we want/can afford. Those are typically grouped, with one port going to each group. In each clock, we can send one instruction through each port to one of the execution units in that group. Once an instruction arrives at the execution unit, it may take more than one clock to finish execution.
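
A toy view of that structure, purely for illustration--the unit names, groupings, and latencies below are made up, not taken from any real design:

```python
# Execution units as "pure" functions grouped behind ports.
# Names, groupings, and latencies are invented for illustration.

memory = {0x10: 42}   # stand-in for data memory / cache

# Each port selects one of its units per clock; a unit maps inputs to an
# output and may take more than one cycle to produce it.
PORTS = {
    0: {"add": (lambda a, b: a + b, 1),    # (function, latency in cycles)
        "mul": (lambda a, b: a * b, 3)},
    1: {"add": (lambda a, b: a + b, 1),
        "shl": (lambda a, b: a << b, 1)},
    2: {"load": (lambda addr, _: memory[addr], 4)},
}

fn, latency = PORTS[0]["mul"]
print(f"mul: result={fn(6, 7)}, ready after {latency} cycles")
```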

Once those execute, we have a set of retirement units that take the results, and write them back to the registers in program order. Again we have multiple units so we can retire multiple instructions per clock.

Note: this drawing tries to be semi-realistic about the rough number of decoders, retirement units, and ports that it depicts, but what it shows is a general idea--different CPUs will have more or fewer specific resources. For almost any real CPU, though, the number of decoded instructions shown waiting in the scoreboard is far too low--a realistic number would be more like 50 instructions.

In any case, actual execution of instructions is one of the hardest parts of this to measure or reason about. The number of ports gives us a hard upper limit on the number of instructions that can start executing in any given clock. The numbers of decoders and retirement units give upper limits on the number of instructions that can be started/finished per clock. The execution itself...well, there are a lot of execution units, and each one (at least potentially) takes a different number of clocks to execute an instruction.

With the design as shown above, you'd have a hard upper limit of three instructions per clock. That's the most you can decode or retire. With a different design, that could obviously go up or down (e.g., with 4 decoders, 4 ports and 4 retirement units, the upper limit could go up to 4).
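
In other words, the sustainable instructions-per-clock (IPC) is bounded by the narrowest point in the machine. A trivial sketch of that bound, with the decode and retire widths taken from the design above (the port count here is an assumption):

```python
# Sustained IPC can never exceed the narrowest stage of the machine.
# Decode/retire widths match the design above; the port count is assumed.

decode_width = 3      # instructions decoded per clock
issue_ports  = 4      # instructions dispatched to execution units per clock (assumed)
retire_width = 3      # instructions retired per clock

print(min(decode_width, issue_ports, retire_width))   # 3 -- "the most you can decode or retire"

# Widening everything to 4 raises the bound to 4, as noted above.
print(min(4, 4, 4))                                   # 4
```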

Realistically, with that design you wouldn't normally expect to see three instructions execute in most clock cycles. There are enough dependencies between instructions that you'd probably expect closer to 2 as a long term average (and much more likely a little less than 2). Increasing the available resources (more decoders, more retirement units, etc.) will rarely help that a whole lot--you might get to an average of three instructions per clock, but hoping for four is probably unrealistic.

Jerry Coffin
  • 4
    In other words: It's complicated. – Mysticial Jul 15 '16 at 22:20
  • 1
    Note that modern out-of-order designs use [register renaming](https://en.wikipedia.org/wiki/Tomasulo_algorithm) instead of scoreboarding, to remove false dependencies (like write-after-read, where a new dependency chain uses the same register as some independent older instructions). – Peter Cordes Jul 15 '16 at 23:29
  • @PeterCordes: Yes--there's obviously a *lot* of complexity that I've glossed over here. Register renaming removes false dependencies, but doesn't remove the necessity for tracking dependencies, just (hopefully) reduces the number of necessities. – Jerry Coffin Jul 15 '16 at 23:44
  • @JerryCoffin: yup. Obviously the full general question is way too broad to answer. That was just the main thing I had to add to your attempt at saying something useful. I think Gabriel's answer is a better answer to the OP's direct question about pipelines, but this is useful to give an idea of how much more complicated IPC is with out-of-order CPUs. – Peter Cordes Jul 15 '16 at 23:49
  • 1
To add to what Peter wrote, this answer focuses on how [scoreboarding](https://en.wikipedia.org/wiki/Scoreboarding) works rather than how a modern out-of-order CPU operates. The term "reorder buffer" is not typically used in this type of design. It's a minor point, but something that may not be completely technically correct about this answer. Though of course no one should expect all the details about how a CPU works to be in any answer to this question. – Gabriel Southern Jul 16 '16 at 02:04
  • 1
    @GabrielSouthern: Reorder Buffer most certainly *is* pretty routinely used in describing current designs. For example, Agner Fog's [The microarchitecture of Intel, AMD and VIA CPUs](http://www.agner.org/optimize/microarchitecture.pdf) gives 26 hits for "reorder buffer", and yes, what it's referring to (in a lot more detail) is the same basic piece of the microarchitecture discussed here. A few books split things into a reservation station and a reorder buffer, where the RS is roughly where this shows the ROB, and the ROB is after the execution units. – Jerry Coffin Jul 16 '16 at 03:17
@Gabriel and Jerry: In Intel terminology for their P6 and SnB-family uarches: the ROB tracks uops for their entire journey through the out-of-order part of the core, from issue/rename to retirement. The RS, aka the scheduler, only tracks uops that have issued but not executed yet, i.e. that are waiting for availability of their operands + their execution port. Once a uop is dispatched to its execution port, it's dropped from the RS. (uop->port choices are made at issue/rename time, so AFAIK it's possible to have an `add` uop not run even though there's a free ALU port.) – Peter Cordes Jul 16 '16 at 08:04
  • As always, things are more complicated than that: the RS tracks unfused uops, while the ROB holds fused uops (so e.g. an ALU + memory uop only take one slot in the ROB). [This is why micro-fused loads can still execute ahead of the other operand for the ALU uop being ready, and stuff like that](http://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp?noredirect=1&lq=1#comment63873807_21151780) – Peter Cordes Jul 16 '16 at 08:07
  • @JerryCoffin yes I agree that "reorder buffer" or ROB is widely used terminology for most modern out-of-order CPUs. My point was that "scoreboarding" is an alternative technique that is simpler and does not use an ROB (this technique was used in ARM Cortex A8 for example). I thought what you described more closely matches this, including the fact that you use the term "scoreboard" in your text and diagram. My point was simply that scoreboarding doesn't use an ROB, but I do agree that ROB is widely used in most high performance processors today. – Gabriel Southern Jul 16 '16 at 16:22
  • @GabrielSouthern: When you use an ROB, you still have to keep track of what resources are used/affected by each instruction. If you know of a reasonably widely used term for that other than scoreboarding when it's used in conjunction with an ROB, I'll be happy to listen to suggestions. – Jerry Coffin Jul 16 '16 at 17:02
  • @JerryCoffin the more commonly used term is "reservation station" (https://en.wikipedia.org/wiki/Reservation_station). – Gabriel Southern Jul 16 '16 at 18:00
  • @GabrielSouthern: A reservation station is slightly different. When you have one, the RS tracks instructions before execution, and the ROB afterwards. Even so, that's just a name for the box you put around some functionality. I'm talking about the functionality *itself*--tracking what resources are used/affected by each instruction/uOP. "Reservation Station" doesn't cover that at all. – Jerry Coffin Jul 16 '16 at 18:19
  • @JerryCoffin: that's true. My original point was that "scoreboarding" refers to a specific technique used for scheduling and executing instructions, and that scoreboarding does not use a ROB. However, I don't think there is a single unifying term that is equivalent to "scoreboarding" when describing how an out-of-order processor works since the logic is split among a variety of structures and stages. – Gabriel Southern Jul 16 '16 at 20:48
1

As others have noted, the full details of how a modern CPU operates are complicated. But part of your question has a simple answer:

Does a CPU clock such as 3.4 GHz just refer to the pipelined cycle, not to the cycle of a single-cycle implementation?

The clock frequency of a CPU refers to how many clock cycles occur per second. The clock signal is not divided into smaller pipelined segments; the purpose of pipelining is to allow for higher clock frequencies. So 3.4 GHz refers to the number of times per second that a single pipeline stage can perform whatever work it needs to do when executing an instruction. The total work of executing an instruction is done over multiple cycles, each of which is spent in a different pipeline stage.
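
A quick back-of-the-envelope calculation of what that means for the example numbers (the 5-stage depth is taken from the question, just for illustration):

```python
# What 3.4 GHz means for the time budget of one pipeline stage.

frequency_hz = 3.4e9                      # 3.4 billion cycles per second
cycle_time_ns = 1 / frequency_hz * 1e9
print(f"{cycle_time_ns:.3f} ns per clock cycle")            # ~0.294 ns

# A hypothetical non-pipelined design that had to do all the work of,
# say, 5 stages within a single cycle would need a cycle ~5x as long,
# i.e. a much lower clock frequency.
print(f"{1 / (5 * cycle_time_ns * 1e-9) / 1e9:.2f} GHz")    # ~0.68 GHz
```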

Your question also shows some misconceptions about how pipelining works:

But even though the pipeline has more throughput, the CPU can still only do 3.4 billion cycles per second. So it can only execute 3.4 billion / 5 instructions per second (if one instruction needs 5 cycles), which is less than the single-cycle implementation (3.4 > 3.4/5). What am I missing?

In the simple case, the throughput of a single-cycle CPU and a pipelined CPU is the same. The latency of the pipelined CPU is higher because it requires more cycles (i.e. 5 in your example) to execute a single instruction. But once the pipeline is full, the throughput can be the same as for a single-cycle, non-pipelined CPU. So in the simple case, using your example, a single-cycle CPU could execute 3.4 billion instructions in 1 second, while the 5-stage pipelined CPU could execute 3.4 billion minus 5 instructions in 1 second (the few cycles lost are the ones spent filling the pipeline). Subtracting 5 from 3.4 billion is a negligible difference, whereas dividing by 5 would be very significant.
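
Putting rough numbers on that comparison (ignoring stalls, and treating the lost cycles as just the initial pipeline fill):

```python
# Rough numbers for the comparison above (no stalls assumed).

cycles_per_second = 3.4e9   # 3.4 billion clock cycles in one second
depth = 5                   # pipeline stages in the example

single_cycle = cycles_per_second        # one instruction completes per cycle
pipelined = cycles_per_second - depth   # loses only roughly the initial fill time

print(f"single-cycle CPU: {single_cycle:,.0f} instructions")
print(f"pipelined CPU:    {pipelined:,.0f} instructions")
print(f"wrong intuition (divide by 5): {cycles_per_second / depth:,.0f} instructions")
```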

A couple of other things to note: the simple case I described isn't entirely realistic, because dependencies between instructions can require pipeline stalls. And most modern CPUs can execute more than one instruction per cycle.

Gabriel Southern
  • 1
    The major point of pipelining is that you can't build a 3.4GHz non-pipelined CPU, because there's too much work to do in a single cycle. So pipelining lets us raise the clock speed while maintaining the same max throughput (but lower throughput for code that doesn't pipeline well). – Peter Cordes Jul 15 '16 at 23:23
  • Also worth noting that you can't build a 3.4GHz CPU with a 5-stage pipeline either. That's a good baseline for understanding pipelining, but a modern high performance CPU has roughly 15 to 30 stages. – Gabriel Southern Jul 15 '16 at 23:42
  • Yup. Less aggressive designs (ARM) still get above 1GHz with under 10 pipeline stages, I think. I tried to find some concrete examples, like this [ARM1156T2-S](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0338g/I1002919.html) 9 stage pipeline, but it's only a [600MHz CPU](https://www.arm.com/products/processors/classic/arm11/arm1156.php). But yes, the higher you ramp up the clock, the fewer gate-delays and ~speed-of-light transmission-line propagation delays you can have in each pipeline stage. Higher clock speeds thus mean higher branch mispredict penalties (in cycles). – Peter Cordes Jul 15 '16 at 23:57
@PeterCordes: The Itanium 2 ran at 1 GHz with only 8 pipeline stages, and that was on a 180 nm process. A newer process would reduce gate delay enough that I'd bet you could get to around a GHz with only 5 stages, at least for a simple instruction set (not x86/x64). – Jerry Coffin Jul 16 '16 at 06:11