47

This may look like too simple a question to ask, but I'm asking it after going through a few PPTs on both.

Both methods increase instruction throughput, and superscaling almost always makes use of pipelining as well. Superscaling has more than one execution unit, and so does pipelining -- or am I wrong here?

– Alex Xander
  • I removed all the comments that weren't on-topic to the question. That didn't leave any. Please keep it civil, people. – Marc Gravell Nov 01 '09 at 08:56
  • Good idea. Otherwise a perfectly good question would have been closed as "subjective and argumentative"! – RCIX Nov 01 '09 at 09:48

5 Answers

69

Superscalar design involves the processor being able to issue multiple instructions in a single clock, with redundant facilities to execute an instruction. We're talking about within a single core, mind you -- multicore processing is different.

Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different "phases" each clock.

They're almost always used together. This image from Wikipedia shows both concepts in use, as these concepts are best explained graphically:

[Image: superscalar/pipelining in use]

Here, two instructions are being executed at a time in a five-stage pipeline.


To break it down further, given your recent edit:

In the example above, an instruction goes through 5 stages to be "performed". These are IF (instruction fetch), ID (instruction decode), EX (execute), MEM (memory access), and WB (write the result back to the register file).

In a very simple processor design, a different stage would complete every clock, so a single instruction would proceed as:

  1. IF
  2. ID
  3. EX
  4. MEM
  5. WB

That would complete one instruction every five clocks. If we then add a redundant execution unit and introduce superscalar design, we'd have this for two instructions A and B:

  1. IF(A) IF(B)
  2. ID(A) ID(B)
  3. EX(A) EX(B)
  4. MEM(A) MEM(B)
  5. WB(A) WB(B)

Two instructions in five clocks -- a theoretical maximum gain of 100%.

Pipelining allows the stages to work on different instructions simultaneously, so combining the two we'd end up with something like this (for ten instructions A through J):

  1. IF(A) IF(B)
  2. ID(A) ID(B) IF(C) IF(D)
  3. EX(A) EX(B) ID(C) ID(D) IF(E) IF(F)
  4. MEM(A) MEM(B) EX(C) EX(D) ID(E) ID(F) IF(G) IF(H)
  5. WB(A) WB(B) MEM(C) MEM(D) EX(E) EX(F) ID(G) ID(H) IF(I) IF(J)
  6. WB(C) WB(D) MEM(E) MEM(F) EX(G) EX(H) ID(I) ID(J)
  7. WB(E) WB(F) MEM(G) MEM(H) EX(I) EX(J)
  8. WB(G) WB(H) MEM(I) MEM(J)
  9. WB(I) WB(J)

In nine clocks, we've executed ten instructions -- you can see where pipelining really moves things along. And that is an explanation of the example graphic, not how it's actually implemented in the field (that's black magic).
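
If you want to play with these tables yourself, here is a minimal Python sketch of an idealized, stall-free pipeline (my own illustration; stage names and the two-wide issue match the example above):

    def schedule(n_instr, stages=("IF", "ID", "EX", "MEM", "WB"), width=2):
        """Print the cycle-by-cycle occupancy of an idealized, stall-free,
        `width`-wide pipeline: instruction i issues in cycle i // width and
        occupies stage s during cycle i // width + s."""
        n_cycles = (n_instr - 1) // width + len(stages)
        for cycle in range(n_cycles):
            busy = []
            for i in range(n_instr):
                s = cycle - i // width  # stage instruction i occupies this cycle
                if 0 <= s < len(stages):
                    busy.append(f"{stages[s]}({chr(ord('A') + i)})")
            print(f"{cycle + 1}. " + " ".join(busy))

    schedule(10)   # ten instructions A..J in nine clocks, as in the table above
    # schedule(2) reproduces the superscalar-only table (two instructions, five
    # clocks); schedule(1, width=1) reproduces the one-instruction-at-a-time case

It assumes no stalls or hazards at all, which is exactly why the real thing is black magic.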

The Wikipedia articles for Superscalar and Instruction pipeline are pretty good.

– Jed Smith
  • They are used together primarily because both techniques are available, both are good ideas, and modern process manufacturing technology makes it possible. Notable chips that are pipelined but not superscalar include the Intel i486 and some of the early ARM and MIPS CPUs, as well as the first Alpha processor. – Paul Hsieh Nov 01 '09 at 08:20
  • The first "execute" should be an "issue", and then you can use "execute" instead of "do". That's what that phase is called in the Hennessy & Patterson book. – yeyeyerman Nov 01 '09 at 08:35
  • @yeyeyerman: Thank you for the feedback; I have revised the answer. I haven't had much exposure to texts on the material, so forgive the oversight. – Jed Smith Nov 01 '09 at 08:37
  • _redundant_ means "superfluous", "unnecessary", or "not strictly necessary to functioning but included in case of failure in another component." But the functional units on a superscalar don't even need to provide overlapping functionality (for example in the case where you have separate branch unit, ALU and memory unit.) – Wandering Logic Jul 03 '14 at 14:51
  • If I get this correctly, then that means that the Wikipedia example is doing vector processing using pipelining, when it could issue different instructions per cycle? I am talking about the two instruction execution units. See here - http://imgur.com/gPsVAWY – goelakash Apr 17 '15 at 07:24
  • What, then, is the difference between a single core superscalar processor and a multicore scalar processor? – lodhb May 18 '15 at 09:09
37

A long time ago, CPUs executed only one machine instruction at a time. Only when it was completely finished did the CPU fetch the next instruction from memory (or, later, the instruction cache).

Eventually, someone noticed that this meant most of a CPU did nothing most of the time, since there were several execution subunits (such as the instruction decoder, the integer arithmetic unit, the FP arithmetic unit, etc.) and executing an instruction kept only one of them busy at a time.

Thus, "simple" pipelining was born: once one instruction was done decoding and went on towards the next execution subunit, why not already fetch and decode the next instruction? If you had 10 such "stages", then by having each stage process a different instruction you could theoretically increase the instruction throughput tenfold without increasing the CPU clock at all! Of course, this only works flawlessly when there are no conditional jumps in the code (this led to a lot of extra effort to handle conditional jumps specially).

Later, with Moore's law continuing to be correct for longer than expected, CPU makers found themselves with ever more transistors to make use of and thought "why have only one of each execution subunit?". Thus, superscalar CPUs with multiple execution subunits able to do the same thing in parallel were born, and CPU designs became much, much more complex to distribute instructions across these fully parallel units while ensuring the results were the same as if the instructions had been executed sequentially.

– Michael Borgwardt
  • It's answers like these that should end the debate going on about the value of such questions on SO. – Alex Xander Nov 01 '09 at 08:40
  • A long time ago, in a die far, far away? – Jed Smith Nov 01 '09 at 08:50
  • I'd vote this up, but the description of superscalar CPUs is incorrect. You're describing a vector processor; superscalar processors are subtly different. – Wedge Nov 01 '09 at 09:10
  • Now that calls for another question -- what is the difference between vector and superscalar processors? – AJ. Nov 01 '09 at 09:24
  • @Ankit vector processing is lock-step and explicit, one instruction with multiple data. Add these 5 numbers to those 5 numbers, etc. To take advantage of different degrees of parallelism (e.g. 12 simultaneous additions) you'd need to recompile and/or redesign your code. A superscalar processor determines on its own how to run code in parallel, so it can run unmodified scalar code or code written for other superscalar processors with different degrees of parallelism. – Wedge Nov 01 '09 at 09:46
  • I don't think my description contains any statement that is false for superscalar CPUs - I just didn't want to go into too much detail. – Michael Borgwardt Nov 01 '09 at 10:05
  • @MichaelBorgwardt: I think there is something wrong in the statement "If you had 10 such "stages", then by having each stage process a different instruction you could theoretically increase the instruction throughput tenfold without increasing the CPU clock at all!" If you don't change (reduce by nearly 10x) your CPU clock from the non-pipelined to the pipelined version, how can the pipeline benefit overall execution time? It only benefits when each stage runs at a clock 10x faster!! – nurabha Nov 02 '14 at 19:36
  • @nurabha: you seem to mistakenly believe that the CPU clock is the rate at which entire instructions are executed. That is not true. It's the rate at which certain CPU-internal operations are executed that are typically only a *part* of executing an instruction, i.e. the stages mentioned above. If you have 10 stages that each take 1 CPU cycle (a simplification), then without pipelining it takes 10 cycles to execute one instruction, and only then the next one can start executing. If you do it pipelined, each instruction still takes 10 cycles, but you can have 10 running at the same time. – Michael Borgwardt Nov 03 '14 at 08:54
  • @MichaelBorgwardt: Here is some math. Assume f = 100 MHz, or Tclk = 10 ns, in both the non-pipelined and pipelined processors. Compare total execution time and throughput for 100 instructions (similar, non-dependent instructions). In the non-pipelined processor, total execution time = 100 instructions x 10 ns = 1000 ns ~ 100 clock cycles; throughput IPC = 100 instr / 100 cc = 1. In the pipelined processor, total execution time = 1 instr. x 10 stages x 10 ns + 99 instr. x 1 stage x 10 ns = 1090 ns ~ 109 clock cycles; throughput IPC = 100 instr / 109 cc <= 1. – nurabha Nov 10 '14 at 07:44
  • @MichaelBorgwardt: Maybe I am still incorrect, but please explain how to improve IPC or execution time without reducing the clock cycle time? The clock cycle time can't be reduced in a non-pipelined processor beyond a point, because it is limited by the delay of the critical path in the combinational datapath. When I pipeline this datapath, I am actually cutting the combinational datapath into 10 stages, and each stage can be run either at the same frequency (1x) as before or at a higher frequency (<10x). If I run each combinational stage at the same clock frequency as before, I get no advantage from pipelining. – nurabha Nov 10 '14 at 07:55
  • @nurabha: Did you even read what I wrote? You are still assuming that a non-pipelined processor can execute 1 instruction per clock cycle. It can't, not even close. Assume rather that it can execute one *stage* per clock cycle. So that's a throughput of 0.1 IPC because each instruction passes through one stage after another while the other stages are idle. Using pipelining, you can keep all stages occupied and achieve a throughput of nearly 1 IPC. – Michael Borgwardt Nov 10 '14 at 08:52
  • @MichaelBorgwardt: I think that for the non-pipelined processor you are assuming a multi-cycle combinational datapath, while I am assuming a single-cycle datapath. So for me, a non-pipelined processor is a single-stage processor whose clock frequency is decided by the slowest instruction. If I assume a multi-cycle non-pipelined processor, then the clock is decided by the slowest stage. This brings me to the question of why I would make the combinational path of a single-stage processor a multi-cycle path without pipelining it... what is the advantage? – nurabha Nov 10 '14 at 11:31
  • @nurabha: I'm not sure I understand your terminology completely, but the point of having multiple stages without pipelining is that not all instructions use all stages (most obviously memory access), and stages can take more than one cycle, even a variable number of cycles. Thus, you can execute some instructions much more quickly than others. Of course, pipelining is then an obvious optimization, but it adds complexity (especially in handling conditional jumps) which is why you might not want to do it. – Michael Borgwardt Nov 10 '14 at 11:59
  • @nurabha: in practice, some forms of pipelining were done very early, and the real question is how deep the pipeline of a given processor is. I think the Pentium 4 had a pretty extreme one with 40+ stages. – Michael Borgwardt Nov 10 '14 at 12:01
14

An Analogy: Washing Clothes

Imagine a dry cleaning store with the following facilities: a rack for hanging dirty or clean clothes, a washer and a dryer (each of which can wash one garment at a time), a folding table, and an ironing board.

The attendant who does all of the actual washing and drying is rather dim-witted so the store owner, who takes the dry cleaning orders, takes special care to write out each instruction very carefully and explicitly.

On a typical day these instructions may be something along the lines of:

  1. take the shirt from the rack
  2. wash the shirt
  3. dry the shirt
  4. iron the shirt
  5. fold the shirt
  6. put the shirt back on the rack
  7. take the pants from the rack
  8. wash the pants
  9. dry the pants
  10. fold the pants
  11. put the pants back on the rack
  12. take the coat from the rack
  13. wash the coat
  14. dry the coat
  15. iron the coat
  16. put the coat back on the rack

The attendant follows these instructions to a tee, being very careful not to ever do anything out of order. As you can imagine, it takes a long time to get the day's laundry done, because it takes a long time to fully wash, dry, and fold each piece of laundry, and it must all be done one at a time.

However, one day the attendant quits and a new, smarter attendant is hired, who notices that most of the equipment is lying idle at any given time during the day. While the pants were drying, neither the ironing board nor the washer was in use. So he decided to make better use of his time. Thus, instead of the above series of steps, he would do this:

  1. take the shirt from the rack
  2. wash the shirt, take the pants from the rack
  3. dry the shirt, wash the pants
  4. iron the shirt, dry the pants
  5. fold the shirt, (take the coat from the rack)
  6. put the shirt back on the rack, fold the pants, (wash the coat)
  7. put the pants back on the rack, (dry the coat)
  8. (iron the coat)
  9. (put the coat back on the rack)

This is pipelining: sequencing unrelated activities such that they use different components at the same time. By keeping as many of the different components active at once as possible, you maximize efficiency and speed up execution time, in this case reducing 16 "cycles" to 9, a speedup of over 40%.

Now, the little dry cleaning shop started to make more money because they could work so much faster, so the owner bought an extra washer, dryer, ironing board, and folding station, and even hired another attendant. Now things are even faster; instead of the above, you have:

  1. take the shirt from the rack, take the pants from the rack
  2. wash the shirt, wash the pants, (take the coat from the rack)
  3. dry the shirt, dry the pants, (wash the coat)
  4. iron the shirt, fold the pants, (dry the coat)
  5. fold the shirt, put the pants back on the rack, (iron the coat)
  6. put the shirt back on the rack, (put the coat back on the rack)

This is superscalar design. Multiple sub-components capable of doing the same task simultaneously, but with the processor deciding how to do it. In this case it resulted in a nearly 50% speed boost (in 18 "cycles" the new architecture could run through 3 iterations of this "program" while the previous architecture could only run through 2).
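
Checking the arithmetic on those claims (a quick Python sketch, with the step counts taken from the lists above):

    sequential, pipelined, superscalar = 16, 9, 6  # "cycles" per load, from above
    print(1 - pipelined / sequential)   # 0.4375 -- pipelining cuts time by over 40%
    window = 18
    print(window // pipelined)          # 2 loads for the pipelined shop
    print(window // superscalar)        # 3 loads for the superscalar shop: ~50% more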

Older processors, such as the 386, are simple scalar processors: they execute one instruction at a time, in exactly the order in which it was received (the 486 added a pipeline but could still only issue one instruction per clock). Modern consumer processors, since the PowerPC/Pentium, are pipelined and superscalar. A Core 2 CPU is capable of running the same code that was compiled for a 486 while still taking advantage of instruction-level parallelism, because it contains its own internal logic that analyzes machine code and determines how to reorder and run it (what can be run in parallel, what can't, etc.). This is the essence of superscalar design and why it's so practical.

In contrast, a vector parallel processor performs operations on several pieces of data at once (a vector). Thus, instead of just adding x and y, a vector processor would add, say, x0,x1,x2 to y0,y1,y2 (resulting in z0,z1,z2). The problem with this design is that it is tightly coupled to the specific degree of parallelism of the processor. If you run scalar code on a vector processor (assuming you could), you would see no advantage from the vector parallelization, because it needs to be used explicitly. Similarly, if you wanted to take advantage of a newer vector processor with more parallel processing units (e.g. one capable of adding vectors of 12 numbers instead of just 3), you would need to recompile your code. Vector processor designs were popular in the oldest generation of supercomputers because they were easy to design, and because there are large classes of problems in science and engineering with a great deal of natural parallelism.
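
The contrast is easy to see in code. A sketch in Python, with NumPy standing in for a hardware vector unit (an analogy for who expresses the parallelism, not actual vector hardware):

    import numpy as np

    x, y = [1, 2, 3], [10, 20, 30]

    # Scalar code: one add per instruction. A superscalar CPU can still run
    # several of these iterations in parallel, but it discovers that on its own.
    z = [xi + yi for xi, yi in zip(x, y)]

    # Vector-style code: the parallelism is explicit in the program itself. A
    # wider vector machine (12 lanes instead of 3) would need a rewrite/recompile.
    z = (np.array(x) + np.array(y)).tolist()  # conceptually a single vector add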

Superscalar processors can also have the ability to perform speculative execution. Rather than leaving processing units idle and waiting for a code path to finish executing before branching, a processor can make a best guess and start executing code past the branch before the prior code has finished processing. When execution of the prior code catches up to the branch point, the processor can then compare the actual branch with the guess, and either continue on if the guess was correct (already well ahead of where it would have been by just waiting) or invalidate the results of the speculative execution and run the code for the correct branch.
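
A toy model of that guess/check/rollback control flow (hypothetical Python functions, purely illustrative):

    def work_taken():        # stand-in for the code after a taken branch
        return "taken-path result"

    def work_fallthrough():  # stand-in for the code after a not-taken branch
        return "fall-through result"

    def speculate(branch_condition, predicted=True):
        """Guess a branch direction, do the work early, roll back if wrong."""
        speculative = work_taken() if predicted else work_fallthrough()
        actual = branch_condition()     # the slow prior code finally resolves
        if actual == predicted:
            return speculative          # guess was right: result already computed
        # Guess was wrong: discard the speculative work, run the other path.
        return work_taken() if actual else work_fallthrough()

    print(speculate(lambda: True))   # correct guess -- speculative result reused
    print(speculate(lambda: False))  # wrong guess -- rolled back and redone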

– Wedge
6

Pipelining is what a car company does in the manufacturing of their cars. They break down the process of putting together a car into stages, and perform the different stages at different points along an assembly line, each done by different people. The net result is that cars are manufactured at exactly the rate of the slowest stage alone.

In CPUs the pipelining process is exactly the same. An "instruction" is broken down into various stages of execution, usually something like: 1. fetch the instruction, 2. fetch the operands (register or memory values that are read), 3. perform the computation, 4. write the results (to memory or registers). The slowest of these might be the computation part, in which case the overall throughput speed of the instructions through this pipeline is just the speed of the computation part (as if the other parts were "free").
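
In other words, the clock is set by the slowest stage, and in steady state one instruction completes per clock regardless of what the other stages cost. A quick worked example (made-up stage latencies, illustrative numbers only):

    # Hypothetical stage latencies in nanoseconds (illustrative numbers only).
    stage_ns = {"fetch": 1.0, "operands": 1.5, "compute": 3.0, "writeback": 1.0}

    clock_ns = max(stage_ns.values())  # the clock must fit the slowest stage
    total_ns = sum(stage_ns.values())  # latency of one instruction, unpipelined
    print(f"pipelined: one instruction per {clock_ns} ns; "
          f"unpipelined: one per {total_ns} ns")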

Super-scalar in microprocessors refers to the ability to run several instructions from a single execution stream at once in parallel. So if a car company ran two assembly lines then obviously they could produce twice as many cars. But if the process of putting a serial number on the car was at the last stage and had to be done by a single person, then they would have to alternate between the two pipelines and guarantee that they could get each done in half the time of the slowest stage in order to avoid becoming the slowest stage themselves.

Super-scalar in microprocessors is similar but usually has far more restrictions. So the instruction fetch stage will typically produce more than one instruction during its stage -- this is what makes super-scalar in microprocessors possible. There would then be two fetch stages, two execution stages, and two write back stages. This obviously generalizes to more than just two pipelines.

This is all fine and dandy, but from the perspective of sound execution both techniques could lead to problems if done blindly. For correct execution of a program, it is assumed that the instructions are executed completely, one after another, in order. If two sequential instructions have inter-dependent calculations or use the same registers, then there can be a problem: the later instruction needs to wait for the write back of the previous instruction to complete before it can perform the operand fetch stage. Thus you need to stall the second instruction by two stages before it is executed, which defeats the purpose of what was gained by these techniques in the first place.
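
A minimal sketch of that dependence check, with instructions written as hypothetical (dest, src1, src2) register triples (not any real ISA):

    # Instructions as (dest, src1, src2) register triples.
    i1 = ("r3", "r1", "r2")   # r3 = r1 + r2
    i2 = ("r5", "r3", "r4")   # r5 = r3 + r4 -- reads r3 before i1 writes it back

    def raw_hazard(first, second):
        """Read-after-write: `second` reads a register `first` hasn't written yet."""
        return first[0] in second[1:]

    if raw_hazard(i1, i2):
        print("stall i2 (insert bubbles) until i1's write back completes")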

There are many techniques used to reduce the need to stall that are a bit complicated to describe, but I will list them:

  1. register forwarding (also store-to-load forwarding)
  2. register renaming
  3. score-boarding
  4. out-of-order execution
  5. speculative execution with rollback (and retirement)

All modern CPUs use pretty much all of these techniques to implement super-scalar execution and pipelining. However, these techniques tend to have diminishing returns with respect to the number of pipelines in a processor before stalls become inevitable. In practice no CPU manufacturer makes more than 4 pipelines in a single core.
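
As one example from that list, register renaming removes the *false* (write-after-write and write-after-read) dependences by giving every new write a fresh physical register. A rough Python sketch of the bookkeeping (hypothetical register names, not any real design):

    from itertools import count

    phys = (f"p{i}" for i in count())   # endless supply of physical registers
    rename = {}                         # architectural reg -> current physical reg

    def issue(dest, *srcs):
        srcs = tuple(rename.get(s, s) for s in srcs)  # read current mappings
        rename[dest] = next(phys)                     # fresh register for the write
        return (rename[dest], *srcs)

    print(issue("r1", "r2", "r3"))   # ('p0', 'r2', 'r3')
    print(issue("r1", "r4", "r5"))   # ('p1', 'r4', 'r5') -- the two writes to r1
                                     # no longer conflict, so both can be in flight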

Multi-core has nothing to do with any of these techniques. This is basically ramming two micro-processors together to implement symmetric multiprocessing on a single chip, sharing only those components which make sense to share (typically L3 cache and I/O). However, a technique that Intel calls "hyperthreading" is a method of trying to virtually implement the semantics of multi-core within the super-scalar framework of a single core. So a single micro-architecture contains the registers of two (or more) virtual cores and fetches instructions from two (or more) different execution streams, but executes them on a common super-scalar system. The idea is that because the registers cannot interfere with each other, there will tend to be more parallelism, leading to fewer stalls. So rather than simply executing two virtual core execution streams at half the speed, it does better than that, due to the overall reduction in stalls. This would seem to suggest that Intel could increase the number of pipelines. However, this technique has been found to be somewhat lacking in practical implementations. As it is integral to super-scalar techniques, though, I have mentioned it anyway.

– Paul Hsieh
2

Pipelining is the simultaneous execution of different stages of multiple instructions in the same cycle. It is based on splitting instruction processing into stages, with specialized units for each stage and registers for storing intermediate results.

Superscaling is dispatching multiple instructions (or microinstructions) to the multiple execution units that exist in the CPU. It is thus based on having redundant units in the CPU.

Of course, these approaches can complement each other.

– elder_george