Multiply is both a good and a bad example. First off, multiply is an expensive instruction; some processors don't have one, for good reason. You can make it take many clocks or one clock, and x86 and others have done both. A one-clock multiply takes a (relatively) large amount of chip real estate (as Dani mentioned, likely a dedicated block of logic just for the multiply). There is absolutely no reason why one designer would make the same choices as another, be it within the same company (one x86 compared to another) or across architectures (x86 vs arm vs mips, etc). Every designer knows that the result of a multiply has twice as many bits as the operands, so do you give programmers the full answer for all combinations of operands (a result of a different size than the operands), or do you clip the result to the operand size? If you clip, do you give them an overflow flag or an exception, or do you let them keep running without knowing the result is wrong? Do you force them to add wrappers around all mul and div instructions so that the overflow can be detected, costing performance?
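That width/clipping trade-off can be sketched in a few lines of Python. This is just an illustration of the design choice; the 8-bit width and the function names are invented here, not taken from any real ISA:

```python
# Two designer choices for an N-bit multiply: return the full 2N-bit
# product, or clip to N bits and report whether precision was lost.
def mul_full(a, b, bits=8):
    """Full-width result: an N-bit x N-bit product needs 2N bits."""
    return (a * b) & ((1 << (2 * bits)) - 1)

def mul_clipped(a, b, bits=8):
    """Clip the product to N bits and flag overflow, like a status bit."""
    full = a * b
    clipped = full & ((1 << bits) - 1)
    overflow = full != clipped
    return clipped, overflow

# 0xFF * 0xFF = 0xFE01 needs all 16 bits, so the clipped form must flag it.
```

Without the overflow flag (the third option above), the clipped call would silently return 0x01 for 0xFF * 0xFF, which is exactly the "keep on running without knowing the result is wrong" case.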
x86 is an incredibly bad architecture to learn first or to use as a reference for others; it leads to a lot of bad assumptions. Not all processors are microcoded, and not all CISC processors are microcoded. There is no reason why a RISC processor can't be microcoded: you can microcode either CISC or RISC, or leave either one unmicrocoded. It is a design choice, not a rule.
RISC does not mean the smallest number of steps. Even a simple register-to-register move is two steps minimum (fetch source, store result), which can take two clocks to execute the way processors are sometimes implemented (with sram banks for register files, which are not necessarily dual ported). An alu instruction is three steps and can take three clocks on a RISC processor; the RISC will AVERAGE one clock per instruction, but so can a CISC. You can go superscalar and exceed one instruction per clock, at least in bursts when processor bound. The complications of going superscalar are the same for CISC and RISC.
I suggest writing an instruction set simulator, or at least starting one; if nothing else, write a disassembler. Even better, take 100 programmers and have them perform the same programming task, but in isolation from each other. Even if all were taught at the same school by the same teachers, you are going to get somewhere between 3 and 100 different designs for that iss or disassembler. Make the task a text editor: first the programming-language choices will differ, then the design of the program will vary. Hardware design very much resembles software design: you use programming languages, and have a compiler, something like a linker, etc. Take a room full of hardware designers, give them the same task, and you get different designs. It has less to do with CISC vs RISC and a lot more to do with the design team and their choices. Intel has different design goals, backward compatibility for example, and this is a very expensive choice.
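An instruction set simulator can start out tiny. Here is a minimal sketch for a made-up three-instruction ISA (the opcodes, register count, and program encoding are all invented for illustration; nothing here corresponds to a real architecture):

```python
# Minimal instruction set simulator for an invented 3-instruction ISA.
# Programs are lists of tuples: (opcode, operands...).
def run(program):
    regs = [0] * 4  # four general-purpose registers, all start at zero
    for op, *args in program:
        if op == "mov":          # mov rd, imm   : load an immediate
            rd, imm = args
            regs[rd] = imm
        elif op == "add":        # add rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "mul":        # mul rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] * regs[rs2]
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs

# Example: mov r0,6 ; mov r1,7 ; mul r2,r0,r1  leaves 42 in r2.
```

Even at this scale the design choices pile up immediately: register count, how to encode operands, what to do on an unknown opcode. Hand the same spec to 100 programmers and you will see very different answers to each.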
BOTH CISC and RISC convert each instruction into smaller, digestible/divisible steps based on the design of the processor. Replace multiply with add, and compare CISC vs RISC at the asm level, then deeper. With x86 you can use memory as an operand; with arm, for example, you can't. So
register = memory + register
is
load register from memory
register = register + register
You have one extra instruction on the RISC side. But both break down into the same sequence of steps:
resolve memory address
start memory cycle
wait for memory cycle to end
fetch register from register file
send operands to alu
take alu output and store in register file
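That shared breakdown can be modeled directly. In this sketch a hypothetical CISC memory-operand add and the equivalent RISC load/add pair expand to the same micro-steps; the step names and instruction spellings are invented for illustration:

```python
# Both forms expand to essentially the same sequence of steps; the RISC
# pair adds one extra register write for the loaded value (see below).
MEM_ADD_STEPS = [
    "resolve memory address",
    "start memory cycle",
    "wait for memory cycle to end",
    "fetch register from register file",
    "send operands to alu",
    "store alu output in register file",
]

def expand_cisc(instr):
    """CISC: one instruction carries the whole sequence."""
    assert instr == "add r0, [r1]"
    return list(MEM_ADD_STEPS)

def expand_risc(instrs):
    """RISC: the load and the add each contribute part of the sequence."""
    steps = []
    for instr in instrs:
        if instr == "ldr r3, [r1]":
            steps += MEM_ADD_STEPS[:3] + ["store loaded value in register file"]
        elif instr == "add r0, r0, r3":
            steps += MEM_ADD_STEPS[3:]
    return steps
```

Counting the expansions shows the RISC side needs seven steps to the CISC's six, because the loaded value has to land in a register of its own first.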
Now the CISC is actually slightly faster, because the RISC, to execute the instructions properly, would need to store the value read from memory in an extra register (from the asm perspective the CISC uses two registers; the RISC uses three, or two with one reused).
If the value being read from memory is not aligned, then the CISC wins on a technicality (assuming the RISC does not allow unaligned transfers, which is common). It takes the CISC processor the same number of memory cycles to fetch the unaligned data, all things held equal (it takes both processors two memory cycles; the CISC is punished just as the RISC is). But comparing asm instructions to asm instructions, if the memory operand were unaligned the RISC would have to do this:
read memory to register a
read memory to register b
shift a
shift b
or/add
where the CISC does:
read memory to register (takes two memory cycles)
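The two-reads-plus-shifts reconstruction looks like this in practice. This sketch assumes a little-endian 32-bit machine with 4-byte alignment; the function names are invented for illustration:

```python
# Reconstructing an unaligned 32-bit load from two aligned reads,
# the way a RISC without unaligned support has to do it in software.
def read_aligned32(mem, addr):
    """Aligned 32-bit little-endian read; addr must be a multiple of 4."""
    assert addr % 4 == 0
    return int.from_bytes(mem[addr:addr + 4], "little")

def read_unaligned32(mem, addr):
    base = addr & ~3            # round down to the aligned word
    shift = (addr - base) * 8   # byte offset within that word, in bits
    lo = read_aligned32(mem, base)       # "read memory to register a"
    hi = read_aligned32(mem, base + 4)   # "read memory to register b"
    # shift a, shift b, or the pieces together
    return ((lo >> shift) | (hi << (32 - shift))) & 0xFFFFFFFF

mem = bytes(range(16))  # 00 01 02 03 04 05 ... 0f
```

Reading at address 1 pulls bytes 01 02 03 04 out of two aligned words, which is the five-instruction sequence above collapsed into one function.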
You also have instruction size. Popular RISC processors like arm and mips lean toward a fixed instruction length, where x86 is variable: x86 can do in one byte what takes another processor four. Yes, your fetch and decode are more complicated (more logic, more power, etc), but you can fit more instructions in the same size cache.
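A back-of-the-envelope comparison makes the cache-density point concrete. The 3-byte average for x86 is an assumption chosen for illustration, not a measured figure:

```python
# How many instructions fit in a 32 KiB instruction cache, comparing an
# assumed ~3-byte average x86 encoding to a fixed 4-byte RISC encoding.
CACHE_BYTES = 32 * 1024

avg_x86_bytes = 3     # illustrative assumption for variable-length x86
fixed_risc_bytes = 4  # fixed 4-byte RISC instruction word

x86_fit = CACHE_BYTES // avg_x86_bytes
risc_fit = CACHE_BYTES // fixed_risc_bytes
```

Under those assumptions the variable-length encoding fits roughly a third more instructions into the same cache, which is the possible performance boost mentioned again at the end.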
Microcoding does more than break one instruction set into another (the other being something likely quite painful that you would never want to program natively). Microcoding can help you get to market faster, assuming the lower-level system is quicker to implement with fewer bugs. An assumption being that you can ramp up production sooner because you can fix some of the bugs after the fact, and can patch in the field down the road. Not always perfect, not always a success, but compare that to a non-microcoded processor, where you would have to get the compiler folks to work around the bug, or recall the processor, or take a black eye as a company and hope to win some customers back, etc...
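The field-patching idea can be sketched as a writable control store: the decoder indexes a table of micro-op lists, and fixing a bug means rewriting an entry rather than respinning silicon. Every name here is invented for illustration:

```python
# A writable control store: each opcode maps to its micro-op sequence.
# Shipping with a buggy micro-op and patching it later is the point.
control_store = {
    "mul": ["read_operands", "alu_mul_buggy", "write_result"],
}

def patch(opcode, step_index, fixed_step):
    """Field update: overwrite one micro-op in the control store."""
    control_store[opcode][step_index] = fixed_step

# A microcode update delivered after the chip ships:
patch("mul", 1, "alu_mul_fixed")
```

A non-microcoded design has no equivalent of `patch`; the fix has to come from the compiler, a recall, or the next silicon revision.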
So the answer is NO. Both RISC and CISC turn an individual instruction into a sequence of steps that can be microcoded or not. Think of them simply as states in a state machine, implemented however you like. CISC may pack more steps into one instruction, but that means fewer instruction fetches. And knowing the whole CISC instruction up front, the steps may naturally be implemented more efficiently in the chip, where a RISC processor might have to examine a series of instructions and optimize on the fly to get the same number of steps (ldr r0,[r1]; add r0,r0,r2). CISC can look for the same kinds of optimizations if it examines groups of instructions instead of focusing on one. Both use pipelines and parallel execution. CISC often implies x86, and RISC implies something with a more modern and cleaner architecture; cleaner in the sense that it is easier for humans to program and to implement, which doesn't automatically mean faster, since there are more steps to do the same job. x86, being variable word length with a history going back to single-byte instructions, compared to say a 4-byte fixed instruction length, can possibly pack more instructions into a cache than a fixed-length RISC, giving x86 a possible performance boost. Why doesn't RISC just convert many instructions into a single smaller instruction that moves through the cache and pipeline faster?