
Intel has been internally decoding CISC instructions into RISC-like micro-operations since the P6 (Pentium Pro) microarchitecture, and AMD has been doing so since their K5 processors. So does this mean that the x86 instructions get translated to some weird internal RISC ISA during execution? If that is what is happening, then I wonder if it's possible to create a processor that understands (i.e., internally translates to its own proprietary instructions) both x86 and ARM instructions. If that is possible, what would the performance be like? And why hasn't it been done already?

  • Technically, sure, you could. It doesn't necessarily make sense today to use a RISC internally, though; more likely a VLIW. I think this is what Transmeta did, implying you could execute either x86 or the actual instruction set directly, but I didn't research it that well; it didn't make sense to me for them not to support the VLIW directly. ARM is RISC and would take a performance hit being translated even to a VLIW or micro-engine. There would be no value in a product like this, and the legalities and royalties would be rough as well. – old_timer Aug 16 '20 at 20:00
  • You can see historically what has happened to x86 clones as well as ARM clones, so besides there being no value in this product, you probably wouldn't be able to produce it in the first place, much less do so productively. Just buy an ARM or RISC-V core and be done with that part of your chip. – old_timer Aug 16 '20 at 20:02
  • Yes, microcoding, which is not uncommon with CISC, means that at runtime the instructions are translated into a list of internal operations, if you will, which are then executed. It's not so much a simulation; think more of a lookup table of commands. – old_timer Aug 16 '20 at 20:03
  • You could try a Transmeta-type deal and try to run different instruction sets, and might pull that off as a sort of simulation like QEMU rather than a clone, although if you read the documentation from ARM, for example, QEMU is illegal, but for some reason has not had those backends shut down. QEMU helps ARM developers rather than providing competition, but something that could execute better than QEMU might get their attention. And you don't want their attention. – old_timer Aug 16 '20 at 20:10
  • Also understand that a processor is not just instructions; there is a lot of protection and other logic in there that is not compatible from one architecture to another, so you would have to have that logic too in some form. You would end up with something so big that it would cost more than an Intel chip even if you could mass-produce at their volumes, if you could even build it at all due to its size, and the power numbers would be worse than Intel and vastly worse than ARM. It costs more up front, isn't any faster, and the power cost is greater... – old_timer Aug 16 '20 at 20:36
  • Some VIA CPUs [expose their internal RISC instructions](https://en.wikipedia.org/wiki/Alternate_Instruction_Set) that x86 instructions will be transformed into, so in some sense they also support 2 different ISAs. Some early Itanium CPUs also have hardware support to run x86 code. – phuclv Aug 17 '20 at 02:43
  • @phuclv: I was going to answer this, it's actually an interesting question of exactly why it's not as easy as slapping 2 front-ends onto the same back-end pipeline. I put my answer on [ARM vs x86 What are the key differences?](https://stackoverflow.com/a/63444108) since this question got closed before I finished. – Peter Cordes Aug 17 '20 at 03:17
  • @karel: Thanks, copied my answer here with some more intro. – Peter Cordes Aug 17 '20 at 06:58

2 Answers


The more different the ISAs, the harder it would be, and the more overhead it would cost, especially in the back-end. It's not as easy as slapping a different front-end onto a common back-end microarchitecture design.

If it were just a die-area cost for different decoders, not other power or perf differences, that would be minor and totally viable these days, with large transistor budgets. (Taking up space in a critical part of the chip that places important things farther from each other is still a cost, but that's unlikely to be a problem in the front-end.) Clock gating or even power gating could fully power down whichever decoder wasn't being used. But as I said, it's not that simple, because the back-end has to be designed to support the ISA's instructions and other rules / features; CPUs don't decode to a fully generic / neutral RISC back-end. Related: Why does Intel hide internal RISC core in their processors? has some thoughts and info about what the internal RISC-like uops are like in modern Intel designs.

Adding ARM support to Skylake, for example, would make it slower and less power-efficient when running pure x86 code, as well as cost more die area. That's not worth it commercially, given the limited market for it, and the need for special OS or hypervisor software to even take advantage of it. (Although that might start to change with AArch64 becoming more relevant thanks to Apple.)

A CPU that could run both ARM and x86 code would be significantly worse at either one than a pure design that only handles one.

  • Efficiently running 32-bit ARM requires support for fully predicated execution, including fault suppression for loads / stores. (Unlike AArch64 or x86, which only have ALU-select type instructions like csinc vs. cmov / setcc that just have a normal data dependency on FLAGS as well as their other inputs.)

  • ARM and AArch64 (especially SIMD shuffles) have several instructions that produce 2 outputs, while almost all x86 instructions only write one output register. So x86 microarchitectures are built to track uops that read up to 3 inputs (2 before Haswell/Broadwell), and write only 1 output (or 1 reg + EFLAGS).

  • x86 requires tracking the separate components of a CISC instruction, e.g. the load and the ALU uops for a memory source operand, or the load, ALU, and store for a memory destination.

  • x86 requires coherent instruction caches, and snooping for stores that modify instructions already fetched and in flight in the pipeline, or some way to handle at least x86's strong self-modifying-code ISA guarantees (Observing stale instruction fetching on x86 with self-modifying code).

  • x86 requires a strongly-ordered memory model (program order + store buffer with store-forwarding). You have to bake this into your load and store buffers, so I expect that even when running ARM code, such a CPU would basically still use x86's far stronger memory model. (Modern Intel CPUs speculatively load early and do a memory-order machine clear on mis-speculation, so maybe you could let that happen and simply not do those pipeline nukes. Except in cases where it was due to mis-predicting whether a load was reloading a recent store by this thread or not; that of course still has to be handled correctly.)

    A pure ARM design could have simpler load / store buffers that didn't interact with each other as much. (Except for the purpose of making stlr / ldapr / ldar release / acquire / acquire-seq-cst cheaper, not just fully stalling.) See the C11 litmus-test sketch after this list for a concrete illustration of the ordering difference.

  • Different page-table formats. (You'd probably pick one or the other for the OS to use, and only support the other ISA for user-space under a native kernel.)

  • If you did try to fully handle privileged / kernel stuff from both ISAs, e.g. so you could have HW virtualization with VMs of either ISA, you also have stuff like control-register and debug facilities.
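
To make the memory-model point concrete, here is a minimal C11 message-passing litmus test (a sketch only; the file name `mp_litmus.c` is just illustrative, and compiler reordering of relaxed atomics is ignored for the purpose of the example). On x86's TSO model the hardware never reorders the two stores with each other or the two loads with each other, so a reader that sees `flag == 1` must also see `data == 42`; a weakly-ordered ARM/AArch64 core is allowed to produce the "impossible" result unless release/acquire (stlr / ldar-style) operations are used.

```c
// mp_litmus.c -- message-passing litmus test (illustrative sketch).
// Build: cc -O2 -pthread mp_litmus.c
// A single run proves nothing; on a weak ISA the reordering is only *allowed*.
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int data;
static atomic_int flag;

static void *writer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    // On x86 this store can't become visible before the store to 'data'
    // (TSO keeps store->store order at the hardware level); on ARM/AArch64
    // it would need memory_order_release to guarantee that.
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    if (atomic_load_explicit(&flag, memory_order_relaxed) == 1) {
        // x86 also keeps load->load order, so seeing flag==1 implies data==42.
        // A weakly-ordered core may legally still see data==0 here.
        if (atomic_load_explicit(&data, memory_order_relaxed) != 42)
            puts("weakly-ordered result observed");
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

That asymmetry is why a combined CPU would likely keep the stronger x86 ordering baked into its load / store buffers even while running ARM code.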

Update: Apple M1 does support a strong x86-style TSO memory model, allowing efficient+correct binary translation of x86-64 machine code into AArch64 machine code, without needing to use ldapr / stlr for every load and store. It also has a weak mode for running native AArch64 code, toggleable by the kernel.

In Apple's Rosetta binary translation, software handles all the other issues I mentioned; the CPU is just executing native AArch64 machine code. (And Rosetta only handles user-space programs, so there's no need to even emulate x86 page-table formats and semantics like that.)


This already exists for other combinations of ISAs, notably AArch64 + ARM, but also x86-64 + 32-bit x86, which have slightly different machine-code formats (and, in the 64-bit case, a larger register set). Those pairs of ISAs were of course designed to be compatible, and for kernels for the new ISA to have support for running the older ISA as user-space processes.

At the easiest end of the spectrum, we have x86-64 CPUs which support running 32-bit x86 machine code (in "compat mode") under a 64-bit kernel. They use the same fetch/decode/issue/out-of-order-exec pipeline for all modes. 64-bit x86 machine code is intentionally similar enough to 16- and 32-bit modes that the same decoders can be used, with only a few mode-dependent decoding differences. (Like inc/dec vs. REX prefix, illustrated below.) AMD was intentionally very conservative, unfortunately, leaving many minor x86 warts unchanged for 64-bit mode, to keep decoders as similar as possible. (Perhaps in case AMD64 didn't even catch on, they didn't want to be stuck spending extra transistors that people wouldn't use.)
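
As a small, hedged illustration of that kind of mode-dependent decoding (opcode bytes taken from the x86 opcode map; the C wrapper is just illustrative packaging so the snippet compiles):

```c
// Same bytes, two decodings, depending on CPU mode:
//
//   bytes 48 FF C0
//     32-bit mode:  48        dec eax
//                   FF C0     inc eax
//     64-bit mode:  48 FF C0  inc rax    (48 = REX.W prefix, FF /0 = inc)
//
// 0x40-0x4F are one-byte inc/dec opcodes in 16/32-bit modes but REX prefixes
// in 64-bit mode -- one of the few mode-dependent differences the shared
// decoders have to handle.
#include <stdio.h>

static const unsigned char code[] = { 0x48, 0xFF, 0xC0 };

int main(void) {
    printf("bytes:");
    for (unsigned i = 0; i < sizeof code; i++)
        printf(" %02X", code[i]);
    printf("\n32-bit mode: dec eax; inc eax\n64-bit mode: inc rax\n");
    return 0;
}
```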

AArch64 and ARM 32-bit are separate machine-code formats with significant differences in encoding. e.g. immediate operands are encoded differently, and I assume most of the opcodes are different. Presumably pipelines have 2 separate decoder blocks, and the front-end routes the instruction stream through one or the other depending on mode. Both are relatively easy to decode, unlike x86, so this is presumably fine; neither block has to be huge to turn instructions into a consistent internal format. Supporting 32-bit ARM does mean somehow implementing efficient support for predication throughout the pipeline, though.

Early Itanium (IA-64) also had hardware support for x86, defining how the x86 register state mapped onto the IA-64 register state. Those ISAs are completely different. My understanding was that x86 support was more or less "bolted on", with a separate area of the chip dedicated to running x86 machine code. Performance was bad, worse than good software emulation, so once that was ready the HW designs dropped it. (https://en.wikipedia.org/wiki/IA-64#Architectural_changes)

> So does this mean that the x86 instructions get translated to some weird internal RISC ISA during execution?

Yes, but that "RISC ISA" is not similar to ARM; e.g., it has all the quirks of x86, like shifts leaving FLAGS unmodified if the shift count is 0. (Modern Intel handles that by decoding shl eax, cl to 3 uops; Nehalem and earlier stalled the front-end if a later instruction wanted to read FLAGS from a shift.)
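
To show the architectural behaviour those uops have to preserve, here is a small GNU C inline-asm sketch (x86-64, GCC/Clang assumed; illustrative only): a variable-count shift whose count happens to be 0 must leave FLAGS exactly as they were, so a flag set before the shift is still visible afterwards.

```c
// shl_by_zero.c -- a variable-count shift with count 0 leaves FLAGS untouched.
// Build (x86-64, GCC or Clang): cc -O2 shl_by_zero.c
#include <stdio.h>

int main(void) {
    unsigned val = 1;
    unsigned count = 0;           // shift count of zero
    unsigned char zf;

    __asm__ volatile(
        "cmp %[v], %[v]\n\t"      // v - v == 0  => sets ZF = 1
        "shl %%cl, %[v]\n\t"      // CL == 0: value *and* FLAGS must not change
        "sete %[zf]\n\t"          // reads ZF *after* the shift
        : [v] "+r"(val), [zf] "=r"(zf)
        : "c"(count)
        : "cc");

    printf("ZF still set after shl by 0: %d (val = %u)\n", zf, val);  // 1, 1
    return 0;
}
```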

Probably a better example of a back-end quirk that needs to be supported is x86 partial registers, like writing AL and AH, then reading EAX. The RAT (register allocation table) in the back-end has to track all that, and issue merging uops or however it handles it. (See Why doesn't GCC use partial registers?).
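
And a similar sketch (again x86-64 GNU C inline asm, illustrative only) for the partial-register case: writing AL and AH separately and then reading the full EAX forces exactly the kind of merge the RAT has to track.

```c
// partial_regs.c -- write AL and AH separately, then read EAX (x86-64 GNU C).
// Build: cc -O2 partial_regs.c
#include <stdio.h>

int main(void) {
    unsigned eax_out;

    __asm__ volatile(
        "xor  %%eax, %%eax\n\t"   // EAX = 0
        "movb $0x34, %%al\n\t"    // write only bits 7:0
        "movb $0x12, %%ah\n\t"    // write only bits 15:8
        "mov  %%eax, %0\n\t"      // reading full EAX: back-end must merge AL/AH
        : "=r"(eax_out)
        :
        : "eax", "cc");

    printf("eax = 0x%x\n", eax_out);   // prints 0x1234
    return 0;
}
```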

Peter Cordes

Short answer: yes, it can be done. See/Google "mainframe microcode". Yes, it has been done with mainframes and minis. Because CPUs these days are highly optimized for their own architecture, good performance is unlikely with alternate microcode. Experience shows that emulation of CPU X by CPU Y in microcode is a non-trivial issue; you ultimately need to know more about both CPUs than the original designers did. And heaven help you with mask variations. Better to write higher-level emulators. Voice of experience.

sys101