
From what I understand, processor circuitry varies greatly from chip to chip, and different chips may therefore require different low-level instructions to execute the same high-level code. Are all programs eventually converted to assembly language before becoming raw machine code, or is this step no longer necessary?

If so, at what point does the processor begin to execute its own unique set of instructions? This is the lowest level of code, so is it at that point that the processor executes the program's instructions, bit by bit?

Finally, do all architectures have/need an assembly language?

– geg

8 Answers


Assembly language is, so to speak, a human-readable form of expressing the instructions a processor executes (which are binary data and very hard for a human to manage). So if the machine instructions are not generated by a human, an assembly step is not necessary, though it sometimes happens for convenience. If a program is compiled from a language such as C++, the compiler may generate machine code directly, without going through the intermediate stage of assembly code. Still, many compilers provide an option to generate assembly code, to make it easier for a human to inspect what gets generated.
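As a concrete illustration (a minimal sketch; the exact assembly you get varies by compiler, version, target, and optimization flags), GCC and Clang both accept `-S` to stop after compilation and emit assembly instead of an object file:

```c
/* square.c - a tiny function whose compiled output is easy to inspect. */
int square(int x) {
    return x * x;
}

/* Typical usage with GCC or Clang:
 *   gcc -S square.c    -> writes human-readable assembly to square.s
 *   gcc -c square.c    -> writes machine code directly to square.o
 * The .s file exists for humans; the CPU only ever sees the bytes
 * in the object file.
 */
```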

Many modern languages, for example Java and C#, are compiled into so-called bytecode. This is not code that the CPU executes directly, but an intermediate form, which may be compiled to machine code just-in-time (JIT-ed) as the program runs. In that case, CPU-dependent machine code is generated, but usually without going through human-readable assembly code.

– Michał Kosmulski

Assembly language is simply a human-readable, textual representation of the raw machine code. It exists for the benefit of the (human) programmers. It's not at all necessary as an intermediate step to generate machine code. Some compilers do generate assembly and then call an assembler to convert that to machine code. But since omitting that step results in faster compilation (and is not that hard to do), compilers will (broadly speaking) tend to evolve towards generating machine code directly. It is useful to have the option of compiling to assembly though, to inspect the results.

For your last question, assembly language is a human convenience, so no architecture truly needs one. You could create an architecture without one if you really wanted to. But in practice, all architectures have an assembly language. First, it's very easy to create a new assembly language: give a textual name to each of your machine opcodes and registers, add some syntax to represent the different addressing modes, and you're already mostly done (see the sketch below). And even if all code were converted from a higher-level language directly to machine language, you would still want an assembly language, if only as a way of disassembling and visualizing machine code when hunting for compiler bugs, etc.
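To make that concrete, here is a minimal sketch in C of the core of an assembler for a made-up toy ISA (the mnemonics and opcode values here are invented purely for illustration): at its heart, it is just a table lookup from textual name to opcode byte.

```c
#include <stdio.h>
#include <string.h>

/* A made-up toy ISA: each mnemonic is just a human-friendly name for
 * an opcode byte. Real assemblers add operand parsing, addressing
 * modes, labels, and relocations on top of this basic idea. */
struct insn { const char *mnemonic; unsigned char opcode; };

static const struct insn table[] = {
    { "NOP",  0x00 },
    { "LOAD", 0x10 },
    { "ADD",  0x20 },
    { "JMP",  0x30 },
};

/* Translate one mnemonic to its opcode; returns -1 if unknown. */
int assemble_one(const char *mnemonic) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].mnemonic, mnemonic) == 0)
            return table[i].opcode;
    return -1;
}

int main(void) {
    printf("ADD -> 0x%02X\n", assemble_one("ADD"));  /* prints 0x20 */
    return 0;
}
```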

– Christian Hudon
    A related Q&A for your 2nd paragraph: [Why do we even need assembler when we have compiler?](https://stackoverflow.com/q/51780158) - your answer is basically what I wrote there: we have asm for the benefit of human compiler devs and others to look at, think about, and use when discussing with other humans. (And tools for performance experiments / microbenchmarks.) – Peter Cordes Apr 01 '21 at 02:38

Every general purpose CPU has its own instruction set. That is, certain sequences of bytes, when executed, have a well known, documented effect on registers and memory. Assembly language is a convenient way of writing down those instructions, so that humans can read and write them and understand what they do without having to look up commands all the time. It's fairly safe to say that for every modern CPU, an assembly language exists.

Now, about whether programs are converted to assembly. Let's start by saying that the CPU does not execute assembly code. It executes machine code, but there's a one-to-one correspondence between machine code commands and assembly lines. As long as you keep that distinction in mind, you can say things like "and now the CPU executes a MOV, then an ADD", and so on. The CPU executes the machine code that corresponds to a MOV command, of course.
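As an illustration of that one-to-one mapping, here is a sketch (assuming x86-64 Linux; the byte encodings are real x86, but some hardened systems refuse writable-and-executable mappings, so treat this as a demonstration, not a technique) that puts a few hand-encoded machine instructions into executable memory and calls them:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* Each assembly line corresponds to a fixed byte sequence: */
    unsigned char code[] = {
        0xB8, 0x05, 0x00, 0x00, 0x00,  /* mov eax, 5 */
        0x83, 0xC0, 0x03,              /* add eax, 3 */
        0xC3                           /* ret        */
    };

    /* Copy the bytes into a page the CPU is allowed to execute. */
    void *mem = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;
    memcpy(mem, code, sizeof code);

    int (*fn)(void) = (int (*)(void))mem;
    printf("%d\n", fn());  /* prints 8: the CPU ran our raw machine code */
    return 0;
}
```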

That said, if your language compiles to native code, your program is, indeed, converted to machine code before execution. Some compilers (not all) do that by emitting assembly sources and letting the assembler do the final step. This step, when present, is typically well hidden. The assembly representation only exists for a brief time during the compilation process, unless you tell the compiler to keep it intact.

Other compilers don't use an assembly step, but can emit assembly if asked to. Microsoft C++, for example, takes the option /FA, which emits an assembly listing along with the object file.

If it's an interpreted language, then there's no explicit conversion to machine code. The source lines are executed by the language interpreter. Bytecode-oriented languages (Java, Visual Basic) live somewhere in between: they're compiled to code that is not the same as machine code, but is much easier to interpret than the high-level source. For those, it's also fair to say they're not converted to machine code.
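A minimal sketch in C of what "easier to interpret" means (the opcodes here are invented, not real JVM bytecode): the interpreter is an ordinary program, and the bytecode is just data that it reads and acts on.

```c
#include <stdio.h>

/* A made-up stack-machine bytecode, for illustration only. */
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

void run(const int *bytecode) {
    int stack[64], sp = 0;
    for (int pc = 0; ; ) {
        switch (bytecode[pc++]) {
        case OP_PUSH:  stack[sp++] = bytecode[pc++];     break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[sp - 1]);    break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    /* A "program" computing 2 + 3. It is never converted to machine
     * code; it is just data read by the interpreter's machine code. */
    int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(program);  /* prints 5 */
    return 0;
}
```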

– Seva Alekseyev

This is a fairly large rabbit hole you're looking down.

That being said, no, not all programs are turned into assembly language. If we exclude just-in-time compilation, interpreted languages like Ruby, Lisp, and Python, as well as programs that run on a virtual machine (VM) like Java and C#, are not turned into assembly. Rather, there is an existing program, an interpreter or virtual machine, that takes in the source (interpreted) or bytecode (VM), which isn't the assembly language of your computer, and runs it. The interpreter knows what to do when it sees particular sequences of input and takes the right actions, even if it hasn't seen that particular input before.

Compiled programs, like those you'd write in C or C++, can, as part of the compilation process, be turned into assembly language that is then turned into machine language. Often this step is skipped to speed things up. Some compilers, like LLVM, output a generic bitcode, so that the parts of the compiler that generate the bitcode can be kept separate from the parts that turn bitcode into machine code, allowing reuse across architectures.

However, even though the OS sees the CPU as something that consumes machine code, many CPUs have lower-level microcode. Each (assembly-level) instruction in the instruction set is implemented by the CPU as a sequence of simpler microcode operations. Across CPUs the instruction set can stay the same while the microcode that implements the instructions changes. Think of the instruction set as an API for the CPU.
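Purely as an illustrative sketch in C (these micro-ops are invented; real microcode is an internal, model-specific detail that vendors don't document like this), a single architectural PUSH instruction might break down into a short sequence of simpler internal steps:

```c
#include <stdio.h>

/* Invented micro-ops, for illustration only. */
enum micro_op { UOP_DEC_SP, UOP_STORE_TO_SP, UOP_DONE };

/* One plausible decomposition of "PUSH reg": decrement the stack
 * pointer, then store the register. Two different CPUs can expose
 * the same PUSH while implementing it with different internal
 * sequences - the instruction set is the API, microcode the impl. */
static const enum micro_op push_microcode[] = {
    UOP_DEC_SP, UOP_STORE_TO_SP, UOP_DONE
};

int main(void) {
    int mem[16] = {0}, sp = 16, reg = 42;  /* toy machine state */
    for (int i = 0; push_microcode[i] != UOP_DONE; i++) {
        switch (push_microcode[i]) {
        case UOP_DEC_SP:      sp -= 1;       break;
        case UOP_STORE_TO_SP: mem[sp] = reg; break;
        default:                             break;
        }
    }
    printf("sp=%d mem[sp]=%d\n", sp, mem[sp]);  /* sp=15 mem[sp]=42 */
    return 0;
}
```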

– Paul Rubel
    Of course, the interpreter itself is composed of machine-code instructions (or is itself JITed or interpreted, until some lower layer consists of ahead-of-time compiled or hand-written machine code.) But yes, an interpreted program is just *data* for the machine instructions of the interpreter program. – Peter Cordes Apr 01 '21 at 02:42

All processors operate on bits; we call that machine code, and it can take on very different flavors for different reasons, from building a better mousetrap to patents protecting ideas. Every processor uses some flavor of machine code from a user's perspective; some internally convert that to microcode, another machine code, and others don't. When you hear x86 vs ARM vs MIPS vs PowerPC, those are not just company names; they are also distinct instruction sets, machine code, for the respective processors. x86 instruction sets, although evolving, still resemble their history, and you can easily pick out x86 code from the others. And that is true across the companies: you can see the MIPS legacy in MIPS, the ARM legacy in ARM, and so on.

So to run a program on a processor, at some point it has to be converted into the machine code for that processor, and then the processor can handle it. Various languages and tools do it in various ways. It is not required for a compiler to compile from the high-level language to assembly language, but it is convenient. First off, you basically need an assembler for that processor anyway, so the tool is there. Second, it can be much easier to debug the compiler by looking at human-readable assembly language rather than the bits and bytes of machine code. Some languages' compilers, like Java's, Python's, and the old Pascal compilers, use a universal machine code (each language has its own different one), universal in the sense that Java on an x86 and Java on an ARM do the same thing up to that point; then a target-specific (x86, ARM, MIPS) interpreter decodes the universal bytecode and executes it on the native processor. But ultimately it has to become the machine code for the processor it is running on.

There is also some history behind this method of compiling in layers. I would argue it is somewhat the Unix building-block approach: make one block do the front end and another block the back end that outputs asm; then asm to object is its own tool, and linking objects together is its own tool. Each block can be contained and developed with controlled inputs and outputs, and at times substituted with another block that fits in the same place. Compiler classes teach this model, so you will see it replicated in new compilers and new languages: parse the front end (the text of the high-level language) into an intermediate, compiler-specific binary code, then on the back end take that internal code and turn it into assembly for the target processor. This allows, for example with GCC and many others, swapping out that back end so the front and middle can be reused for different targets. Then separately have an assembler, and also a separate linker, each a tool in its own right.

People keep trying to re-invent the keyboard and mouse, but folks are comfortable enough with the old way that they stick with it even if the new invention is much better. The same is true of compilers and operating systems and so many other things: we go with what we know, and with compilers that often means compiling to assembly language.

– old_timer

Basically yes. Java's analogue of assembly is called bytecode, and any chip's microarchitecture will implement an ISA that consists of assembly-level instructions or something similar, while the same ISA can be realized on many different chips. If you learn MIPS, that is a good introduction: you can learn how C is translated to MIPS by a compiler, and then see how a MIPS instruction translates to machine code, whose opcode field tells the processor which operation to execute. For more info you can read Hennessy and Patterson, who have written two good books on computer hardware: "Computer Organization and Design" and "Computer Architecture: A Quantitative Approach".
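For instance, here is a short sketch in C of how one MIPS R-type instruction, `add $t0, $t1, $t2`, is encoded into a 32-bit machine word by packing its fields (the field layout is standard MIPS; the program just does the bit arithmetic):

```c
#include <stdio.h>

/* Pack the fields of a MIPS R-type instruction into a 32-bit word:
 * opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)
 * R-type instructions all use opcode 0; funct selects the operation. */
unsigned encode_rtype(unsigned rs, unsigned rt, unsigned rd,
                      unsigned shamt, unsigned funct) {
    return (0u << 26) | (rs << 21) | (rt << 16) |
           (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    /* add $t0, $t1, $t2  ->  rd=$t0(8), rs=$t1(9), rt=$t2(10), funct=0x20 */
    unsigned word = encode_rtype(9, 10, 8, 0, 0x20);
    printf("0x%08X\n", word);  /* prints 0x012A4020 */
    return 0;
}
```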

– Niklas Rosencrantz

Many compilers that produce native machine code first produce the appropriate assembly language, which is then assembled into machine code. That normally happens as a single, hidden step, but some compilers, like GCC, can output the intermediate assembly as well.

You are correct in that different architectures have differing instruction sets. Utilizing those differences is how compilers can optimize an executable for one processor or another.

– Evil Genius

Here's some of what may be confusing you:

  • All programs must be converted into machine instructions, because that's what machines execute.
  • An assembler language is a low-level programming language that corresponds almost one-to-one with machine instructions.
  • A program may either be compiled to machine instructions, or interpreted as machine instructions executed by an interpreter.
  • Programs are not usually converted to assembler language, as that would require that the assembler language be converted into machine instructions. I seem to recall some very old compilers which produced assembler language, but there's no reason I know of to do that sort of thing today.
  • There are multiple ways for machines to execute machine instructions. They may be hard-wired, or they may use microcode. I suspect that almost all modern CPUs use microcode. This is, indeed, magic.
– John Saunders
  • *machine instructions executed by an interpreter.* - The normal terminology doesn't describe interpreter instructions as machine instructions. e.g. you could call `echo foo` in a `#!/bin/sh` script a "machine instruction". Most shells don't even pre-compile to bytecode, but even interpreters that do (like CPython) don't call them *machine* instructions because hardware doesn't run them directly. Maybe "instructions", or "bytecode instructions". (There have been a few CPUs with some hardware-assist for running Java bytecode, and Lisp machines, which blur that line...) – Peter Cordes Apr 01 '21 at 02:56
  • *I suspect that almost all modern CPUs use microcode* - not in the old-school sense of *each* instruction being a sequence of internal steps programmed by a ROM, like 8086 or 6502. In modern x86 CPUs, simple instructions like `add eax, ecx` turn into a single internal uop, and modern RISC CPUs like ARM or AArch64 can run almost every instruction they support as a single internal operation. (That's the core point of the RISC philosophy, after all, allowing easier pipelining.) Modern CPUs use microcode only for complex instructions (like `syscall`) or corner cases (like subnormal FP math). – Peter Cordes Apr 01 '21 at 03:01