70

I'm primarily interested in popular and widely used compilers, such as gcc. But if things are done differently with different compilers, I'd like to know that, too.

Taking gcc as an example, does it compile a short program written in C directly to machine code, or does it first translate it to human-readable assembly, and only then use an (in-built?) assembler to translate the assembly program into binary machine code -- a series of instructions to the CPU?

Is using assembly code to create a binary executable a significantly expensive operation? Or is it a relatively simple and quick thing to do?

(Let's assume we're dealing with only the x86 family of processors, and all programs are written for Linux.)

Peter Cordes
  • Related: [Does a compiler always produce an assembly code?](https://stackoverflow.com/q/14039843) - no, big mainstream C compilers that provide a complete toolchain often go straight to machine code, especially ones (unlike GCC) that only target a few ISAs / object file formats. But yes, compilers with smaller dev teams often leave the object-file handling to an existing assembler. Also related: [What do C and Assembler actually compile to?](https://stackoverflow.com/q/2135788) – Peter Cordes Aug 07 '20 at 15:07

14 Answers

66

gcc actually produces assembly code and assembles it using the `as` assembler. Not all compilers do this - the MS compilers produce object code directly, though you can make them generate assembly output. Translating assembly to object code is a pretty simple process, at least compared with C→Assembly or C→Machine-code translation.
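
To give a feel for why assembling is comparatively simple, here is a minimal sketch of a toy assembler in Python. It is not how `as` works internally: it handles only a handful of x86 instructions whose encodings are fixed (`nop`, `ret`, `syscall`, `int imm8` and `mov r32, imm32`), and the name `assemble` is made up for this illustration.

```python
import struct

# 32-bit register numbers used by the x86 "mov r32, imm32" encoding (0xB8 + reg).
REGS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
        "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

# Instructions without operands map straight to fixed byte sequences.
FIXED = {"nop": b"\x90", "ret": b"\xc3", "syscall": b"\x0f\x05"}


def assemble(source):
    """Translate a tiny subset of Intel-syntax x86 assembly into machine code."""
    out = bytearray()
    for line in source.splitlines():
        line = line.split(";")[0].strip().lower()   # drop comments and whitespace
        if not line:
            continue
        mnemonic, _, rest = line.partition(" ")
        operands = [op.strip() for op in rest.split(",")] if rest.strip() else []
        if mnemonic in FIXED and not operands:
            out += FIXED[mnemonic]
        elif mnemonic == "int" and len(operands) == 1:
            # int imm8 is encoded as 0xCD followed by the interrupt number.
            out += b"\xcd" + struct.pack("<B", int(operands[0], 0))
        elif mnemonic == "mov" and len(operands) == 2 and operands[0] in REGS:
            # mov r32, imm32 is encoded as (0xB8 + reg) plus a little-endian immediate.
            out += bytes([0xB8 + REGS[operands[0]]])
            out += struct.pack("<I", int(operands[1], 0) & 0xFFFFFFFF)
        else:
            raise ValueError(f"unsupported instruction: {line!r}")
    return bytes(out)
```

For instance, `assemble("mov eax, 1\nmov ebx, 0\nint 0x80")` yields the bytes of a 32-bit Linux `exit(0)` sequence. A real assembler also has to handle labels, relocations, addressing modes and object-file headers, but the core is still largely table lookup, which is why it is cheap next to optimizing compilation.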

Some compilers produce other high-level language code as their output - for example, cfront, the first C++ compiler, produced C as its output which was then compiled to machine code by a C compiler.

Note that neither direct compilation nor assembly actually produces an executable. That is done by the linker, which takes the various object code files produced by compilation/assembly, resolves all the names they contain and produces the final executable binary.
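
The name resolution a linker performs can be sketched roughly as follows. This is a toy model, not the ELF format or the algorithm `ld` uses: each "object file" here is a made-up Python dictionary holding code bytes, the symbols it defines (as offsets), and the places where it needs the 4-byte absolute address of some symbol. The `base` load address is likewise an arbitrary assumption.

```python
import struct


def link(objects, base=0x1000):
    """Concatenate toy object files and patch symbol references.

    Each object is a dict with:
      "code":    machine code bytes, with 4 placeholder bytes per reference
      "defines": {symbol name: offset within this object's code}
      "refs":    [(offset of placeholder, symbol name), ...]
    """
    image = bytearray()
    symbols = {}     # symbol name -> final absolute address
    pending = []     # (absolute position in image, symbol name)

    # Pass 1: lay out the code and record where every symbol ends up.
    for obj in objects:
        start = len(image)
        for name, offset in obj["defines"].items():
            if name in symbols:
                raise ValueError(f"duplicate symbol: {name}")
            symbols[name] = base + start + offset
        for offset, name in obj["refs"]:
            pending.append((start + offset, name))
        image += obj["code"]

    # Pass 2: fill each placeholder with the resolved address.
    for position, name in pending:
        if name not in symbols:
            raise ValueError(f"undefined reference to {name}")
        image[position:position + 4] = struct.pack("<I", symbols[name])
    return bytes(image), symbols
```

Two passes are needed because an object may refer to a symbol that is only defined in a later object. The two error cases correspond to the duplicate-symbol and undefined-reference errors that real linkers report.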

iono
  • 4
    Some historical compilers used to produce executables directly. Some could even write an executable .COM file in a single pass during compilation [following the code for each procedure, the compiler could output a list of patch-points within that procedure along with the address of the previous procedure's patch-point list; startup code could make all of the necessary patches when the code was loaded]. This made rapid compilation possible in a very small memory footprint, even when using floppy disks. – supercat Mar 25 '14 at 21:18
  • 2
    If MS compilers produce object code directly, does that mean they have their own transformation process, or do they just transform the code to assembly in RAM and then to object code, without saving the assembly code as a file and using that file as the next input? – S. John Fagone May 06 '20 at 08:33
17

Almost all compilers, including gcc, produce assembly code because it's easier, both for generating code and for debugging the compiler. The major exceptions are usually just-in-time compilers or interactive compilers, whose authors don't want the performance overhead or the hassle of forking a whole process to run the assembler. Some interesting examples include

  • Standard ML of New Jersey, which runs interactively and compiles every expression on the fly.

  • The tinycc compiler, which is designed to be fast enough to compile, load, and run a C script in well under 100 milliseconds, and therefore doesn't want the overhead of calling the assembler and linker.

What these cases have in common is a desire for "instantaneous" response. Assemblers and linkers are plenty fast, but not quite good enough for interactive response. Yet.

There is also a large family of languages, such as Smalltalk, Java, and Lua, which compile to bytecode, not assembly code, but whose implementations may later translate that bytecode directly to machine code without benefit of an assembler.
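
As a small illustration of the bytecode step, here is a sketch using Python, which is not one of the languages named above but serves as a stand-in because CPython also compiles source to bytecode before running it. It shows only the source-to-bytecode step and interpretation by the virtual machine, not any later translation to machine code.

```python
import dis

source = "result = (2 + 3) * 4"
code = compile(source, "<example>", "exec")   # source text -> code object

namespace = {}
exec(code, namespace)                         # the VM interprets the bytecode

bytecode = code.co_code                       # raw bytecode as a bytes object
listing = dis.Bytecode(code).dis()            # human-readable disassembly
print(listing)
```

The exact instructions in the listing differ between Python versions, which is typical of bytecode formats that are private to one implementation.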

(Footnote: in the early 1990s, Mary Fernandez and I wrote the New Jersey Machine Code Toolkit, for which the code is online, which generates C libraries that compiler writers can use to bypass the standard assembler and linker. Mary used it to roughly double the speed of her optimizing linker when generating a.out. If you don't write to disk, speedups are even greater...)

Norman Ramsey
  • clang/LLVM, MSVC, and ICC all produce machine code directly. GCC is the exception, not the rule, among mainstream C/C++ compilers, at least for x86. These days, many compilers are implemented as front-ends for LLVM. – Peter Cordes Jul 02 '20 at 20:38
  • 1
    @PeterCordes please notice the date on my answer. The world has changed! – Norman Ramsey Aug 07 '20 at 14:14
  • Clang didn't exist in 2009, but I think my point was still mostly true for big mainstream C++ implementations back then. Many compilers for other languages do leave the object file format handling to a separate assembler so this answer isn't wrong, just ignoring a few C++ compilers that get used more than many other smaller compilers combined. Or in other words, this answer could use some maintenance. (See also [Does a compiler always produce an assembly code?](https://stackoverflow.com/q/14039843) for my attempt at answering basically a duplicate.) – Peter Cordes Aug 07 '20 at 15:12
7

According to chapter 2 of Introduction to Reverse Engineering Software (by Mike Perry and Nasko Oskov), both gcc and cl.exe (the back end compiler for MSVC++) have the -S switch you can use to output the assembly that each compiler produces.

You can also run gcc in verbose mode (gcc -v) to get a list of commands that it executes to see what it's doing behind the scenes.

Bill the Lizard
  • `gcc` internally does truly compile to a temporary `.s` asm file, and runs `as` on it. The `-S` option just stops there. MSVC on the other hand normally only outputs a `.obj`, and its assembly-output option makes a huge bloated `.asm` file (with definitions of templates you never called, or of library functions) that sometimes needs trimming down to even assemble + link correctly without duplicate-symbol errors. GCC does compile to asm in a very real sense during normal operation, MSVC doesn't. (Neither do ICC or clang/LLVM, but they can output asm that matches their .o) – Peter Cordes Aug 07 '20 at 15:18
7

GCC compiles to assembly. Some other compilers don't. For example, LLVM-GCC compiles to LLVM-assembly or LLVM-bytecode, which is then compiled to machine code. Almost all compilers have some sort of internal representation: LLVM-GCC uses LLVM, and, IIRC, GCC uses something called GIMPLE.

Zifre
  • True, but only GCC (out of big mainstream C/C++ compilers) actually writes asm as text to a file. GIMPLE is only dumped to a file as text if you use a debugging option, otherwise it's only represented via non-text data structures inside GCC's `cc1`. Similarly for LLVM-IR; it's probably never serialized into bytecode, let alone text, just passed around as data structures between the clang front-end and the LLVM back-end and its optimizer passes. I've heard of LLVM-GCC but IDK how it works. I guess you're saying it outputs a `.ll` of LLVM-IR, and runs llvm-as on it to optimize into a `.o`. – Peter Cordes Aug 07 '20 at 15:23
6

Compilers, in general, parse the source code into an Abstract Syntax Tree (an AST), then into some intermediate language. Only then, usually after some optimizations, do they emit the target language.

About gcc, it can compile to a wide variety of targets. I don't know if for x86 it compiles to assembly first, but I did give you some insight into compilers - and you asked for that too.

Asaf R
4

None of the answers clarifies the fact that an ASSEMBLER is the first layer of abstraction between BINARY CODE and MACHINE DEPENDENT SYMBOLIC CODE. A compiler is the second layer of abstraction between MACHINE DEPENDENT SYMBOLIC CODE and MACHINE INDEPENDENT SYMBOLIC CODE.

If a compiler directly converts code to binary code, by definition, it will be called an assembler and not a compiler.

It is more appropriate to say that a compiler uses INTERMEDIATE CODE, which may or may not be assembly language; e.g. Java uses bytecode as intermediate code, and bytecode is the assembly language of the Java virtual machine (JVM).

EDIT: You may wonder why an assembler always produces machine dependent code and why a compiler is capable of producing machine independent code. The answer is very simple. An assembler is a direct mapping to machine code, and therefore the assembly language it works with is always machine dependent. In contrast, we can write more than one version of a compiler for different machines. So to run our code independently of the machine, we must compile the same code, but with the compiler version written for that machine.

Bubba Yakoza
  • *If a compiler directly converts code to binary code, by definition, it will be called assembler and not a compiler.* - Tell that to [tcc, the Tiny C Compiler](https://bellard.org/tcc/), which very directly compiles C source into x86 machine code, without even an internal representation like GIMPLE or LLVM bytecode used internally. It's definitely *not* an assembler because its input is portable C. – Peter Cordes Oct 15 '20 at 12:17
  • Even clang/LLVM never actually creates a file containing asm text or LLVM bytecode, but it does have internal data structures that represent target-neutral LLVM "instructions" during optimization. Perhaps also ones that represent machine instructions in the final stages of optimization and codegen. – Peter Cordes Oct 15 '20 at 12:18
3

Some of the above answers confused me because in some answers GCC (GNU Compiler Collection) is mentioned as a single tool, but it's a suite of tools like the GNU Assembler (also known as GAS), linker, compiler and debugger, which are used together to produce an executable. And yes, GCC doesn't directly convert the C source file to machine code.

It does that in 4 steps:

  1. Pre-processing - Removing comments and expanding macros, etc.
  2. Compilation - Source to assembly (done by the compiler)
  3. Assembling - Assembly to machine code (done by the assembler)
  4. Linking - By default, linking standard functions dynamically to shared libraries (done by the linker)
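
The four steps above can be run one at a time with the standard `gcc` driver options `-E`, `-S` and `-c`. The following is a small Python sketch that builds and runs those commands; the helper names `stage_commands` and `build` and the `hello` file stem are made up for illustration. As a comment under this answer points out, a normal `gcc` run does preprocessing and compilation in a single `cc1` step, so splitting them here is purely for demonstration.

```python
import shutil
import subprocess


def stage_commands(stem):
    """Return the gcc invocation for each stage, for a source file <stem>.c."""
    return [
        ["gcc", "-E", f"{stem}.c", "-o", f"{stem}.i"],   # 1. preprocess only
        ["gcc", "-S", f"{stem}.i", "-o", f"{stem}.s"],   # 2. compile to assembly
        ["gcc", "-c", f"{stem}.s", "-o", f"{stem}.o"],   # 3. assemble to an object file
        ["gcc", f"{stem}.o", "-o", stem],                # 4. link into an executable
    ]


def build(stem):
    """Run the four stages in order; requires gcc on PATH and <stem>.c to exist."""
    if shutil.which("gcc") is None:
        raise RuntimeError("gcc not found on PATH")
    for command in stage_commands(stem):
        subprocess.run(command, check=True)   # stop at the first failing stage
```

Calling `build("hello")` next to a `hello.c` leaves `hello.i`, `hello.s` and `hello.o` on disk, so each intermediate result can be inspected.
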
General Grievance
  • GCC's C and C++ compilers combine C pre-processing and actual compilation to asm into a single step, done by `/usr/lib/gcc/x86_64-pc-linux-gnu/10.1.0/cc1` or `cc1plus` for example. This has been the case for many years. Decades ago CPP was a separate step that produced a temp file, but that's no longer the case. Then yes, asm->object files is done with (usually) `as` from GNU Binutils (a package maintained separately from GCC), and then linking with `ld` (also from Binutils). – Peter Cordes Oct 15 '20 at 12:06
  • 1
    GDB is yet another separate program, and is not involved at all in how the `gcc` front-end turns source into a linked executable. – Peter Cordes Oct 15 '20 at 12:09
  • Perhaps what you meant was that GDB's source code is in the same repository as GNU Binutils. This is true, although they're normally packaged separately. And is irrelevant to building executables. – Peter Cordes Jul 27 '22 at 16:13
1

There are many phases of compilation. In abstract, there is the front end that reads the source code, breaks it up into tokens and finally into a parse tree.

The back end is responsible for first generating sequential code such as three-address code. For example, the statement

x = y + z + w

is translated into:

reg1 = y + z
x = reg1 + w

The back end then optimizes that code, translates it into assembly, and finally into machine language. All steps are layered carefully so that, when needed, one of them can be replaced.
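
The lowering to three-address code shown above can be sketched in a few lines of Python. This is only an illustration under some assumptions: it borrows Python's own `ast` parser instead of a real C front end, handles just single assignments with `+`, `-`, `*` and `/`, and the function name `three_address` and the `regN` naming are made up to match the example.

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def three_address(statement):
    """Lower one assignment such as 'x = y + z + w' into three-address code."""
    assign = ast.parse(statement).body[0]
    if (not isinstance(assign, ast.Assign) or len(assign.targets) != 1
            or not isinstance(assign.targets[0], ast.Name)):
        raise ValueError("expected a single assignment to one name")
    code = []
    counter = 0

    def operand(node):
        """Return a name or constant, emitting temporaries for nested operations."""
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = operand(node.left)
            right = operand(node.right)
            counter += 1
            temp = f"reg{counter}"
            code.append(f"{temp} = {left} {OPS[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    target = assign.targets[0].id
    value = assign.value
    if isinstance(value, ast.BinOp) and type(value.op) in OPS:
        # The outermost operation writes straight into the assignment target.
        left = operand(value.left)
        right = operand(value.right)
        code.append(f"{target} = {left} {OPS[type(value.op)]} {right}")
    else:
        code.append(f"{target} = {operand(value)}")
    return code
```

Here `three_address("x = y + z + w")` returns `["reg1 = y + z", "x = reg1 + w"]`, matching the example. Each emitted line has at most one operator, which is what makes this form easy to optimize and to map onto machine instructions.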

Cosmin
jack
1

You'd probably be interested to listen to this podcast: Internals of GCC

Eric Andrew Lewis
Paul Hollingsworth
1

In most multi-pass compilers, assembly language is generated during the code generation steps. This allows you to write the lexer, syntax and semantic phases once and then generate executable code using a single assembler back end. This is used a lot in cross compilers, such as C compilers that generate code for a range of different CPUs.

Just about every compiler has some form of this, whether it's an implicit or explicit step.

MikeJ
0

Although not all compilers convert the source code into intermediate-level code, several compilers do use such a bridge to take the source code to machine-level code.

0

A listing file is a compiler-generated text file that contains the assembly language code produced by the compiler. Most compilers support the generation of listing files during the compilation process. For some compilers, such as GCC, this is a standard part of the compilation process because the compiler doesn’t directly generate an object file, but instead generates an assembly language file which is then processed by an assembler. In such compilers, requesting a listing file simply means that the compiler must not delete it after the assembler is done with it. In other compilers (such as the Microsoft or Intel compilers), a listing file is an optional feature that must be enabled through the command line.

0

Visual C++ has a switch to output assembly code, so I think it generates assembly code before outputting machine code.

friol
  • No, MSVC's asm output is not something it actually generates if you don't ask for it. And unlike LLVM, it's not even a real reflection of exactly what it would put in an object file if you did just compile. (e.g. if you assemble its output with MASM, you'll get a different .obj. The compiler asm output adds extra definitions for functions you didn't use. I think I've read that you sometimes even get link errors if you try to separately compile + assemble + link instead of just compiling + linking with MSVC.) – Peter Cordes Oct 15 '20 at 12:12
0

Java compilers compile to Java bytecode (a binary format), which is then run by a virtual machine (the JVM).

Whilst this may seem slow, it can be faster because the JVM can take advantage of later CPU instructions and new optimizations. A C++ compiler won't do this - you have to target the instruction set at compile time.

Fortyrunner