
From Thinking in C++ - Vol 1:

In the second pass, the code generator walks through the parse tree and generates either assembly language code or machine code for the nodes of the tree.

Well, at least with GCC, if we pass the option for generating assembly code, the compiler obliges by creating a file containing assembly code. But when we simply run the gcc command without any options, does it not produce the assembly code internally?

If yes, then why does it need to first produce assembly code and then translate it to machine language?
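
For concreteness, a minimal C file one could experiment with (hello.c is just an illustrative name; the relevant gcc options are in the comments):

    /* hello.c -- a minimal file for poking at gcc's stages.
     *
     *   gcc -S hello.c   # stop after compilation proper: writes hello.s (text asm)
     *   gcc -c hello.c   # also run the assembler: writes hello.o (object file)
     *   gcc    hello.c   # full pipeline: the asm goes to a temporary file,
     *                    # then is assembled and linked into a.out
     *   gcc -v hello.c   # print the cc1 / as / linker commands the driver runs
     */
    #include <stdio.h>

    int main(void)
    {
        puts("Hello World!");
        return 0;
    }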

Aquarius_Girl
  • Assembly language is just a plain-text version of machine code. It is easier to read, but there is a 1:1 correspondence. – Dan Byström Dec 26 '12 at 11:25
  • @DanByström thanks, but that's not the question. – Aquarius_Girl Dec 26 '12 at 11:26
  • Note: not every compiler generates code for a *physical* machine. There are compilers that generate code for a *virtual* machine, such as P-code, or the code run by a Flash application, or maybe even the code run by an executor for a query engine. Just-in-time code is another exception; it may or may not be compiled to physical opcodes for a physical machine. Generally speaking: any intermediate representation of the code may exist at some stage of the compilation process. – wildplasser Dec 26 '12 at 12:36
  • An example is the dmd compiler, which doesn't generate assembly code. – Jack Jan 04 '13 at 03:35
  • What is the correct answer to this question? – Jake Apr 05 '14 at 15:52

4 Answers


TL;DR: Handling different object-file formats, and (historically) easier portability to new Unix platforms, is one of the main reasons gcc keeps the assembler separate from the compiler, I think. Outside of gcc, the mainstream x86 C and C++ compilers (clang/LLVM, MSVC, ICC) go straight to machine code, with the option of printing asm text if you ask them to.

LLVM and MSVC are (or come with) complete toolchains, not just compilers; they also come with an assembler and linker. LLVM already has object-file handling as library functionality, so it can use that instead of writing out asm text to feed to a separate program.

Smaller projects often choose to leave object-file format details to the assembler; e.g. FreePascal can go straight to an object file on a few of its target platforms, but otherwise only to asm. There are many claims (1, 2, 3, 4) that almost all compilers go through asm text, but that's not true of many of the biggest, most widely used compilers (other than GCC) that have lots of developers working on them.

C compilers tend either to target a single platform only (like a vendor's compiler for a microcontroller), written as "the/a C implementation for this platform", or to be very large projects like LLVM, where including machine-code generation isn't a big fraction of the compiler's own code size. Compilers for less widely used languages are more often portable, but their authors don't want to write their own machine-code / object-file handling. (Many compilers these days are front-ends for LLVM, so get .o output for free, like rustc, but older compilers didn't have that option.)

Out of all compilers ever, most do go to asm. But if you weight by how often each one is used every day, going straight to a relocatable object file (.o / .obj) accounts for a significant fraction of the total builds done on any given day worldwide; i.e. the compiler you care about if you're reading this might well work this way.

Also, compilers like javac that target a portable bytecode format have less reason to use asm; the same output file and bytecode format work across every platform they have to run on.

Why GCC does what it does

Yes, as is a separate program that the gcc front-end actually runs separately from cc1 (the C preprocessor+compiler that produces text asm).

This makes gcc slightly more modular, letting the compiler itself be a text -> text program.

GCC internally uses some binary data structures for GIMPLE and RTL internal representations, but it doesn't write (text representations of) those IR formats to files unless you use a special option for debugging.
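
For example, if you do want to look at those IRs, GCC has dump options for them (a small sketch; the exact dump-file names vary by GCC version):

    /* ir_demo.c -- a function small enough to read in GCC's IR dumps.
     *
     *   gcc -c -fdump-tree-gimple ir_demo.c   # dump the GIMPLE representation
     *   gcc -c -fdump-rtl-all     ir_demo.c   # dump one file per RTL pass
     *
     * These dumps are for debugging/inspection only; the normal pipeline
     * only materializes the text asm that gets fed to as.
     */
    int square(int x)
    {
        return x * x;
    }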

So why stop at assembly? This means GCC doesn't need to know about different object file formats for the same target. For example, different x86-64 OSes use ELF, PE/COFF, MachO64 object files, and historically a.out. as assembles the same text asm into the same machine code surrounded by different object file metadata on different targets. (There are minor differences gcc has to know about, like whether to prepend an _ to symbol names or not, and whether 32-bit absolute addresses can be used, and whether code has to be PIC.)

Any platform-specific quirks can be left to GNU binutils as (aka GAS), or gcc can use the vendor-supplied assembler that comes with a system.

Historically, there were many different Unix systems with different CPUs, or especially the same CPU but different quirks in their object file formats, and, more importantly, a fairly compatible set of assembler directives like .globl main, .asciiz "Hello World!\n", and similar. GAS syntax comes from Unix assemblers.
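
As a concrete illustration (a trimmed, approximate sketch, assuming x86-64 GNU/Linux; the exact directives, alignment, and metadata depend on GCC version and target), a one-line C file turns into text asm built mostly from that kind of directive:

    /* str.c -- compile with `gcc -S str.c` and read str.s.
     * The output looks roughly like:
     *
     *           .globl  greeting
     *           .section .rodata
     *   greeting:
     *           .string "Hello World!\n"
     */
    const char greeting[] = "Hello World!\n";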

It really was possible in the past to port GCC to a new Unix platform without porting as, just using the assembler that comes with the OS.

Nobody has ever gotten around to integrating an assembler as a library into GCC's cc1 compiler. That's been done for the C preprocessor (which historically was also done in a separate process), but not the assembler.


Most other compilers do produce object files directly from the compiler, without a text-asm temporary file / pipe, often because the compiler was only designed for one or a couple of targets, like MSVC or ICC, or various compilers that started out as x86-only, or many vendor-supplied compilers for embedded chips.

clang/LLVM was designed much more recently than GCC. It was designed to work as an optimizing JIT back-end, so it needed a built-in assembler to make it fast to generate machine code. To work as an ahead-of-time compiler, adding support for different object-file formats was presumably a minor thing since the internal software architecture was there to go straight to binary machine code.

LLVM of course uses LLVM-IR internally for target-independent optimizations before looking for back-end-specific optimizations, but again it only writes out this format as text if you ask it to.


Peter Cordes
  • _compilers like javac that target a portable bytecode format have less reason to use asm_ - the Java byte code is very high level so the assembly stage does not really apply. – Thorbjørn Ravn Andersen May 02 '21 at 07:11
  • also to my understanding gcc started using the assembler provided by the Unix vendor instead of having to bring its own. This was one of the reasons that a binary distribution was made of gcc for Solaris as the compiler tool chain was not part of the basic operating system. – Thorbjørn Ravn Andersen May 02 '21 at 07:17
  • @ThorbjørnRavnAndersen: Indeed, that's a good example of GCC working with a vendor-supplied assembler / linker like I mentioned, instead of needing GNU `as` / `ld` too, to support the file formats. Re: Java bytecode: it's still a binary format with instructions, and one can make a text representation of it (e.g. for debugging purposes). The main reason it's not useful to separate the parser/compiler logic from writing `.class` files is that the `.class` file format is portable; there's no need to consider writing Java bytecode into different formats of binary files. Not that it's high-level. – Peter Cordes May 02 '21 at 07:24

The assembler stage can be justified for two reasons:

  • it allows C/C++ code to be translated to a machine-independent abstract assembler, from which there exist easy conversions to a multitude of different instruction set architectures
  • it removes the burden of validating correct opcode, prefix, r/m, etc. instruction encodings for CISC architectures, when one can use an existing software component.

The 1st edition of that book is from 2000, but it may as well be talking about the early '90s, when C++ itself was translated to C and when the GNU/free-software idea (including source code for compilers) was not widely known.

EDIT: One of several "nonsensical" abstract, machine-independent languages used by GCC is RTL -- Register Transfer Language.

Aki Suihkonen
  • "Machine independent abstract assembler" is just nonsense. – Ira Baxter Dec 26 '12 at 12:28
  • This doesn't explain why `as` is a separate program that the `gcc` front-end actually runs separately from `cc1` (the C -> asm preprocessor+compiler). Sure gcc uses GIMPLE and RTL internally, but it doesn't write text representations of those IR formats to files unless you use a special option for debugging. LLVM uses LLVM-IR internally, and also has a built-in assembler that knows about different object-file formats for each target (ELF, PE/COFF, MachO64 on x86-64, etc.) Object file formats are one of the main reasons for keeping the assembler separate, AFAIK. – Peter Cordes Dec 17 '18 at 14:57

It's a matter of compiler implementation. Assembly code is an intermediate step between the higher-level language (the one being compiled) and the resulting binary output. In general it's easier to convert to assembly first and then to binary code than to create the binary code directly.

SomeWittyUsername
  • @AnishaKaul For a compiled program the translation is performed only once (so it's acceptable); for an interpreted program the translation is performed every time it's executed. – SomeWittyUsername Dec 26 '12 at 11:44
  • It's easier to debug by reading ASCII than bits. And if you already have a reliable tool that goes from asm to object, use it (the Unix way: build layers on top of other tools). – old_timer Dec 26 '12 at 13:02

GCC does create the assembly code as a temporary file, calls the assembler, and maybe the linker, depending on what you do or don't add on the command line. That makes an object file and then, if enabled, the binary, and then all the temporary files are cleaned up. Use -save-temps to see what is really going on (there are a number of temporary files).
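
For example (file names here are only illustrative), a sketch of what -save-temps leaves behind:

    /* demo.c -- build with `gcc -save-temps demo.c` and the intermediate
     * files stay in the current directory instead of being deleted:
     *
     *   demo.i   preprocessed source
     *   demo.s   the text assembly emitted by the compiler proper
     *   demo.o   the object file produced by the assembler
     *   a.out    the linked executable
     */
    int main(void)
    {
        return 0;
    }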

Running gcc without any options absolutely creates an asm file.

There is no "need" for this, it is simply how they happened to design it. I assume for multiple reasons, you will already want/need an assembler and linker before you start on a compiler (cart before the horse, asm on a processor before some other language). "The unix way" is to not re-invent tools or libraries, but just add a little on top, so that would imply going to asm then letting the assembler and linker do the rest. You dont have to re-invent so much of the assemblers job that way (multiple passes, resolving labels, etc). It is easier for a developer to debug ascii asm than bits. Folks have been doing it this way for generations of compilers. Just in time compilers are the primary exception to this habit, by definition they have to be able to go to machine code, so they do or can. Only recently though did llvm provide a way for the command line tools (llc) to go straight to object without stopping at asm (or at least it appears that way to the user).

old_timer