The "proper" / standard way to generate machine code is with an optimizing compiler that transforms through an internal representation (often an SSA form) and looks very hard for all kinds of optimizations.
An interpreter is easier to write, and a well-written one can give better performance than inefficiently / naively generated asm, so there's no standard "simple" way to generate asm, because nobody wants that. (Except as a hobby project to teach themselves about compilers, I guess.)
Writing a good compiler yourself would be decades of work. See Why are there so few C compilers?, especially Basile Starynkevitch's answer. That would be true even for "simple" CPUs with less complex behaviour than modern x86-64; optimizing away redundant work, deciding when to inline functions, and so on is not easy.
But optimizing for modern x86-64 ranges from easy (out-of-order execution doesn't care very much about instruction ordering) to arcane (e.g. `inc eax` saves code size vs. `add eax,1`, but on some CPUs in some cases it's slower: either multiple uops or a partial-flag stall). Or that a 3-component LEA has higher latency (but maybe better throughput) than 2 separate LEA / ADD instructions on Intel Sandybridge-family CPUs. See also Agner Fog's optimization guides and other performance-optimization links in the x86 tag wiki. It's only worth worrying about any of this if you're going to try to optimize at all; efficiently doing a lot of redundant work is not that useful.
To make a compiler for a new language, you can just write a front-end that generates LLVM-IR and feeds that to the LLVM library for it to optimize and emit asm or machine code for. (You can do the same thing for GIMPLE, using gcc's optimizing middle / back-end instead of LLVM). As a bonus, your compiler hopefully works on most of the CPU architectures that LLVM or gcc support, not just x86.
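To get a feel for what such a front-end produces, here's a minimal sketch (in Python, with made-up function names) that emits *textual* LLVM IR for a trivial function. A real front-end would build IR through the LLVM C++ API or bindings such as llvmlite rather than pasting strings, but the IR it hands to the optimizer looks the same either way:

```python
# Hypothetical sketch: a toy "front-end" that emits textual LLVM IR
# for a function computing x + some constant.  String pasting is not
# how real front-ends work (they use the LLVM API), but it shows the
# shape of the IR you'd be generating.

def emit_add_const(func_name: str, const: int) -> str:
    """Emit LLVM IR for: i32 func_name(i32 %x) { return %x + const; }"""
    lines = [
        f"define i32 @{func_name}(i32 %x) {{",
        "entry:",
        f"  %sum = add i32 %x, {const}",
        "  ret i32 %sum",
        "}",
    ]
    return "\n".join(lines)

ir = emit_add_const("add14", 14)
print(ir)
```

Feeding that text to `opt` / `llc` (or the library equivalents) is what gets you optimized asm for any target LLVM supports.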
See the Implementing a Language with LLVM tutorial for an example.
Naively transliterating every part of every expression into asm instructions separately will produce slow and bloated asm, perhaps similar to what you get from `clang -O0`. But even `clang -O0` optimizes within expressions, so `10 + x + 4` is still compiled the same as `x + 14`. `clang -O0` also has the added burden of spilling everything to memory after every statement, so you can modify C variables in memory with a debugger at any breakpoint. (This is part of what `-O0` means: guarantee consistent debugging, as well as compiling fast with minimal effort spent on optimization.)
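That within-expression cleanup is just constant folding plus a little reassociation over the expression tree. A minimal sketch (toy AST of nested tuples; all names here are made up for illustration, this is not clang's representation):

```python
# Toy constant folding over a tiny expression AST: ("+", lhs, rhs)
# nodes, int literals, and variable-name strings.  Folding
# 10 + x + 4 down to x + 14 is the kind of within-expression
# cleanup that even clang -O0 performs.

def fold(expr):
    if not isinstance(expr, tuple):          # leaf: int literal or var name
        return expr
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return lhs + rhs                     # both sides constant: fold now
    # Reassociate (c + x) + c2  ->  x + (c + c2), and (x + c) + c2 likewise
    if isinstance(rhs, int) and isinstance(lhs, tuple):
        _, a, b = lhs
        if isinstance(a, int):
            return ("+", b, a + rhs)
        if isinstance(b, int):
            return ("+", a, b + rhs)
    return (op, lhs, rhs)

# 10 + x + 4 parses left-associatively as (10 + x) + 4:
print(fold(("+", ("+", 10, "x"), 4)))        # → ('+', 'x', 14)
```

A naive transliterating compiler would instead emit an add for each `+` node, materializing `10 + x` in a register before adding `4` to it.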
A naive compiler that didn't care about that could potentially keep track of which values were live in which registers and spill an old one when a new register was needed. This can easily turn out terribly if you don't look ahead at which values are needed soon, so you can prefer to keep those live in registers.
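A rough sketch of that bookkeeping (hypothetical names; spill/load emitted as pseudo-asm strings). With no lookahead, evicting the least-recently allocated value can throw out exactly the value the next instruction needs:

```python
# Naive register tracking: map live values to registers, and when no
# register is free, spill the least-recently allocated value.  No
# lookahead, so a value needed again soon can get evicted anyway.
from collections import OrderedDict

REGS = ["rax", "rcx", "rdx"]           # pretend we only have 3 registers

class NaiveAllocator:
    def __init__(self):
        self.in_reg = OrderedDict()    # value name -> register, oldest first
        self.free = list(REGS)
        self.code = []                 # emitted pseudo-asm

    def get_reg(self, value):
        if value in self.in_reg:       # already live in a register
            return self.in_reg[value]
        if not self.free:              # no free register: spill the oldest
            victim, reg = self.in_reg.popitem(last=False)
            self.code.append(f"mov [{victim}_slot], {reg}   ; spill {victim}")
            self.free.append(reg)
        reg = self.free.pop()
        self.code.append(f"mov {reg}, [{value}_slot]   ; load {value}")
        self.in_reg[value] = reg
        return reg

alloc = NaiveAllocator()
for v in ["a", "b", "c", "a", "d", "a"]:   # "a" gets evicted, then reloaded
    alloc.get_reg(v)
print("\n".join(alloc.code))
```

Here the access to `d` spills `a` even though `a` is used again on the very next step, which is exactly the pathology that looking ahead would avoid.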
If you don't care about the quality of the asm you generate, then sure, do whatever is convenient.
TinyCC is a one-pass C compiler. When it emits a function prologue, it hasn't decided yet how many bytes of stack space it needs to reserve. (It comes back and fills that in once it reaches the end of the function.) See Tiny C Compiler's generated code emits extra (unnecessary?) NOPs and JMPs for an interesting consequence of that: a `nop` to pad one version of its function prologue.
IDK what it does internally, but presumably as it encounters new variable declarations it tacks them onto the end of the stack frame it's going to reserve (thus not changing the offset from `rbp` to any of the existing variables, because it may already have emitted code that uses them).
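A sketch of that presumed one-pass scheme (an illustration of the idea, not TCC's actual code): emit the prologue with a placeholder, hand out a fixed `rbp`-relative offset to each local as its declaration is seen, and back-patch the `sub rsp, N` at the end of the function:

```python
# One-pass stack-frame layout with a back-patched prologue.  Earlier
# locals' offsets never change, so code already emitted for them
# stays valid even as later declarations grow the frame.

class OnePassFrame:
    def __init__(self):
        self.code = ["push rbp", "mov rbp, rsp", "sub rsp, {FRAME}"]
        self.offsets = {}              # local name -> offset from rbp
        self.size = 0

    def declare(self, name, nbytes):
        self.size += nbytes            # tack onto the end of the frame
        self.offsets[name] = -self.size
        return self.offsets[name]

    def load(self, reg, name):
        self.code.append(f"mov {reg}, [rbp{self.offsets[name]}]")

    def finish(self):
        # Round the frame up to 16 bytes and back-patch the prologue.
        frame = (self.size + 15) & ~15
        self.code[2] = self.code[2].format(FRAME=frame)
        self.code += ["leave", "ret"]
        return self.code

f = OnePassFrame()
f.declare("x", 4)                      # x at [rbp-4]
f.declare("y", 8)                      # y at [rbp-12]
f.load("eax", "x")                     # code for x emitted now...
f.declare("z", 4)                      # ...still valid after z is added
print("\n".join(f.finish()))
```

The key property is that `mov eax, [rbp-4]` was emitted before `z` existed and is still correct afterwards; only the single `sub rsp, N` placeholder ever needs patching.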
TCC is open source, and written to be small / simple (and compile fast), not to create good asm, so you might want to have a look at its source code to see what it does.