The "proper" / standard way to generate machine code is with an optimizing compiler that transforms through an internal representation (often an SSA form) and looks very hard for all kinds of optimizations.
An interpreter is easier to write, and a well-written one can give better performance than inefficiently / naively generated asm, so there's no standard "simple" way to generate asm, because nobody wants that. (Except as a hobby project to teach themselves about compilers, I guess.)
Writing a good compiler yourself would be decades of work. See Why are there so few C compilers?, especially Basile Starynkevitch's answer. That would be true even for "simple" CPUs with less complex behaviour than modern x86-64; optimizing away redundant work, deciding when to inline functions, and so on is not easy.
But optimizing for modern x86-64 ranges from easy (out-of-order execution doesn't care very much about instruction ordering) to arcane (e.g. `inc eax` saves code size vs. `add eax,1`, but on some CPUs in some cases it's slower: either multiple uops or a partial-flag stall). Or that a 3-component LEA has higher latency (but maybe better throughput) than 2 separate LEA / ADD instructions on Intel Sandybridge-family CPUs. See also Agner Fog's optimization guides and other performance-optimization links in the x86 tag wiki. It's only worth worrying about any of this if you're going to try to optimize at all; efficiently doing a lot of redundant work is not that useful.
To make a compiler for a new language, you can just write a front-end that generates LLVM-IR and feeds that to the LLVM library for it to optimize and emit asm or machine code for. (You can do the same thing for GIMPLE, using gcc's optimizing middle / back-end instead of LLVM). As a bonus, your compiler hopefully works on most of the CPU architectures that LLVM or gcc support, not just x86.
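To get a feel for what such a front-end produces, here's a minimal sketch (in Python, with made-up function names) that emits *textual* LLVM IR for a trivial function. A real front-end would build IR through the LLVM C++ API or bindings such as llvmlite rather than pasting strings, but the IR it hands to the optimizer looks the same either way:

```python
# Hypothetical sketch: a toy "front-end" that emits textual LLVM IR
# for a function computing x + some constant.  String pasting is not
# how real front-ends work (they use the LLVM API), but it shows the
# shape of the IR you'd be generating.

def emit_add_const(func_name: str, const: int) -> str:
    """Emit LLVM IR for: i32 func_name(i32 %x) { return %x + const; }"""
    lines = [
        f"define i32 @{func_name}(i32 %x) {{",
        "entry:",
        f"  %sum = add i32 %x, {const}",
        "  ret i32 %sum",
        "}",
    ]
    return "\n".join(lines)

ir = emit_add_const("add14", 14)
print(ir)
```

Feeding that text to `opt` / `llc` (or the library equivalents) is what gets you optimized asm for any target LLVM supports.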
See the Implementing a Language with LLVM tutorial for an example.
Naively transliterating every part of every expression into asm instructions separately will produce slow and bloated asm, perhaps similar to what you get from `clang -O0`. But even `clang -O0` optimizes within expressions, so `10 + x + 4` is still compiled the same as `x + 14`. `clang -O0` also has the added burden of spilling everything to memory after every statement, so you can modify C variables in memory with a debugger at any breakpoint. (This is part of what `-O0` means: guarantee consistent debugging, as well as compiling fast with minimal effort spent on optimization.)
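That within-expression cleanup is just constant folding plus a little reassociation over the expression tree. A minimal sketch (toy AST of nested tuples; all names here are made up for illustration, this is not clang's representation):

```python
# Toy constant folding over a tiny expression AST: ("+", lhs, rhs)
# nodes, int literals, and variable-name strings.  Folding
# 10 + x + 4 down to x + 14 is the kind of within-expression
# cleanup that even clang -O0 performs.

def fold(expr):
    if not isinstance(expr, tuple):          # leaf: int literal or var name
        return expr
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return lhs + rhs                     # both sides constant: fold now
    # Reassociate (c + x) + c2  ->  x + (c + c2), and (x + c) + c2 likewise
    if isinstance(rhs, int) and isinstance(lhs, tuple):
        _, a, b = lhs
        if isinstance(a, int):
            return ("+", b, a + rhs)
        if isinstance(b, int):
            return ("+", a, b + rhs)
    return (op, lhs, rhs)

# 10 + x + 4 parses left-associatively as (10 + x) + 4:
print(fold(("+", ("+", 10, "x"), 4)))        # → ('+', 'x', 14)
```

A naive transliterating compiler would instead emit an add for each `+` node, materializing `10 + x` in a register before adding `4` to it.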
A naive compiler that didn't care about that could potentially keep track of which values were live in which registers and spill an old one when a new register was needed. This can easily turn out terribly if you don't look ahead at which values are needed soon, so you can prefer to keep those live in registers.
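A rough sketch of that bookkeeping (hypothetical names; spill/load emitted as pseudo-asm strings). With no lookahead, evicting the least-recently allocated value can throw out exactly the value the next instruction needs:

```python
# Naive register tracking: map live values to registers, and when no
# register is free, spill the least-recently allocated value.  No
# lookahead, so a value needed again soon can get evicted anyway.
from collections import OrderedDict

REGS = ["rax", "rcx", "rdx"]           # pretend we only have 3 registers

class NaiveAllocator:
    def __init__(self):
        self.in_reg = OrderedDict()    # value name -> register, oldest first
        self.free = list(REGS)
        self.code = []                 # emitted pseudo-asm

    def get_reg(self, value):
        if value in self.in_reg:       # already live in a register
            return self.in_reg[value]
        if not self.free:              # no free register: spill the oldest
            victim, reg = self.in_reg.popitem(last=False)
            self.code.append(f"mov [{victim}_slot], {reg}   ; spill {victim}")
            self.free.append(reg)
        reg = self.free.pop()
        self.code.append(f"mov {reg}, [{value}_slot]   ; load {value}")
        self.in_reg[value] = reg
        return reg

alloc = NaiveAllocator()
for v in ["a", "b", "c", "a", "d", "a"]:   # "a" gets evicted, then reloaded
    alloc.get_reg(v)
print("\n".join(alloc.code))
```

Here the access to `d` spills `a` even though `a` is used again on the very next step, which is exactly the pathology that looking ahead would avoid.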
If you don't care about the quality of the asm you generate, then sure, do whatever is convenient.
TinyCC is a one-pass C compiler. When it emits a function prologue, it hasn't decided yet how many bytes of stack space it needs to reserve. (It comes back and fills that in once it reaches the end of the function.) See Tiny C Compiler's generated code emits extra (unnecessary?) NOPs and JMPs for an interesting consequence of that: a `nop` to pad one version of its function prologue.
IDK what it does internally, but presumably as it encounters new variable declarations it tacks them onto the end of the stack frame it's going to reserve (thus not changing the offset from `rbp` to any of the existing variables, because it may already have emitted code that uses them).
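A sketch of that presumed one-pass scheme (an illustration of the idea, not TCC's actual code): emit the prologue with a placeholder, hand out a fixed `rbp`-relative offset to each local as its declaration is seen, and back-patch the `sub rsp, N` at the end of the function:

```python
# One-pass stack-frame layout with a back-patched prologue.  Earlier
# locals' offsets never change, so code already emitted for them
# stays valid even as later declarations grow the frame.

class OnePassFrame:
    def __init__(self):
        self.code = ["push rbp", "mov rbp, rsp", "sub rsp, {FRAME}"]
        self.offsets = {}              # local name -> offset from rbp
        self.size = 0

    def declare(self, name, nbytes):
        self.size += nbytes            # tack onto the end of the frame
        self.offsets[name] = -self.size
        return self.offsets[name]

    def load(self, reg, name):
        self.code.append(f"mov {reg}, [rbp{self.offsets[name]}]")

    def finish(self):
        # Round the frame up to 16 bytes and back-patch the prologue.
        frame = (self.size + 15) & ~15
        self.code[2] = self.code[2].format(FRAME=frame)
        self.code += ["leave", "ret"]
        return self.code

f = OnePassFrame()
f.declare("x", 4)                      # x at [rbp-4]
f.declare("y", 8)                      # y at [rbp-12]
f.load("eax", "x")                     # code for x emitted now...
f.declare("z", 4)                      # ...still valid after z is added
print("\n".join(f.finish()))
```

The key property is that `mov eax, [rbp-4]` was emitted before `z` existed and is still correct afterwards; only the single `sub rsp, N` placeholder ever needs patching.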
TCC is open source, and written to be small / simple (and compile fast), not to create good asm, so you might want to have a look at its source code to see what it does.