The simplest "two pass" assembler can literally read the input text twice. Ideally it uses the exact same code on both passes, doing the same work to parse and understand each line of assembly text, but on the first pass it only gathers symbol definitions and counts locations rather than generating any machine code. The first pass builds nothing but a symbol table, which is a dictionary of key-value pairs where a name is the key and a location is the value. The second pass generates the machine code but not the symbol table.
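To make this concrete, here is a minimal sketch of the two-pass structure over an invented toy instruction set in which every statement assembles to exactly one word (parse_line, encode, and the opcode table are hypothetical stand-ins, not any real assembler's API):

```python
# A minimal two-pass sketch over a hypothetical toy ISA in which every
# statement assembles to exactly one word, so the location counter just
# increments. The helpers and opcodes below are invented for illustration.

def parse_line(line):
    """Return (label, op, arg) from 'label: op arg'; label/arg may be None."""
    label = None
    if ":" in line:
        label, line = line.split(":", 1)
        label = label.strip()
    parts = line.split()
    op = parts[0] if parts else None
    arg = parts[1] if len(parts) > 1 else None
    return label, op, arg

def encode(op, arg, symbols):
    """Hypothetical encoder: pack an opcode and resolved operand into one word."""
    opcodes = {"nop": 0x00, "jmp": 0x10, "load": 0x20}   # invented opcodes
    value = symbols[arg] if arg in symbols else int(arg or 0)
    return (opcodes[op] << 8) | (value & 0xFF)

def assemble(lines):
    # Pass 1: the same parsing work, but only collect the symbol table.
    symbols, loc = {}, 0
    for line in lines:
        label, op, _ = parse_line(line)
        if label:
            symbols[label] = loc
        if op:
            loc += 1
    # Pass 2: the same parsing work again, now emitting code; every symbol,
    # forward or backward, is already in the table.
    code = []
    for line in lines:
        _, op, arg = parse_line(line)
        if op:
            code.append(encode(op, arg, symbols))
    return code

print(assemble(["start: load 1", "jmp end", "nop", "end: jmp start"]))
```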
A simple one pass assembler can literally read the input text only once, generating machine code as it goes, but it must remember where forward references occur in the generated machine code (and what kind of reference each is), since those have to be fixed up aka patched when the referenced symbol's value is finally discovered.
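A one-pass variant of the same toy assembler might look like the sketch below, reusing the hypothetical parse_line and encode helpers from above; the key addition is the list of fixup records and the patch loop at the end:

```python
# One-pass variant of the toy assembler: emit code immediately, and record
# a fixup for each forward reference so it can be patched once the label's
# location is finally known. Reuses the hypothetical parse_line/encode
# helpers sketched above.

def assemble_one_pass(lines):
    symbols, code, fixups = {}, [], []   # fixups: (location, symbol) pairs
    for line in lines:
        label, op, arg = parse_line(line)
        if label:
            symbols[label] = len(code)
        if not op:
            continue
        if arg and not arg.isdigit() and arg not in symbols:
            # Forward reference: emit a placeholder operand of 0 and record
            # where (and to which symbol) a patch will be needed.
            fixups.append((len(code), arg))
            code.append(encode(op, "0", symbols))
        else:
            code.append(encode(op, arg, symbols))
    # All labels are now known; patch the recorded placeholder operands.
    for loc, name in fixups:
        code[loc] = (code[loc] & ~0xFF) | (symbols[name] & 0xFF)
    return code
```

In this toy every fixup is the same kind of patch; in a real assembler the record would also carry the reference kind (absolute vs pc-relative, operand width, and so on).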
Both these designs suffer if there are variable-length instructions whose optimal length depends on the positioning of labels and code. This happens, in particular, with instruction sets that have both short and long branch instructions, where a short branch can be used when the delta between the branch instruction's location and the branch target's location fits within some range (e.g. fits in 8 bits). If a branch has to change from short to long or vice versa, that can shift code & label locations, which can in turn affect the sizes of other branches.
In these circumstances, I have found the best results (smallest generated code size) come from assuming short branches when the branch distance is unknown; in code that has a fair amount of branching, assuming long branches by default will force many branches to actually require the long form when they otherwise wouldn't have. An assembler attempting optimal (minimal) code size has to be prepared to make adjustments broadly across the generated machine code, so parsing once to some intermediate form and working on that intermediate is probably the best approach, since it allows quicker iteration when changes require cascading analysis and adjustment. There are lots of alternatives for the intermediate form, ranging from a node per line of assembly code to a node per larger chunk of machine code having at most one potentially variable-size instruction at its end.
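Here is one sketch of that start-short approach, assuming the chunked intermediate form just described: each chunk is a run of fixed-size code optionally ending in one branch whose size depends on the distance to its target (the 2-byte/4-byte encodings and the 8-bit short-branch range are invented for illustration):

```python
# Start-short branch relaxation over a chunked intermediate form. Every
# chunk ends in at most one variable-size branch, initially assumed short;
# the loop widens branches whose targets turn out to be out of range and
# repeats, since widening moves addresses and can cascade to other branches.

SHORT, LONG = 2, 4   # hypothetical branch encodings: 2-byte vs 4-byte

class Chunk:
    def __init__(self, fixed_size, target=None):
        self.fixed_size = fixed_size   # bytes of non-branch code
        self.target = target           # index of target chunk, or None
        self.branch_size = SHORT if target is not None else 0  # optimistic

def relax(chunks):
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        # Recompute every chunk's start address from the current size guesses.
        addrs, pos = [], 0
        for c in chunks:
            addrs.append(pos)
            pos += c.fixed_size + c.branch_size
        for i, c in enumerate(chunks):
            if c.target is None or c.branch_size == LONG:
                continue
            branch_at = addrs[i] + c.fixed_size
            delta = addrs[c.target] - (branch_at + SHORT)
            if not -128 <= delta <= 127:     # short form no longer reaches
                c.branch_size = LONG         # widen; may cascade, so loop
                changed = True
    return addrs
```

Only ever widening a branch (never shrinking one back) keeps the loop monotonic, which guarantees termination and avoids oscillating between the two sizes.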
Things are slightly more complicated when the assembler generates object files as part of separate (per-file) compilation, intended for combination by a linker. In these circumstances there will be external imported and exported symbols as well as internal symbols. Final fixing / patching for externally defined symbols (i.e. used here but defined there) is done by the linker. The output of the assembler (an object file) then contains machine code (and data), fixup records (indicating references to unresolved symbols), imports (names of unresolved symbols), and exports (names & locations of exported symbols).
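As a sketch, that object file payload might be modeled like this (the field names are invented; no particular real format such as ELF or COFF is implied):

```python
# A sketch of the object-file contents described above, with invented
# field names rather than any real object format's layout.

from dataclasses import dataclass, field

@dataclass
class Fixup:
    location: int     # offset in `code` that needs a patch
    symbol: str       # unresolved name the patch refers to
    kind: str         # kind of reference, e.g. "absolute" vs "pc-relative"

@dataclass
class ObjectFile:
    code: bytearray = field(default_factory=bytearray)      # machine code & data
    fixups: list[Fixup] = field(default_factory=list)        # unresolved references
    imports: set[str] = field(default_factory=set)           # names defined elsewhere
    exports: dict[str, int] = field(default_factory=dict)    # name -> location
```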
A simplistic two pass assembler without support for multi-file linking could make the simplifying assumption that all symbols will be resolved by the first pass, and therefore that no fixup / patching information needs to be generated or stored. Adding support for multi-file linking eliminates this simplification, which is another reason I prefer the "one pass" assembler design: it already has an inherent notion of fixup records for forward references within a single file, and so is more easily adapted to multi-file linking.
Some toolchains for the RISC-V family of processors (and some other processors) support linker "relaxation". This allows the size of reference code sequences (branches, and usually function calls) to change based on the final locations of symbols and code across multiple compilation units; for example, a call emitted as an auipc+jalr pair can be relaxed to a single jal when the target turns out to be within reach. It means the aforementioned chunks of machine code are kept separate in the final compiler / assembler output so the linker can adjust them, and even internal branches within a single object file (normally fully resolved in other systems) may still require adjustment (and maybe resizing as well).
In systems with linker relaxation, for cross-compilation-unit references, the compiler or assembler typically emits the largest code sequence in the first place and lets the relaxing linker shorten code sequences or operand sizes where possible. This is because even without relaxation being performed (as one might want for debugging), the larger code still runs correctly, merely with some call sites using more code space than needed.
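A sketch of the linker's side of that bargain follows: deciding which maximal call sequences can shrink once final symbol addresses are known. The 8-byte vs 4-byte sizes mirror RISC-V's auipc+jalr pair vs the single jal (reach roughly ±1 MiB), but the data structures are invented for illustration:

```python
# Linker-side relaxation sketch: every cross-unit call was emitted in the
# largest (far) form; shrink the ones whose now-known distance fits the
# near form's reach. Sizes loosely mirror RISC-V (8-byte auipc+jalr pair
# vs 4-byte jal); the inputs are invented, not a real linker's structures.

FAR_CALL, NEAR_CALL = 8, 4
NEAR_REACH = 1 << 20            # jal reaches roughly +/- 1 MiB

def relax_calls(calls, symbols):
    """calls: list of (call_site_address, target_symbol). Returns bytes saved.
    This sketch only decides which sites can shrink; a real linker must also
    rewrite every later address after each deletion."""
    saved = 0
    for addr, sym in calls:
        delta = symbols[sym] - addr
        if -NEAR_REACH <= delta < NEAR_REACH:
            saved += FAR_CALL - NEAR_CALL
    return saved
```

A real relaxing linker also has to cascade: deleting bytes at one call site moves every later address, which can bring further sites into range, so the decision loop repeats until nothing changes, much like the assembler-side loop above.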
(There are numerous other features of link-time operation & optimization, such as inserting a branch island or thunk on architectures where the shorter branch sequences can't actually reach. And then there is the continuing subject of both static loading and dynamic loading (DLLs), in which some symbol resolution, and the associated fixups, are delayed further still.)