A lexical token or simply token is a pair consisting of a token name and an optional token value.
I've spent a lot of time trying to figure out whether a token value is supposed to be a string, or whether it could be something else, like a lookup index.
As an example of a lookup index, consider the following literal:

99

which would produce the following token (note that 315 is the token's lookup index, used to query the lookup table):

sample-token
<integer-literal, 315>

This token is then tightly coupled with its lookup table, which contains the actual token value:

integer-lookup-table
...
<315, 99>
...
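To make the scheme concrete, here is a minimal sketch (in Python) of how a Lexer might intern integer literals into such a lookup table; the names (`intern_integer`, `integer_table`) and the dict-based deduplication are my own assumptions, not taken from any particular compiler:

```python
class Lexer:
    """Hypothetical lexer fragment that interns integer literals."""

    def __init__(self):
        self.integer_table = []   # lookup table: index -> actual value
        self._seen = {}           # reverse map: value -> index, for dedup

    def intern_integer(self, text):
        # Resolve the literal's text to its value, store it once,
        # and return a token carrying only the lookup index.
        value = int(text)
        if value not in self._seen:
            self._seen[value] = len(self.integer_table)
            self.integer_table.append(value)
        return ("integer-literal", self._seen[value])

lexer = Lexer()
tok = lexer.intern_integer("99")
print(tok)                          # ('integer-literal', 0)
print(lexer.integer_table[tok[1]])  # 99
```

Note that with this design a second occurrence of `99` in the source yields a token with the same index, which is where the memory savings come from.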
Integers are the most basic example. The logic gets considerably more complex with string literals, for example: storing lookup indexes in string-literal tokens would require the Lexer to resolve the contents of a string (escape sequences, etc.). Is this still the Lexer's job, or should it be delayed to a later stage like semantic analysis? There is a similar question that addresses string escaping during lexical analysis.
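To illustrate what "resolving the contents of a string" would mean in practice, here is a hedged sketch of the escape handling the Lexer would have to perform up front under this design (the `ESCAPES` table and function name are illustrative assumptions, covering only a few common escapes):

```python
# Assumed minimal escape table; a real language would support more.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}

def resolve_string(raw):
    """Turn the raw source text between the quotes into the actual value,
    raising on escapes the language does not define."""
    out, i = [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\":
            i += 1
            if i >= len(raw) or raw[i] not in ESCAPES:
                raise SyntaxError(f"invalid escape: \\{raw[i:i+1]}")
            out.append(ESCAPES[raw[i]])
        else:
            out.append(ch)
        i += 1
    return "".join(out)

resolve_string(r"a\nb")  # resolves to "a", newline, "b"
```

The point is that deciding `\n` is a newline while `\z` is an error is a semantic judgment, and storing resolved values in the lookup table forces that judgment into the Lexer.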
I can think of the following pros and cons for using such lookup indexes:
Pros
- Later stages of the compiler don't need to know the details of the literal and can refer to it by a number.
- Smaller memory footprint (strings are usually heap-allocated). Each literal is allocated once instead of N times.
- The compiler can optimize better, since it deals with simple numeric values instead of long strings.
Cons
- The Lexer's job gets considerably harder. Semantic validation needs to be performed partially by the Lexer ("\z" might be an invalid escape).
- Changes to the grammar of literals require the Lexer to be changed, which, as I learned, should be avoided.