A lexical token or simply token is a pair consisting of a token name and an optional token value.
I've spent a lot of time trying to figure out whether a token value is supposed to be a string, or whether it could be something else, like a lookup index.
As an example of a lookup index, consider the following literal:

99

which would produce the following token (note that 315 is the token's lookup index, used to query the lookup table):

sample-token
<integer-literal, 315>

This token is then tightly coupled with its lookup table, which contains the actual token value:

integer-lookup-table
...
<315, 99>
...
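To make the scheme concrete, here is a minimal sketch (in Python) of how a Lexer might intern integer literals into such a lookup table; the names (`intern_integer`, `integer_table`) and the dict-based deduplication are my own assumptions, not taken from any particular compiler:

```python
class Lexer:
    """Hypothetical lexer fragment that interns integer literals."""

    def __init__(self):
        self.integer_table = []   # lookup table: index -> actual value
        self._seen = {}           # reverse map: value -> index, for dedup

    def intern_integer(self, text):
        # Resolve the literal's text to its value, store it once,
        # and return a token carrying only the lookup index.
        value = int(text)
        if value not in self._seen:
            self._seen[value] = len(self.integer_table)
            self.integer_table.append(value)
        return ("integer-literal", self._seen[value])

lexer = Lexer()
tok = lexer.intern_integer("99")
print(tok)                          # ('integer-literal', 0)
print(lexer.integer_table[tok[1]])  # 99
```

Note that with this design a second occurrence of `99` in the source yields a token with the same index, which is where the memory savings come from.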
Integers are the most basic example. The logic gets considerably more complex with string literals, for example: storing lookup indexes in string-literal tokens would require the Lexer to resolve the contents of a string (escape sequences, etc.). Is this still the Lexer's job, or should it be delayed to a later stage like semantic analysis? There is a similar question that addresses string escaping during lexical analysis.
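To illustrate what "resolving the contents of a string" would mean in practice, here is a hedged sketch of the escape handling the Lexer would have to perform up front under this design (the `ESCAPES` table and function name are illustrative assumptions, covering only a few common escapes):

```python
# Assumed minimal escape table; a real language would support more.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}

def resolve_string(raw):
    """Turn the raw source text between the quotes into the actual value,
    raising on escapes the language does not define."""
    out, i = [], 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\":
            i += 1
            if i >= len(raw) or raw[i] not in ESCAPES:
                raise SyntaxError(f"invalid escape: \\{raw[i:i+1]}")
            out.append(ESCAPES[raw[i]])
        else:
            out.append(ch)
        i += 1
    return "".join(out)

resolve_string(r"a\nb")  # resolves to "a", newline, "b"
```

The point is that deciding `\n` is a newline while `\z` is an error is a semantic judgment, and storing resolved values in the lookup table forces that judgment into the Lexer.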
I can think of the following pros and cons for using such lookup indexes:
Pros
- Later stages of the compiler don't need to know the details of the literal and can refer to it by a number.
- Smaller memory footprint (strings are usually heap-allocated). Each literal is allocated once instead of N times.
- The compiler can optimize better, since it deals with simple numeric values instead of long strings.
Cons
- The Lexer's job gets considerably harder. Semantic validation needs to be performed partially by the Lexer ("\z" might be an invalid escape).
- Changes to the grammar of literals require the Lexer to be changed, which, as I learned, should be avoided.