I'm trying to understand the difference between "lexeme" and "token" in compilers.
Suppose the lexer in my compiler encounters the following sequence of characters in the source code to be compiled:
"abc"
Is it correct to say that the above is a lexeme that is 5 characters long?
If my compiler is implemented in C, and I allocate space for a token for this lexeme, the token will be a struct. The first member of the struct will be an int whose value comes from some enum of token types, in this case STRING_LITERAL. The second member will be a char * that points to some (dynamically allocated) memory 4 bytes long: the first byte is 'a', the second 'b', the third 'c', and the fourth is a NUL byte ('\0') to terminate the string.
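
Here's a minimal sketch of what I have in mind (the names TokenType, Token, and make_string_token are just placeholders I made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token-type enum; a real lexer would have many more entries. */
enum TokenType {
    STRING_LITERAL,
    INTEGER_LITERAL
};

/* The token struct described above: a type tag plus a pointer to the text. */
struct Token {
    int   type;  /* one of enum TokenType */
    char *text;  /* dynamically allocated, NUL-terminated string body */
};

/* Build the token for the lexeme "abc" (5 characters in the source):
 * the surrounding quotes are dropped, and the 3 remaining characters
 * plus a terminating '\0' are copied into 4 bytes of heap memory. */
struct Token *make_string_token(const char *body, size_t len)
{
    struct Token *tok = malloc(sizeof *tok);
    if (tok == NULL)
        return NULL;
    tok->type = STRING_LITERAL;
    tok->text = malloc(len + 1);  /* len chars + '\0' */
    if (tok->text == NULL) {
        free(tok);
        return NULL;
    }
    memcpy(tok->text, body, len);
    tok->text[len] = '\0';
    return tok;
}
```

So calling make_string_token("abc", 3) would produce the token I described above.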
So...
The lexeme is 5 characters of the source code text.
The token is sizeof(int) + sizeof(char *) bytes for the struct itself (plus any padding), plus the 4 dynamically allocated bytes it points to.
Is that the correct way to use the terminology?
(I'm ignoring tokens also tracking metadata like the filename, line number, and column number.)
Sort of related question:
Is it uncommon practice to have the lexer convert an integer lexeme into an integer value stored in the token? Or is it better (or more standard) to store the lexeme's characters in the token and let the parser convert those characters to an integer when it builds the AST node?
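
To illustrate the two alternatives I'm asking about (these struct names and parse_integer_text are made up for the sake of the question):

```c
#include <stdlib.h>

/* Alternative 1: the lexer converts the digits itself and the token
 * carries the numeric value. */
struct IntTokenEager {
    int  type;   /* INTEGER_LITERAL */
    long value;  /* e.g. the result of strtol() run inside the lexer */
};

/* Alternative 2: the token carries the raw characters, and the parser
 * converts them when it builds the integer node for the AST. */
struct IntTokenLazy {
    int   type;  /* INTEGER_LITERAL */
    char *text;  /* NUL-terminated copy of the digits, e.g. "42" */
};

/* In alternative 2, the parser would do something like this: */
long parse_integer_text(const char *text)
{
    return strtol(text, NULL, 10);  /* error handling omitted */
}
```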