I'm trying to understand the difference between "lexeme" and "token" in compilers.
Suppose the lexer in my compiler encounters the following sequence of characters in the source code to be compiled:
"abc"
Is it correct to say that the above is a lexeme that is 5 characters long?
If my compiler is implemented in C, and I allocate space for a token for this lexeme, the token will be a struct. The first member of the struct will be an int whose value comes from some enum of token types, in this case STRING_LITERAL. The second member will be a char * that points to some (dynamically allocated) memory 4 bytes long: the first byte is 'a', the second 'b', the third 'c', and the fourth is a NUL byte ('\0') to terminate the string.
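
Here's a minimal sketch of what I have in mind (the names TokenType, Token, and make_string_token are just placeholders I made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token-type enum; a real lexer would have many more entries. */
enum TokenType {
    STRING_LITERAL,
    INTEGER_LITERAL
};

/* The token struct described above: a type tag plus a pointer to the text. */
struct Token {
    int   type;  /* one of enum TokenType */
    char *text;  /* dynamically allocated, NUL-terminated string body */
};

/* Build the token for the lexeme "abc" (5 characters in the source):
 * the surrounding quotes are dropped, and the 3 remaining characters
 * plus a terminating '\0' are copied into 4 bytes of heap memory. */
struct Token *make_string_token(const char *body, size_t len)
{
    struct Token *tok = malloc(sizeof *tok);
    if (tok == NULL)
        return NULL;
    tok->type = STRING_LITERAL;
    tok->text = malloc(len + 1);  /* len chars + '\0' */
    if (tok->text == NULL) {
        free(tok);
        return NULL;
    }
    memcpy(tok->text, body, len);
    tok->text[len] = '\0';
    return tok;
}
```

So calling make_string_token("abc", 3) would produce the token I described above.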
So...
The lexeme is 5 characters of the source code text.
The token is sizeof(int) + sizeof(char *) bytes for the struct itself (plus any padding), plus the 4 dynamically allocated bytes it points to.
Is that the correct way to use the terminology?
(I'm ignoring tokens also tracking metadata like the filename, line number, and column number.)
Sort of related question:
Is it uncommon practice to have the lexer convert an integer lexeme into an integer value stored in the token? Or is it better (or more standard) to store the lexeme's characters in the token and let the parser convert those characters to an integer when it builds the AST node?
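
To illustrate the two alternatives I'm asking about (these struct names and parse_integer_text are made up for the sake of the question):

```c
#include <stdlib.h>

/* Alternative 1: the lexer converts the digits itself and the token
 * carries the numeric value. */
struct IntTokenEager {
    int  type;   /* INTEGER_LITERAL */
    long value;  /* e.g. the result of strtol() run inside the lexer */
};

/* Alternative 2: the token carries the raw characters, and the parser
 * converts them when it builds the integer node for the AST. */
struct IntTokenLazy {
    int   type;  /* INTEGER_LITERAL */
    char *text;  /* NUL-terminated copy of the digits, e.g. "42" */
};

/* In alternative 2, the parser would do something like this: */
long parse_integer_text(const char *text)
{
    return strtol(text, NULL, 10);  /* error handling omitted */
}
```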