My goal is to build a parser for a reasonable subset of C, and right now I'm at the start, implementing the lexer. Answers to a similar question on the same topic pointed towards the International Standard for C (about 700 pages of documentation) and the Yacc grammar webpage.
I would welcome any help with understanding the documentation. Is it true that the following picture from the documentation represents grammar rules, where the notation C -> (A, B) means that every occurrence of A followed by B (in that order) gets replaced by C?
identifier -> identifier-nondigit | (identifier,identifier-nondigit) | (identifier,digit)
identifier-nondigit -> nondigit | universal-character-name | other implementation-defined characters
digit -> 0 | 1 | 2 | ... | 9
nondigit -> _ | a | b | ... | z | A | B | ... | Z
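For example, if my reading is correct, lexing the identifier a1 would amount to a sequence of replacements like this (each step rewrites a matched right-hand side into its left-hand side):

    a 1
    -> nondigit 1                 (a matches nondigit)
    -> identifier-nondigit 1      (first alternative of identifier-nondigit)
    -> identifier 1               (first alternative of identifier)
    -> identifier digit           (1 matches digit)
    -> identifier                 (third alternative of identifier)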
I think I am confused because the documentation introduces 'preprocessing tokens', which I thought would just be labels for sequences of characters in the source, produced without backtracking.
I.e., something like:
"15647 \n \t abdsfg8rg \t" -> "DWLDLW"
// D ... digits, W ... whitespace, L ... letters
It seems like the lexer is doing the same thing as the parser (just building a tree). What is the reason for introducing both preprocessing tokens and tokens? Does it mean that the processing should be done in 'two waves', i.e. two separate passes?
I was expecting the lexer to just use some regular expressions and maybe a few extra rules. But it seems like the result of lexing is a sequence of trees whose roots can be keyword, identifier, constant, string-literal, or punctuator.
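To make my expectation concrete, here is roughly the kind of flat, single-pass tokenizer I had in mind. The names (TokenKind, next_token) and the tiny token set are just my own illustration, not anything taken from the standard:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* A flat token: just a kind and the matched lexeme, no tree structure. */
    typedef enum { TK_IDENT, TK_NUMBER, TK_PUNCT, TK_EOF } TokenKind;

    typedef struct {
        TokenKind kind;
        char lexeme[64];
    } Token;

    /* Scan one token starting at *src, advancing the cursor; no backtracking. */
    static Token next_token(const char **src) {
        const char *p = *src;
        Token t = { TK_EOF, "" };

        while (isspace((unsigned char)*p))   /* skip whitespace */
            p++;

        if (*p == '\0') {
            *src = p;
            return t;
        }

        const char *start = p;
        if (isalpha((unsigned char)*p) || *p == '_') {    /* identifier */
            while (isalnum((unsigned char)*p) || *p == '_')
                p++;
            t.kind = TK_IDENT;
        } else if (isdigit((unsigned char)*p)) {          /* number */
            while (isdigit((unsigned char)*p))
                p++;
            t.kind = TK_NUMBER;
        } else {                                          /* single-char punctuator */
            p++;
            t.kind = TK_PUNCT;
        }

        size_t n = (size_t)(p - start);
        if (n >= sizeof t.lexeme)                         /* truncate long lexemes */
            n = sizeof t.lexeme - 1;
        memcpy(t.lexeme, start, n);
        t.lexeme[n] = '\0';

        *src = p;
        return t;
    }

    int main(void) {
        const char *input = "int x1 = 42;";
        for (Token t = next_token(&input); t.kind != TK_EOF; t = next_token(&input))
            printf("kind=%d lexeme=%s\n", t.kind, t.lexeme);
        return 0;
    }

My question is essentially whether the standard's preprocessing-token/token distinction means the real lexer has to be fundamentally more complicated than this.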
Thank you for any clarifications.