I am curious about how C/C++ lexers and parsers work together. I know the parser generally needs at least one token of lookahead. My question, though, is about production compilers (say gcc or clang):

1) Does the lexer run first, lex the entire file, and then let the parser generate the AST? This would mean the lexer produces a complete list of tokens up front.

OR

2) Does the lexer only generate a small set of tokens, just enough for the parser to do its job? This would mean the lexer and parser take turns running.

I definitely think option 1 is used, since languages like C++ sometimes need arbitrary lookahead (the grammar isn't context-free), but this would take a lot of memory.

Jas
  • Some info here for clang: http://clang.llvm.org/docs/InternalsManual.html; it's more your option 2 (the parser requests tokens from the lexer), but I'm not nearly familiar enough to actually answer. – Mat Sep 16 '19 at 18:43
  • Part 1. A couple of things: while it's true that the language C++ is not context-free (for example, a variable needs to be defined before it can be used), the grammars used to parse the language are context-free. How can this be? For example, a parser based on an LALR(1) grammar will perform *semantic actions* when it performs reductions. Specifically, when it recognizes a variable declaration, it will enter the variable definition into a symbol table. When it recognizes an expression that uses a variable, it can check the symbol table to see if the variable exists. – Booboo Sep 16 '19 at 19:11
  • Part 2. In general, lexical analysis and parsing run in parallel, with the parser calling the lexer for the "next token" as it needs it. But this can involve a lot of reading and processing on the part of the lexer. Consider the handling of include files, which may be nested to many levels and contain macro definitions. One of the C/C++ compile-time switches is the ability to only preprocess the input and output a new source file without compiling. But this is not the actual stream of tokens that the parser deals with; you would still need lexical analysis on this output. – Booboo Sep 16 '19 at 19:20
  • Neither (1) nor (2). The lexer isn't 'run' at all: it is called as a procedure by the parser every time the parser needs a new token, and it returns one token at a time. C++ does not need arbitrary lookahead, and the fact that the grammar isn't context-free has nothing to do with that. – user207421 Sep 16 '19 at 21:31

1 Answer

The traditional answer is close to your case 2, but not exactly that. Note that lexers and parsers are both typically implemented as relatively simple state machines.

The lexing state machine could be driven from either:

  • Get me a new token

(which obviously needs to get input codes and assemble them into tokens), or:

  • Here is a new input character

(which eventually causes tokens to "fall out" of the lexer).

The parser state machine can be driven from either direction:

  • Get me a parse

(which then has to obtain tokens until it finds a complete sentence), or:

  • Here is a new token

(which then has to assemble the tokens into a sentence).
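
To make those four driving styles concrete, here is a minimal C++ sketch. All of the names here (Token, PullLexer, PushParser, and so on) are hypothetical illustrations, not interfaces taken from gcc or clang:

#include <functional>
#include <string>

struct Token {
    enum Kind { Identifier, Number, Punct, Eof } kind;
    std::string text;
};

struct Ast;  // opaque parse result

// Lexer, pull style: "get me a new token".
struct PullLexer {
    virtual Token next() = 0;
    virtual ~PullLexer() = default;
};

// Lexer, push style: "here is a new input character";
// completed tokens "fall out" through the callback.
struct PushLexer {
    std::function<void(const Token&)> on_token;
    virtual void feed(char c) = 0;
    virtual ~PushLexer() = default;
};

// Parser, pull style: "get me a parse"; it pulls tokens as it needs them.
struct PullParser {
    virtual Ast* parse(PullLexer& lexer) = 0;
    virtual ~PullParser() = default;
};

// Parser, push style: "here is a new token";
// completed sentences "fall out" through the callback.
struct PushParser {
    std::function<void(const Ast&)> on_sentence;
    virtual void feed(const Token& token) = 0;
    virtual ~PushParser() = default;
};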

If both state machines were driven in the push direction, we'd "compile" a file with:

for all input characters:
    feed character to tokenizer

and as tokens "fall out" of the tokenizer, they'd drive the parser. The whole thing would be coroutines driven from the bottom up.

Traditionally, in the parsers generated by yacc, bison, and so on, and in the lexers that serve them, we run more "top-down": someone calls a "get me a sentence" function, which may build an AST, or directly emit code, or something in between (e.g., build an AST for one function or declaration, turn that into intermediate code, then build another AST for the next function or declaration, and so on). That drives everything in the direction of pulling tokens in from the lexer, but it's still rather coroutine-ish, as the parser just asks for one token at a time.

This approach is also the obvious way to hand-code a recursive descent parser: your top function is "get me a sentence" (or "get me all sentences" or whatever) and eventually that leads to some function that calls "get me a token". So in both of these cases, the actual expression of the algorithm winds up making repeated "get me a token" calls to the lexer.
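
As a toy illustration of that shape, here is a hand-coded recursive descent parser for the tiny grammar expr -> Number ('+' Number)*, reusing the hypothetical Token and PullLexer from the sketch above; note how every path eventually bottoms out in a "get me a token" call:

#include <stdexcept>
#include <string>

class DescentParser {
public:
    explicit DescentParser(PullLexer& lex) : lex_(lex), tok_(lex_.next()) {}

    // "Get me a sentence": a real parser would build an AST here;
    // this one just evaluates the expression.
    long parseExpr() {
        long value = expectNumber();
        while (tok_.kind == Token::Punct && tok_.text == "+") {
            advance();               // consume '+'
            value += expectNumber();
        }
        return value;
    }

private:
    long expectNumber() {
        if (tok_.kind != Token::Number)
            throw std::runtime_error("expected a number");
        long value = std::stol(tok_.text);
        advance();
        return value;
    }
    void advance() { tok_ = lex_.next(); }  // the "get me a token" call

    PullLexer& lex_;
    Token tok_;  // the single token of lookahead the question mentions
};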

GCC has a hand-coded parser (and hand-coded lexer) that works in this way. I haven't looked at the innards of clang but I suspect it's the same.

As to C++ specifically, well, it has some very nasty parsing cases; see https://en.wikipedia.org/wiki/Most_vexing_parse and Pavel Minaev's answer to Is there a good Python library that can parse C++?. Some compilers use ad-hoc methods to deal with this, e.g., provide an overly accepting grammar and try to back-patch the eventual AST, or "steer" the grammar via hacks. (I've seen C++ compilers crash here: feed them syntactically valid tokens that make semantic nonsense, and the hacks can misfire.) Another, arguably much better, method is to use a GLR parser; see Ira Baxter's answer here.
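
For reference, the classic example from that Wikipedia article:

struct Timer {};

struct TimeKeeper {
    explicit TimeKeeper(Timer t);
    int get_time();
};

int main() {
    TimeKeeper time_keeper(Timer());
    // Surprise: this declares a *function* named time_keeper, taking an
    // unnamed pointer-to-function-returning-Timer and returning TimeKeeper.
    // It does not construct an object, so the following line, if
    // uncommented, would fail to compile:
    // return time_keeper.get_time();
}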

(I haven't done anything parser-theory-ish in ages, and in writing this answer came across sjoerd's comment about GLL parsers from 2011, which is pretty interesting.)

torek