4

In classic compiler theory, the first two phases are lexical analysis and parsing, arranged in a pipeline: lexical analysis recognizes tokens, which become the input to parsing.

But I came across some cases that are hard to recognize correctly in lexical analysis. For example, the following code involving a C++ template:

map<int, vector<int>>

Here a "regular" lexical analyzer would recognize the >> as a bitwise right-shift operator, which is not correct. My feeling is that it's hard to divide the handling of this kind of grammar into two phases; some of the lexing work has to be done in the parsing phase, because correctly recognizing the >> relies on the grammar, not just a simple lexical rule.
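To make the ambiguity concrete, here is a minimal sketch of a maximal-munch lexer (the function and token spellings are hypothetical, not from any real compiler): it always grabs the longest operator it can, so the two closing angle brackets come out as a single right-shift token.

```cpp
#include <string>
#include <vector>

// Minimal maximal-munch lexer for a few C++ operators (illustrative only).
// Given ">>", it always prefers the longest match, just as a classic
// table-driven lexer would.
std::vector<std::string> lex_operators(const std::string& src) {
    std::vector<std::string> tokens;
    for (std::size_t i = 0; i < src.size(); ) {
        if (src.compare(i, 3, ">>=") == 0)       { tokens.push_back(">>="); i += 3; }
        else if (src.compare(i, 2, ">>") == 0)   { tokens.push_back(">>");  i += 2; }
        else if (src[i] == '>' || src[i] == '<') { tokens.push_back(std::string(1, src[i])); ++i; }
        else ++i;  // skip identifiers, commas, and spaces for this sketch
    }
    return tokens;
}
```

Lexing `map<int, vector<int>>` with this yields the tokens `<`, `<`, `>>`: the template closers are swallowed as a shift operator, which is exactly the problem described above.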

I'd like to know the theory and practice behind this problem. Also, how do C++ compilers handle this case?

Dagang
  • I don't know for sure, but I always assumed they had some special-case code for this particular occurrence, i.e. something like `if(input == ">>") check_for_bitwise_right; if(!bitwise_right) check_for_double_template_close;` But I have nothing to back this up. – David says Reinstate Monica Oct 01 '13 at 01:42
  • Seems like this should be covered by any compiler course or textbook. – Barmar Oct 01 '13 at 01:42
  • I think that many (pre-C++0x) *do* interpret it as bitwise right shift, actually. – Theodoros Chatzigiannakis Oct 01 '13 at 01:44
  • This is a special case, it's explicitly mentioned by the standard since it's so annoying – daniel gratzer Oct 01 '13 at 01:45
  • What's wrong with lexing ">>" as '>' '>' and letting the parser sort it out? – Ira Baxter Oct 01 '13 at 01:45
  • @IraBaxter You forget the left hand side: `map – orlp Oct 01 '13 at 01:47
  • @nightcracker: What about the left hand side? The parser can propose "id < args > " as one parse, and "id < arg >> ..." as another. This requires some ability to look ahead (C++ must look arbitrarily far ahead to resolve this) and most "standard" pure parsing mechanisms (LL(1), LALR(k)) can't do it. GLR has no problem with this. (We build a C++11 parser that works this way, just fine.) But complaining about weak parsers is just that: a complaint about using the wrong technology to solve a problem. (I believe Clang has some funny lookahead hack). – Ira Baxter Oct 01 '13 at 02:06
  • Lexing used to be defined as returning the longest match, i.e. '+=' instead of '+' and '=', and standard automata do exactly that, so your suggestion would break that rule. But of course, there is nothing wrong with rewriting the tree during parsing. Also, viewing the parser as a function P : [Token] -> Tree, list of tokens to tree, it seems conceptually simpler to take out the '>>' as one token and push the second '>' back as opposed to looking at two to see if you can 'fuse' them. – user1666959 Oct 01 '13 at 02:14
  • @user1666959: If you don't define a token "+=", you don't have to find a longest match for it. Similarly for the non-token (if you care to define it that way) ">>". You can try to bend the parsing machinery to rewrite ">>" as ">", but it doesn't want to do that. Simply processing ">>" as ">" ">" makes it possible to solve all these problems by writing straightforward grammar rules. Why make it hard and confusing? – Ira Baxter Oct 01 '13 at 03:41
  • @IraBaxter It's a good idea to lex ">>" as '>' '>' and let the parser sort it out. In theory, we could pass all the characters directly to the parser. The parser has the full grammar info to decide what to do, right? – Dagang Oct 01 '13 at 03:51
  • @Todd: In theory, you can parse at the character level. (Some tools actually do that; go find out about "scannerless parsers"). As a practical technique, you want the lexer to collect "as big a chunk" as practical to maximize the value of lightning-fast FSA processing of characters vs. somewhat slower parsing transitions per character. (My limited experience indicates scannerless parsers are significantly slower). But you can decide how big a lexeme is, on a per-lexeme basis. In this case, having a ">>" lexeme just causes grief. "Doctor, doctor, it hurts when I do X". Doctor: "Don't do X". – Ira Baxter Oct 01 '13 at 03:55
  • @IraBaxter: there are some problems with this approach. You need to handle `>>` and `> >` differently. The latter is OK in the template context, but is an error otherwise. You also need to handle `>>=`. All is doable if you try hard enough, but why bother? The C++ lexer (and the C lexer) needs feedback from the parser anyway, on a very fundamental level, because of the ambiguity between names, type names and template names. – n. m. could be an AI Oct 01 '13 at 06:14
  • @n.m. Easy enough; you can add a semantic check to the grammar rule "shift_operator = '>' '>' ; " which checks that there's no intervening whitespace. And no, C and C++ parsers do not NECESSARILY need feedback between lexer and parser; one can build real, production parsers without it. See my SO discussion of this: http://stackoverflow.com/a/4173543/120163 – Ira Baxter Oct 01 '13 at 07:28
  • See http://stackoverflow.com/a/15785583/192359 – Asaf Oct 01 '13 at 07:40
  • Asaf, Thanks for your information. I think it's the same idea as that of @IraBaxter, lex ">>" as '>' '>', then handle it in parsing phase. – Dagang Oct 01 '13 at 08:11
  • @IraBaxter: yes, one can, but traditionally it's done with feedback, it's easier that way. – n. m. could be an AI Oct 01 '13 at 08:15
  • @n.m.: "It's easier that way..." Well, our experience with our approach is pretty good. We have a very small team and a very good front end made possible by the choice to explicitly separate parsing and name resolution issues. The other front ends I know about (GCC and EDG) have gone through conniptions over the years, including GCC throwing parsing with Bison off a cliff and EDG having always hand-built their parsers. These surely can't be "easier" than building a parser directly from grammar rules. Your opinion may vary. – Ira Baxter Oct 01 '13 at 09:17
  • I like the idea of making lexing and parsing as 2 phases of a pipeline. This model is much simpler to understand. Mixing them with feedback complicates the model. – Dagang Oct 01 '13 at 09:30
  • @IraBaxter: that's certainly an interesting approach, but don't you have to deal with far more grammar ambiguities this way? – n. m. could be an AI Oct 01 '13 at 09:50
  • @n.m.: the parser has to handle ambiguities, which GLR does with aplomb. We use attribute grammars to walk the parse tree (with ambiguity nodes) and perform name resolution. Anyplace the name resolution decides that a particular child tree of an ambiguous node isn't sanely typeable, it simply deletes that child tree. The remaining tree is the typed version of the program with the ambiguities gone. This is clean and sane and thus eminently engineerable. The attribute grammars make it even almost pretty. ... – Ira Baxter Oct 01 '13 at 10:13
  • ... An ideal solution would, using program transformations, compose the attribute grammars (at least the left-to-right part :) with parsing process, giving something like the "parser feedback hack" implemented in a really clean way. – Ira Baxter Oct 01 '13 at 10:14
  • @IraBaxter Ah, attribute grammars, that's nice. I should try them one day. But I don't think they teach them in Compilers 101 or whatever the OP is taking. I'm not sure an undergrad student would find them easier to understand than the feedback hack... – n. m. could be an AI Oct 01 '13 at 10:31
  • I'm not sure an undergrad would understand (or should need to understand) the need for the feedback hack in C++. He should be learning the basics of building a parser, not the dark corners of hacks needed to make C++ work. My push for GLR grammars and simple token schemes means he can focus on learning about grammars and parsers, and leave all the dark corner trash for when he really needs it. – Ira Baxter Oct 01 '13 at 20:14

2 Answers

6

The C++ standard requires that an implementation perform lexical analysis to produce a stream of tokens, before the parsing stage. According to the lexical analysis rules, two consecutive > characters (not followed by =) will always be interpreted as one >> token. The grammar provided with the C++ standard is defined in terms of these tokens.

The requirement that in certain contexts (such as when expecting a > within a template-id) the implementation should interpret >> as two > is not specified within the grammar. Instead the rule is specified as a special case:

14.2 Names of template specializations [temp.names]

After name lookup (3.4) finds that a name is a template-name or that an operator-function-id or a literal-operator-id refers to a set of overloaded functions any member of which is a function template if this is followed by a <, the < is always taken as the delimiter of a template-argument-list and never as the less-than operator. When parsing a template-argument-list, the first non-nested > is taken as the ending delimiter rather than a greater-than operator. Similarly, the first non-nested >> is treated as two consecutive but distinct > tokens, the first of which is taken as the end of the template-argument-list and completes the template-id. [ Note: The second > token produced by this replacement rule may terminate an enclosing template-id construct or it may be part of a different construct (e.g. a cast).—end note ]

Note the earlier rule, that in certain contexts < should be interpreted as the < in a template-argument-list. This is another example of a construct that requires context in order to disambiguate the parse.

The C++ grammar contains many such ambiguities which cannot be resolved during parsing without information about the context. The best known of these is the Most Vexing Parse, in which an identifier may be interpreted as a type-name depending on context.

Keeping track of the aforementioned context in C++ requires an implementation to perform some semantic analysis in parallel with the parsing stage. This is commonly implemented in the form of semantic actions that are invoked when a particular grammatical construct is recognised in a given context. These semantic actions then build a data structure that represents the context and permits efficient queries. This is often referred to as a symbol table, but the structure required for C++ is pretty much the entire AST.

This kind of context-sensitive semantic action can also be used to resolve ambiguities. For example, on recognising an identifier in the context of a namespace-body, a semantic action will check whether the name was previously defined as a template. The result of this will then be fed back to the parser. This can be done by marking the identifier token with the result, or replacing it with a special token that will match a different grammar rule.

The same technique can be used to mark a < as the beginning of a template-argument-list, or a > as the end. The rule for context-sensitive replacement of >> with two > poses essentially the same problem and can be resolved using the same method.
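As a sketch of the replacement rule from [temp.names] (the function and token spellings are hypothetical; real front ends do this inside the parser rather than as a separate pass), a parser that tracks its template-argument-list nesting depth can split a `>>` token into two `>` tokens on demand:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative sketch of the [temp.names] rule: while inside a
// template-argument-list (depth > 0), a ">>" token is treated as two
// consecutive ">" tokens; outside, it remains a right-shift operator.
std::vector<std::string> resolve_closers(const std::vector<std::string>& tokens) {
    std::vector<std::string> out;
    int template_depth = 0;
    for (const auto& tok : tokens) {
        if (tok == "<") {                      // assumed to open a template-argument-list
            ++template_depth;
            out.push_back(tok);
        } else if (tok == ">>" && template_depth > 0) {
            out.push_back(">");                // the replacement rule: two distinct ">" tokens
            out.push_back(">");
            template_depth = std::max(0, template_depth - 2);
        } else if (tok == ">" && template_depth > 0) {
            --template_depth;
            out.push_back(tok);
        } else {
            out.push_back(tok);
        }
    }
    return out;
}
```

Note the deliberate oversimplification: the sketch treats every `<` as opening a template-argument-list, whereas a real compiler needs name lookup to make that decision. That is exactly the context dependence described above.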

willj
  • GLR parsing probably deserves a mention, at least from a theoretical perspective. It's not used by any production-ready compiler that I know of. I'm also unaware of a GLR parser for C++ that beats the performance of existing hand-coded parsers (e.g. GCC/Clang). I would love to be enlightened! – willj Oct 01 '13 at 11:07
  • For example, marking a `<` token as the beginning of a *template-argument-list* may be achieved by changing its token-id from `LESS_THAN` to `LEFT_ANGLE_BRACKET`, and defining *template-argument-list* in terms of `LEFT_ANGLE_BRACKET`. – willj Oct 01 '13 at 11:45
  • Our GLR parser doesn't beat the hand-coded ones in speed. But we're not building a compiler; we're building a program transformation system, and optimizing on parse speed hasn't been a real need. There is plenty of work showing how to make L(AL)R parsers blisteringly fast, and most of the GLR states act like L(AL)R states. People have even built GLR parsers that act L(AL)R where the states are L(AL)R. So it should be straightforward engineering to build blisteringly fast GLR parsers for most of the parse states, and the rest shouldn't matter much. Agreed, nobody has done that yet. – Ira Baxter Oct 01 '13 at 20:18
1

You are right, the theoretically clean distinction between lexer and parser is not always possible. I remember a project I worked on as a student. We were to implement a C compiler, and the grammar we used as a basis would treat typedef'd names as types in some cases and as identifiers in others. So the lexer had to switch between these two modes. The way I implemented this back then was with special empty rules, which reconfigured the lexer depending on context. To accomplish this, it was vital to know that the parser would always use exactly one token of look-ahead, so any change to lexer behaviour had to occur at least one lexical token before the affected location. In the end, this worked quite well.
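The classic form of this feedback (often called "the lexer hack"; all names below are hypothetical) is a lexer that consults a symbol table maintained by the parser to classify each identifier:

```cpp
#include <set>
#include <string>

enum class TokenKind { Identifier, TypeName };

// The parser records typedef names here as it processes declarations;
// the lexer reads the table to classify later identifiers ("the lexer hack").
struct SymbolTable {
    std::set<std::string> typedef_names;
};

TokenKind classify(const SymbolTable& syms, const std::string& word) {
    return syms.typedef_names.count(word) ? TokenKind::TypeName
                                          : TokenKind::Identifier;
}
```

After the parser sees `typedef int myint;` and inserts "myint" into the table, a later `myint x;` lexes as TypeName Identifier rather than two Identifiers, which is precisely the mode switch described above.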

In the C++ case of >> you mention, I don't know what compilers actually do. willj quoted how the specification phrases this, but implementations are allowed to do things differently internally, as long as the visible result is the same. So here is how I'd try to tackle this: upon reading a >, the lexer would emit token GREATER, but also switch to a state where each subsequent > without a space in between would be lexed to GREATER_REPEATED. Any other symbol would switch the state back to normal. Instead of state switches, you could also do this by lexing the regular expression >+, and emitting multiple tokens from this rule. In the parser, you could then use rules like the following:

rightAngleBracket: GREATER | GREATER_REPEATED;
rightShift: GREATER GREATER_REPEATED;

With a bit of luck, you could make template argument rules use rightAngleBracket, while expressions would use rightShift. Depending on how much look-ahead your parser has, it might be necessary to introduce additional non-terminals to hold longer sequences of ambiguous content, until you encounter some context which allows you to finally decide between these cases.
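The GREATER/GREATER_REPEATED scheme described above can be sketched like this (the token names come from the answer; the function itself is hypothetical): the first `>` becomes GREATER, and each immediately following `>` with no space in between becomes GREATER_REPEATED.

```cpp
#include <string>
#include <vector>

// Lex runs of '>' as described above: the first one becomes GREATER, each
// directly adjacent one becomes GREATER_REPEATED. Any other character
// (including a space) resets the state, so "> >" is GREATER GREATER.
std::vector<std::string> lex_angles(const std::string& src) {
    std::vector<std::string> tokens;
    bool after_greater = false;
    for (char c : src) {
        if (c == '>') {
            tokens.push_back(after_greater ? "GREATER_REPEATED" : "GREATER");
            after_greater = true;
        } else {
            after_greater = false;
        }
    }
    return tokens;
}
```

With this, `>>` lexes as GREATER GREATER_REPEATED, matching the rightShift rule, while each token on its own can still serve as a rightAngleBracket closing a template-argument-list.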

MvG