7

After reading this question I am left wondering what happens (regarding the AST) when major C++ compilers parse code like this:

struct foo 
{
  void method() { a<b>c; }

  // a b c may be declared here
};

Do they handle it like a GLR parser would or in a different way? What other ways are there to parse this and similar cases?

For example, I think it's possible to postpone parsing the body of the method until the whole struct has been parsed, but is this really possible and practical?

panoskj
  • 93
  • 5

2 Answers2

2

The answer will obviously depend on the compiler, but the article How Clang handles the type / variable name ambiguity of C/C++ by Eli Bendersky explains how Clang does it. I will simply note some key points from the article:

  • Clang has no need for a lexer hack: the information goes in a single direction from lexer to parser

  • Clang knows when an identifier is a type by using a symbol table

  • C++ requires declarations to be visible throughout the class, even in code that appears before it

  • Clang gets around this by doing a full parsing/semantic analysis of the declaration, but leaving the definition for later; in other words, it's lexed but parsed after all the declared types are available

  • "Clang has no need for a lexer hack". OK, so that's an improvement over having a lexer hack, but it sounds from your description like they replaced with it with a correspondingly icky parser hack. One man's opinion. – Ira Baxter Oct 27 '18 at 16:43
  • Although I accepted another answer, because it is more complete on its own, the article you provided is really insightful and I recommend reading it. – panoskj Oct 28 '18 at 09:47
2

Although it is certainly possible to use GLR techniques to parse C++ (see a number of answers by Ira Baxter), I believe that the approach commonly used in commonly-used compilers such as gcc and clang is precisely that of deferring the parse of function bodies until the class definition is complete. (Since C++ source code passes through a preprocessor before being parsed, the parser works on streams of tokens and that is what must be saved in order to reparse the function body. I don't believe that it is feasible to reparse the source code.)

It's easy to know when a function definition is complete, since braces ({}) must balance even if it is not known how angle brackets nest.

C++ is not the only language in which it is useful to defer parsing until declarations have been handled. For example, a language which allows users to define new operators with different precedences would require all expressions to be (re-)parsed once the names and precedences of operators are known. A more pathological example is COBOL, in which the precedence of OR in a = b OR c depends on whether c is an integer (a is equal to one of b or c) or a boolean (a is equal to b or c is true). Whether designing languages in this manner is a good idea is another question.

rici
  • 234,347
  • 28
  • 237
  • 341
  • Very informative answer, though it could use some reference(s) about gcc/clang deferring the parse. – panoskj Oct 27 '18 at 12:49
  • @panoskj: both compilers are open source; you can read the code yourself. I'm not going to do a full description of the parsing code here, particularly since I haven't looked at them in a while. – rici Oct 27 '18 at 13:07
  • @rici: it wasn't clear you have read gcc/clang code, in which case I'll take your word for it. – panoskj Oct 28 '18 at 09:23