160

I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:

Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.

Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?

Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).

zython
  • 1,176
  • 4
  • 22
  • 50
Cheery
  • 24,645
  • 16
  • 59
  • 83

6 Answers6

247

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

  1. It can be the declaration of y, as pointer to type x
  2. It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there are two different parses, and therefore a program can mean different things depending on how this should have been parsed.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.

Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.

There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pure LR (or the others) parsing.

Most real C/C++ parsers handle this example by using some kind of deterministic parser with an extra hack: they intertwine parsing with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones, etc.) are (at best) context free.

One can cheat, and add per-rule reduction-time semantic checks in the to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types have some means to add semantic checks at various points in the parsing, that can be used to do this.

And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.

There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for our DMS Software Reengineering Tookit (as of June 2017 these handle full C++17 in MS and GNU dialects). They have been used to process millions of lines of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • 11
    While the 'x * y' example is interesting, the same can happen in C ('y' can be a typedef or a variable). But C can be parsed by a LR(1) parser, so what's the difference with C++? – Martin Cote Jun 17 '09 at 02:16
  • 14
    My answser had already observed that C had the same problem, I think you missed that. No, it can't be parsed by LR(1), for the same reason. Er, what do you mean 'y' can be a typedef? Perhaps you meant 'x'? That doesn't change anything. – Ira Baxter Jun 17 '09 at 06:06
  • 6
    Parse 2 is not necessarily stupid in C++, as * could be overridden to have side effects. – Dour High Arch Sep 09 '09 at 16:15
  • 1
    In the provided example, it doesn't have any side effects. You are right, there's variations on this theme (say, overloaded "*") which could have side effects. As stated, people *might* think it was stupid. I've adjusted the text. – Ira Baxter Sep 09 '09 at 17:52
  • 1
    I would count overloading a stateless operator to include side effects as a dumb use (flame bait but I'll elaborate). I always had that problem with operator overloading: none of the properties of an operator that would make it a reasonable analogy to the original use are guaranteed. For example, the comparison operators < > <= >= absolutely don't have to work as a partial order. – Aaron Altman Aug 12 '10 at 19:32
  • 1
    @altie: agreed, when you have overloading you want something like the Liskov substitution principle, which says the key properties of the overloaded operator are preserved. Alas, none of the OO langauges I know about (except Liskov's CLU) insist on that. So, you can object that the overload in C++ is dumb, but it is legal. – Ira Baxter Aug 12 '10 at 20:04
  • 8
    I looked at `x * y` and chuckled - it's amazing how little one thinks of nifty little ambiguities like this. – new123456 Jun 30 '11 at 20:33
  • 3
    The question is about the difference between C and C++ - C can be parsed by LALR(1) with the lexer hack (which you describe), while it seems (from other answers) that even that is impossible for C++. – Blaisorblade Jan 06 '12 at 03:13
  • @Blaisorblade: What people do is get some parser technology that works for most cases, and then hack it handle other cases, either by accepting "too much" and eliminating that later, or by bending the parsing machinery (lexer hack as one example, ad hoc look-ahead, etc.) So I think you can do it; its just awful. – Ira Baxter Jan 06 '12 at 05:33
  • 1
    @IraBaxter: indeed, apparently the PhD thesis cited by Rob Walker, about Fog, implements a parser for C++ (in the middle of implementing some other goal) consisting of a LALR(1) superset grammar and defers significant processing to resolve ambiguities to subsequent stages. So in a way you're right - although this is much uglier than what's needed for C. In another sense, though, not just C but many other programming languages have some context-sensitive semantic processing after parsing, like for instance identifier resolution and type-checking (I don't know about scripting languages). – Blaisorblade Jan 11 '12 at 16:42
  • 3
    @Blaisorblad: you don't have a lot of choice. If your parser accepts "too little", you probably reject real programs. You can't make a parser accept *exactly* a valid string unless your parser is Turing capable. So, you build a context free parser that accepts too much, and hack away the excess that it accepts. – Ira Baxter Jan 11 '12 at 23:25
  • 56
    @altie Surely no one would overload a bit-shift operator to make it write most variable types to a stream, right? – Troy Daniels Mar 12 '13 at 21:24
  • @Troy: How is that different than Java overloading the add operator "+" to concenate two strings? – Ira Baxter Sep 07 '19 at 21:15
  • 1
    @IraBaxter There are similarities, although it seems like concatenation is a logical interpretation of "adding" when applied to strings, much more than bit-shifting an output stream. If I asked a non-programmer to "add 'abc' to 'def'", I feel I have a decent chance of getting 'defabc' as the answer. If I ask them to "shift the output by 'xyzzy' bits", I feel a very confused expression is about the best I can hope for. – Troy Daniels Sep 09 '19 at 17:36
  • @Troy: yes, I kind of agree about the intuition (although if you ask most non-C/C++ programmers what "<<" does, you won't get an answer). But in either case, the semantics of the different interpretations are simply... different. So its case of what arbitrary semantics, do you want to attach to what arbitrary syntax? – Ira Baxter Sep 09 '19 at 20:01
  • 2
    (x)-x is also an example of ambiguity, where (x) may be interpreted as an expression or as a type casting. Moreover, this expression may be used in any meaningful context, avoiding the discussion about overloading *. – dodecaplex Nov 13 '19 at 10:22
  • Grasshopper: you are swimming against the current. You can argue that C's grammar is unnecessarily complex; if you follow that argument to the limit you can always define your language as S-expressions as LISP does, and then parsers are basically trivial. If you think that LISP S-expressions are clumsy, then you will design a language with lots of idiomatic forms. Over time, people graft extra stuff in. .... – Ira Baxter Nov 01 '21 at 03:25
  • ... The modern version of pretty much every language I'm familiar with, has things that make recursive descent parsing harder than you'd like. If you use the right parser generator, you stop caring because these variations are all handled by the parser generator without fuss. (Seriously, check out GLL or GLR Parsing). Regarding C, you'll find parsing it harder than you think due to the need to implement an accurate preprocessor, and (if you insist on recursive descent), the need to track types during parsing. That latter for me tangles parsing and symbol tables, and that's just stupid. – Ira Baxter Nov 01 '21 at 03:27
  • If I understand things correctly, modern C++ allowing a type definition `typedef int x;` as part of a class definition before or _after_ some method of said class containing `x * y;`, would ostensibly make the program unparsable for LR parsers with insufficient look-ahead, _specifically_ even those that would employ a "hack" utilizing symbol collection. The problem lies in that they'd simply have to defer determining nature of `x`, until a _later_ `typedef`, if there's one at all. Solving that would make for a far more complicated hack, wouldn't it? – Armen Michaeli Apr 06 '23 at 20:38
  • @ArmenMichaeli: (Yes). If you insist your parser needs type information for identifiers, and your language allows identifer use earlier in the source, with type declaration later, then LALR parsing will be a complete disaster precisely because it is *left* to right parsing. It will encounter the identifier without type information... now what does it do to proceed? You can hack the LALR parser to backtrack, and then perhaps you can assume a type on first encounter and verify it later. Even that will fail if an identifier use *might* be of many types because of too much backtracking. – Ira Baxter Apr 07 '23 at 03:03
97

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous, context-dependent and potentially requires infinite lookahead to resolve some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

The example is:

int(x), y, *const z;

meaning

int x;
int y;
int *const z;

Compare to:

int(x), y, new int;

meaning

(int(x)), (y), (new int));

(a comma-separated expression).

The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.

anatolyg
  • 26,506
  • 9
  • 60
  • 134
Rob Walker
  • 46,588
  • 15
  • 99
  • 136
  • 29
    It'd be cool to have some summary about the page 147 on this page. I'm going to read that page though. (+1) – Cheery Oct 28 '08 at 14:11
  • 11
    The example is: int(x), y, *const z; //meaning: int x; int y; int *const z; (a sequence of declarations) int(x), y, new int; //meaning: (int(x)), (y), (new int)); (a comma-separated expression) The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one. – Blaisorblade Jan 06 '12 at 03:20
  • 6
    Well, in that context ∞ means "arbitrarily many" because the lookahead will always be bounded by the input length. – MauganRa May 12 '14 at 22:43
  • 2
    I'm quite puzzled by the citation extracted from a PhD Thesis. If there is an ambiguïty, then, by definition, NO lookahead may ever "resolve" the ambiguity (i.e. decide which parse is the correct oen, as at least 2 parses are considered correct by the grammar). Moreover, the citation mentions the ambiguity of C but the explanation, does not show an ambiguity, but only a non ambiguous example where parsing decision can only be taken after an arbitrary long look-ahead. – dodecaplex Nov 13 '19 at 10:16
15

The problem is never defined like this, whereas it should be interesting :

what is the smallest set of modifications to C++ grammar that would be necessary so that this new grammar could be perfectly parsed by a "non-context-free" yacc parser ? (making use only of one 'hack' : the typename/identifier disambiguation, the parser informing the lexer of every typedef/class/struct)

I see a few ones:

  1. Type Type; is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note that struct Type Type is not ambiguous and could be still allowed).

    There are 3 types of names tokens :

    • types : builtin-type or because of a typedef/class/struct
    • template-functions
    • identifiers : functions/methods and variables/objects

    Considering template-functions as different tokens solves the func< ambiguity. If func is a template-function name, then < must be the beginning of a template parameter list, otherwise func is a function pointer and < is the comparison operator.

  2. Type a(2); is an object instantiation. Type a(); and Type a(int) are function prototypes.

  3. int (k); is completely forbidden, should be written int k;

  4. typedef int func_type(); and typedef int (func_type)(); are forbidden.

    A function typedef must be a function pointer typedef : typedef int (*func_ptr_type)();

  5. template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.

  6. int a,b,c[9],*d,(*f)(), (*g)()[9], h(char); could be forbidden too, replaced by int a,b,c[9],*d; int (*f)();

    int (*g)()[9];

    int h(char);

    one line per function prototype or function pointer declaration.

    An highly preferred alternative would be to change the awful function pointer syntax,

    int (MyClass::*MethodPtr)(char*);

    being resyntaxed as:

    int (MyClass::*)(char*) MethodPtr;

    this being coherent with the cast operator (int (MyClass::*)(char*))

  7. typedef int type, *type_ptr; could be forbidden too : one line per typedef. Thus it would become

    typedef int type;

    typedef int *type_ptr;

  8. sizeof int, sizeof char, sizeof long long and co. could be declared in each source file. Thus, each source file making use of the type int should begin with

    #type int : signed_integer(4)

    and unsigned_integer(4) would be forbidden outside of that #type directive this would be a big step into the stupid sizeof int ambiguity present in so many C++ headers

The compiler implementing the resyntaxed C++ would, if encountering a C++ source making use of ambiguous syntax, move source.cpp too an ambiguous_syntax folder, and would create automatically an unambiguous translated source.cpp before compiling it.

Please add your ambiguous C++ syntaxes if you know some!

Community
  • 1
  • 1
reuns
  • 854
  • 10
  • 14
  • 4
    C++ is too well entrenched. Nobody will do this in practice. Those folks (like us) that build front ends simply bite the bullet and do the engineering to make the parsers work. And, as long as templates exist in the language, you're not going to get a pure context-free parser. – Ira Baxter Mar 09 '13 at 09:44
  • @IraBaxter, and then along came good ol' Herb with [cppfront](https://github.com/hsutter/cppfront). He even states he wants a CFG for C++. – SO_fix_the_vote_sorting_bug Dec 05 '22 at 03:39
  • @SO_fix_the_vote_sorting_bug: So, 10 years after my last remark, somebody tries to build such a tool. Herb is very smart guy, and he has built himself a toy (quote: "hilariously incomplete"). When it is production quality and widely accepted, we can re-discuss whether it will take over from standard C++. – Ira Baxter Dec 05 '22 at 04:08
9

As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).

Community
  • 1
  • 1
Sam Harwell
  • 97,721
  • 20
  • 209
  • 280
  • 4
    Parsing technology that handles ambiguity simply produces *both* AST variants as they parse, and simply eliminate the incorrect one depending on the type information. – Ira Baxter Sep 02 '09 at 16:51
  • @Ira: Yes, that is correct. The particular advantage of that is it allows you maintain the separation of the first-stage parse. While it's most commonly known in the GLR parser, there's no particular reason I see that you couldn't hit C++ with a "GLL?" parser as well. – Sam Harwell Sep 02 '09 at 17:50
  • "GLL"? Well, sure, but you'll have to go figure out the theory and write up a paper for the rest of to use. More likely, you could use a top down hand coded parser, or a backtracking LALR() parser (but keep the "rejected") parses, or run an Earley parser. GLR has the advantage of being a pretty damn good solution, is well documented and by now well proved. A GLL technology would have to have some pretty significant advantages to display GLR. – Ira Baxter Jan 28 '10 at 18:28
  • The Rascal project (Netherlands) is claiming they are building a scannerless GLL parser. Work in progress, might be hard to find any online information. http://en.wikipedia.org/wiki/RascalMPL – Ira Baxter Aug 21 '10 at 21:46
  • 1
    @IraBaxter There seems to be new developments on GLL: see this 2010 paper about GLL http://dotat.at/tmp/gll.pdf – Sjoerd Nov 19 '11 at 18:47
6

I think you are pretty close to the answer.

LR(1) means that parsing from left to right needs only one token to look-ahead for the context, whereas LR(∞) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.

Ry-
  • 218,210
  • 55
  • 464
  • 476
casademora
  • 67,775
  • 17
  • 69
  • 78
  • 4
    I recall from my compilers class that LR(n) for n > 0 is mathematically reducable to LR(1). Is that not true for n = infinity? – rmeador Oct 28 '08 at 14:41
  • 15
    No, there's an impassable mountain of a difference between n and infinity. – ephemient Oct 28 '08 at 14:53
  • 4
    Isn't the answer: Yes, given an infinite amount of time? :) – Steve Fallows Oct 28 '08 at 15:56
  • 7
    Actually, by my *vague* recollection of how LR(n) -> LR(1) takes place, it involves creating new intermediate states, so the runtime is some non-constant function of 'n'. Translating LR(inf) -> LR(1) would take infinite time. – Aaron Dec 05 '08 at 21:07
  • 8
    "Isn't the answer: Yes, given an infinite amount of time?" -- No: the phrase 'given an infinite amount of time' is just a non-sensical, short-hand way of saying "cannot be done given any finite amount of time". When you see "infinite", think: "not any finite". – ChrisW Jun 17 '09 at 02:10
  • 1
    Infinite lookahead may be necessary, but it isn't sufficient. See my answer on the necessity to use context to resolve the choice. – Ira Baxter Mar 09 '13 at 09:42
6

The "typedef" problem in C++ can be parsed with an LALR(1) parser that builds a symbol table while parsing (not a pure LALR parser). The "template" problem probably cannot be solved with this method. The advantage of this kind of LALR(1) parser is that the grammar (shown below) is an LALR(1) grammar (no ambiguity).

/* C Typedef Solution. */

/* Terminal Declarations. */

   <identifier> => lookup();  /* Symbol table lookup. */

/* Rules. */

   Goal        -> [Declaration]... <eof>               +> goal_

   Declaration -> Type... VarList ';'                  +> decl_
               -> typedef Type... TypeVarList ';'      +> typedecl_

   VarList     -> Var /','...     
   TypeVarList -> TypeVar /','...

   Var         -> [Ptr]... Identifier 
   TypeVar     -> [Ptr]... TypeIdentifier                               

   Identifier     -> <identifier>       +> identifier_(1)      
   TypeIdentifier -> <identifier>      =+> typedefidentifier_(1,{typedef})

// The above line will assign {typedef} to the <identifier>,  
// because {typedef} is the second argument of the action typeidentifier_(). 
// This handles the context-sensitive feature of the C++ language.

   Ptr          -> '*'                  +> ptr_

   Type         -> char                 +> type_(1)
                -> int                  +> type_(1)
                -> short                +> type_(1)
                -> unsigned             +> type_(1)
                -> {typedef}            +> type_(1)

/* End Of Grammar. */

The following input can be parsed without a problem:

 typedef int x;
 x * y;

 typedef unsigned int uint, *uintptr;
 uint    a, b, c;
 uintptr p, q, r;

The LRSTAR parser generator reads the above grammar notation and generates a parser that handles the "typedef" problem without ambiguity in the parse tree or AST. (Disclosure: I am the guy who created LRSTAR.)

  • 2
    That's the standard hack used by GCC with its former LR parser to handle the ambiguity of things like "x*y;" Alas, there's still the arbitrarily large lookahead requirement to parse other constructs, so LR(k) fails to be a solution any fixed k. (GCC switched to recursive descent with more ad hockery). – Ira Baxter Feb 04 '20 at 11:20
  • LRSTAR is at https://sourceforge.net/projects/lrstar/ – Paul B Mann Feb 05 '22 at 19:09
  • "LRSTAR is at sourceforge.net/projects/lrstar" seems to be a link that is no longer valid ! – user2228895 Nov 08 '22 at 08:20