
I'm acquainted with the fact that the grammars of C and C++ are context-sensitive, and in particular you need a "lexer hack" in C. On the other hand, I'm under the impression that you can parse Java with only 2 tokens of look-ahead, despite considerable similarity between the two languages.

What would you have to change about C to make it more tractable to parse?

I ask because all of the examples I've seen of C's context-sensitivity are technically allowable but awfully weird. For example,

foo (a);

could be calling the void function foo with argument a. Or, it could be declaring a to be an object of type foo, but you could just as easily get rid of the parentheses. In part, this weirdness occurs because the "direct declarator" production rule for the C grammar fulfills the dual purpose of declaring both functions and variables.

On the other hand, the Java grammar has separate production rules for variable declaration and function declaration. If you write

foo a;

then you know it's a variable declaration and foo can unambiguously be parsed as a typename. This might not be valid code if the class foo hasn't been defined somewhere in the current scope, but that's a job for semantic analysis that can be performed in a later compiler pass.

I've seen it said that C is hard to parse because of typedef, but you can declare your own types in Java too. Which C grammar rules, besides direct_declarator, are at fault?

Daniel Shapero
  • Cool question. Probably way too broad or primarily opinionated though. – asteri Oct 12 '14 at 22:01
  • This is a valid question about parsers and the only thing broad or opinion-based about it is the last couple of sentences (which should probably be dropped or changed). Quit with the close votes. – R.. GitHub STOP HELPING ICE Oct 12 '14 at 22:10
  • Virtually *every* (standard) computer language is *context sensitive*; you can't declare a variable of one type and then misuse it in most *languages*. That is different from "all *grammars* for the language" being context sensitive; most people building parsers build a context-free (or even more restrictive) parser, and then use hacks outside the parser to check the context-free properties. – Ira Baxter Oct 12 '14 at 22:43
  • @IraBaxter I wouldn't call that "hacks". Splitting the problem in two seems a reasonable thing to do, since parsing context-sensitive languages cannot be done efficiently (and in fact even parsing context-free languages isn't efficient, which is why we generally restrict to subsets of context-free). A context-free parse + static analysis to check only context-sensitive properties over the AST is a reasonable thing to do. – Bakuriu Oct 13 '14 at 05:32
  • by the way, I believe Knuth proved that languages with `k` lookahead tokens are actually equivalent to languages with `1` lookahead token, i.e. `LL(k)` is the same set of languages as `LL(1)` (although not sure whether this result is valid for `LR` and `LALR` grammars too). The fact that the standard or some compiler doesn't use a grammar with only `1` token of lookahead is not a problem of the language itself but of the implementation. – Bakuriu Oct 13 '14 at 05:35
  • All: "to check the context-free properties" should be "to check the context-sensitive properties". – Ira Baxter Oct 13 '14 at 08:30
  • @Bakuriu: I called them "hacks" because they aren't part of the parsing machinery, and don't have an obvious structure. If you prefer to call them "ad hoc static analyses" instead, I can live with that. I agree the partitioning into two problems is the right thing to do; I was pointing out that is what happens in most cases. Regarding "efficient" parsing: GLR does very well on full context-free grammars in practice; I have some 40+ grammars for mainstream languages using this (you can construct arcane context-free grammars that give it a hard time, but they just don't occur in the real world). – Ira Baxter Oct 13 '14 at 08:33
  • @Bakuriu: I also believe somebody (Knuth?) proved that. I think the construction of LR(1) from LR(k) etc. blows up the size of the grammar, which makes it impractical to use LR(1) for LR(k). And this does not handle context-free grammars that need arbitrary lookahead, which are common. GLR does all of these reasonably efficiently without making you twist a conceptually straightforward grammar. – Ira Baxter Oct 13 '14 at 08:40

1 Answer


Parsing C++ is getting hard. Parsing Java is getting to be just as hard.

See this SO answer discussing why C (and C++) is "hard" to parse. The short summary is that C and C++ grammars are inherently ambiguous; they will give you multiple parses, and you must use context to resolve the ambiguities. People then make the mistake of assuming you have to resolve ambiguities as you parse; not so, see below. If you insist on resolving ambiguities as you parse, your parser gets more complicated and that much harder to build; but that complexity is a self-inflicted wound.

IIRC, Java 1.4's "obvious" LALR(1) grammar was not ambiguous, so it was "easy" to parse. I'm not so sure that modern Java doesn't have at least long-distance local ambiguities; there's always the problem of deciding whether "...>>" closes off two templates or is a "right shift operator". I suspect modern Java does not parse with LALR(1) anymore.

But one can get past the parsing problem by using strong parsers (or weak parsers plus context-collection hacks, as C and C++ front ends mostly do now) for both languages. C and C++ have the additional complication of a preprocessor; these are more complicated in practice than they look. One claim is that the C and C++ parsers are so hard they have to be written by hand. It isn't true; you can build Java and C++ parsers just fine with GLR parser generators.

But parsing isn't really where the problem is.

Once you parse, you will want to do something with the AST/parse tree. In practice, you need to know, for every identifier, what its definition is and where it is used ("name and type resolution"; sloppily, building symbol tables). This turns out to be a LOT more work than getting the parser right, compounded by inheritance, interfaces, overloading and templates, and confounded by the fact that the semantics for all this is written in informal natural language spread across tens to hundreds of pages of the language standard. C++ is really bad here. Java 7 and 8 are getting to be pretty awful from this point of view. (And symbol tables aren't all you need; see my bio for a longer essay on "Life After Parsing".)

Most folks struggle with the pure parsing part (often never finishing; check SO itself for the many, many questions about how to build working parsers for real languages), so they never see life after parsing. And then we get folk theorems about what is hard to parse and no signal about what happens after that stage.

Fixing C++ syntax won't get you anywhere.

Regarding changing the C++ syntax: you'll find you need to patch a lot of places to take care of the variety of local and real ambiguities in any C++ grammar. If you insist, the following list might be a good starting place. I contend there is no point in doing this if you are not the C++ standards committee; if you did so, and built a compiler using that, nobody sane would use it. There's too much invested in existing C++ applications to switch for convenience of the guys building parsers; besides, their pain is over and existing parsers work fine.

You may want to write your own parser. OK, that's fine; just don't expect the rest of the community to let you change the language they must use to make it easier for you. They all want it easier for them, and that's to use the language as documented and implemented.

Ira Baxter
  • Good answer. See also D and C+, which try to resolve some of these issues. – david.pfx Oct 12 '14 at 23:02
  • @david.pfx: I know about D; it also tries to clean up the language semantics, too. Didn't know about C+. Do know that, in spite of the existence of these, the C++-as-is community doesn't seem to be in any danger of becoming extinct. – Ira Baxter Oct 12 '14 at 23:05
  • I've read Life After Parsing before and found it to be a real eye-opener; it made it clear to me that there's much more work in semantic analysis (name/type resolution, ...) than there is in parsing. I'm *not* trying to change the syntax of any language. I *do* want to understand what the properties are of a language in which you can do the syntactic analysis first and then the semantic analysis. C is not such a language (it needs the lexer hack); I always thought that Java was, and I want to know why. – Daniel Shapero Oct 12 '14 at 23:52
  • @Korrok: read my answer about building Java/C++ with GLR parsers. *You don't need any lexer hack*. So the distinction is in the mind of people that are using the wrong parsing technology. ... Granted, building a full C++ front end (esp. C++14, which we have done) is harder than doing Java 8, but they're both hard (in terms of effort and paying attention to detail) and parsing is the easiest piece. – Ira Baxter Oct 13 '14 at 00:00
  • @korrok: I have built a full scale commercial compiler (and a number of little ones). People have asked why I don't use YACC or LEX and my answer is: why bother? Lexing and parsing are so easy compared to what comes next I've never seen a need. Listen to Ira: he speaks true. – david.pfx Oct 13 '14 at 03:22
  • Re: "there's always the problem of deciding whether '...>>' closes off two templates or is a 'right shift operator'": That would be difficult to do in the lexer, but if the lexer emits `>>`, it's straightforward for the parser to convert that to `>` `>` if necessary, since in Java (unlike in C++), the arguments of a template are always types, never values. A trickier example is something like `foo = ( bar <`, where we may potentially need a great deal of lookahead to distinguish the parse found in (e.g.) `foo = (bar) baz;` from that found in (e.g.) `foo = (bar < baz) ? 0 : 1;`. – ruakh Oct 13 '14 at 03:45
  • @ruakh: What our Java lexers do is emit two tokens ">" ">" when ">>" is encountered. A shift operator is these two tokens encountered serially, with a semantic check that the second starts in the column after the first one. So the parser has no need to "convert" a ">>" token. We handle the (local) ambiguity by using a GLR parser, which is happy to explore all possible valid parses (efficiently) to find the ones that actually work. – Ira Baxter Oct 13 '14 at 04:15
  • @david.pfx: (Thanks for the vote of confidence). We don't use LEX or YACC either, but for completely different reasons. My guess is you use a recursive descent, hand-written parser; these are relatively easy to build and more importantly are easy to bend to handle strange parsing issues. We use GLR and pure BNF grammars, and never have to bend the parser but do use extra semantic checks to kill inappropriate reductions (see my note about ">" ">" above). This turns out to be amazingly powerful; our C++ front end is entirely driven from a derivative of the grammar in the C++ standard. ... – Ira Baxter Oct 13 '14 at 04:18
  • @david.pfx ... Using a real grammar lets us do lots of other interesting things for "Life Beyond Parsing", such as building symbol tables and constructing control and data flow graphs, by leveraging explicit attribute grammars coded using the base language grammar. So by NOT using recursive descent, we get additional help for later parts of the front end analyzers. – Ira Baxter Oct 13 '14 at 04:19
  • Interesting. I didn't know operators and generics can appear in the same place in Java. – Jörg W Mittag Oct 13 '14 at 11:20
  • Ah, I just realized you're talking about the lexer. – Jörg W Mittag Oct 13 '14 at 11:21
  • I agree about your "Life after Parsing": e.g. overload resolution in C# can encode any 3-SAT problem and is thus NP-hard. – Jörg W Mittag Oct 13 '14 at 11:22
  • @JörgWMittag: Any 3SAT problem? Wow. You have a reference for this fact? – Ira Baxter Oct 13 '14 at 12:53
  • http://blogs.msdn.com/b/ericlippert/archive/2007/03/28/lambda-expressions-vs-anonymous-methods-part-five.aspx – Jörg W Mittag Oct 13 '14 at 12:57
  • @IraBaxter: Not many of us get to build a C++ compiler, and elsewhere very few languages require an LR parser. My wish, if anything, is for tools to make simple languages so trivial to compile that every application will have a compiler or two. But that would be another question... – david.pfx Oct 13 '14 at 14:11
  • C is **not** ambiguous. It is ambiguous **without context**. That does not make it 'inherently ambiguous'. – Miles Rout Oct 14 '14 at 03:10
  • The *language* C is not ambiguous. AFAIK, all context-free *grammars* for it are. – Ira Baxter Oct 14 '14 at 03:13
  • @IraBaxter: I wonder how many ambiguities could have been averted by slightly changing syntax rules when adding new features? For example, adding a colon between a type specifier and the identifier to be declared, but making it optional in cases where there is a type marked by a reserved word like `int` or `struct`, would have allowed `{x*:y;}` to be parsed as a declaration of an object `y` of type `x`, without having to know whether the enclosing scope has numerical objects named `x` and `y`. It would also have improved things for humans, since `int:*p,q;` and `int*:p,q;` would each... – supercat Apr 04 '20 at 19:38
  • ...make the type of `q` much clearer than a declaration like `int* p,q;`. – supercat Apr 04 '20 at 19:39