Can an LL(*) parser (like ANTLR3) parse C++?

Question

I need to create a parser for C++14. The parser itself must be written in C++ so that I can reuse legacy code-generation code. I am thinking of implementing it with ANTLR3 (because ANTLR4 doesn't target C++ yet).

My doubt is whether ANTLR3 can parse C++, since it doesn't use the Adaptive LL(*) algorithm that ANTLR4 does.

*"I need to create a parser to C++ 14."*... Why? Also, if you really need that, did you explore LLVM and its capabilities? — Nawaz, May 29 '16 at 03:13
You do realize that [C++'s grammar is unrestricted](http://stackoverflow.com/questions/14589346/is-c-context-free-or-context-sensitive), not context-free, right? — Cornstalks, May 29 '16 at 03:35
And grammatically correct input can produce invalid C++ code: `void m() { m++;}` — Rerito, May 29 '16 at 09:02
@Rerito: Pretty much all *parsers* accept "too much" (e.g., programs that look legal but are not, due to context constraints and sometimes even due to grammar constraints that are not honored by the specific parsing machinery). This means that after "raw" parsing, your parsing engine has to do further checking (e.g., type checking for your example) to eliminate the "too much". See my answer on parsing C++ vs. type checking: http://stackoverflow.com/a/37506227/120163 — Ira Baxter, May 29 '16 at 15:29

score 4 · Answer 1 · edited May 23 '17 at 11:52

Most classic parser generators cannot generate a parser for an arbitrary context-free grammar. The restrictions on the grammars they can handle often give the class of parser generator its name: LL(k), LALR, ... ANTLR3 is essentially LL; ANTLR4 is better but still cannot handle every context-free grammar.

Earley, GLR, and GLL parser generators can parse context free languages, sometimes with high costs. In practice, Earley tends to be pretty slow (but see the MARPA parser generator used with Perl6, which I understand to be an Earley variant that is claimed to be reasonably fast). GLR and GLL seem to produce working parsers with reasonable performance.

My company has built about 40 parsers for real languages using GLR, including all of C++14, so I have a lot of confidence in the utility of GLR.

When it comes to parsing C++, you're in a whole other world, mostly because C++ parsing seems to depend on collecting symbol table information at the same time. (It isn't really necessary to do that if you can parse context-free).
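To make that dependence concrete, here is a minimal sketch of the classic case: the same tokens `T * x` are a declaration when `T` names a type and a multiplication when `T` names a variable. The namespace and function names are made up for illustration only.

```cpp
#include <cassert>
#include <type_traits>

namespace type_context {
    using T = int;

    // Here T names a type, so "T * x" declares x as a pointer to int.
    inline bool is_declaration() {
        T * x = nullptr;
        assert(x == nullptr);
        return std::is_same<decltype(x), int*>::value;
    }
}

namespace value_context {
    // Here T and x name variables, so "T * x" is a multiplication.
    inline int product() {
        int T = 6;
        int x = 7;
        return T * x;
    }
}
```

A parser that sees only the tokens cannot tell these two readings apart. It either has to know what `T` is at that point, or keep both readings and resolve them later.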

You can probably make ANTLR4 (and even ANTLR3) parse C++ if you are willing to fight it hard enough. Essentially what you do is build a parser which accepts too much [often due to limitations of the parser generator class], and then uses ad hoc methods to strip away the extra. This is essentially what the hand-written GCC and Clang parsers do; the symbol table information is used to force the parser down the right path.
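As a rough illustration of "using symbol table information to force the parser down the right path", the sketch below decides how to read a statement of the shape `lhs * rhs;`. Everything in it (`SymbolTable`, `StatementKind`, `classify_star_statement`) is a hypothetical name invented for this example. It is not how GCC, Clang, or ANTLR actually implement this; a real front end must also deal with scopes, templates, and much more.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical symbol table: just the set of names currently known to be types.
struct SymbolTable {
    std::set<std::string> type_names;

    bool is_type(const std::string& name) const {
        return type_names.count(name) != 0;
    }
};

enum class StatementKind { Declaration, Expression };

// Classify a statement of the shape "lhs * rhs;".
// The grammar alone permits both readings; the symbol table picks one.
StatementKind classify_star_statement(const SymbolTable& symbols,
                                      const std::string& lhs,
                                      const std::string& rhs) {
    assert(!lhs.empty() && !rhs.empty());
    if (symbols.is_type(lhs)) {
        return StatementKind::Declaration;  // "T * x;" declares a pointer
    }
    return StatementKind::Expression;       // "a * b;" multiplies two values
}
```

In an ANTLR grammar, a check like `is_type` would typically sit inside a semantic predicate guarding the declaration alternative, with the table filled in as declarations are parsed.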

If you choose to go down this path of building your own parser, no matter which parser generator you choose, you will invest huge amounts of energy to get a working parser. [Been here; done this]. This isn't a good way to get on with whatever task motivates building this parser.

I suggest you get one that already works. (I've already mentioned two, GCC and Clang; you can find out about our parser through my bio if you want).

That will presumably leave you with a working parser. Then you want to do something with the parse tree, and you'll discover that Life After Parsing requires a lot of machinery that the parsers don't provide. Google the phrase to find my essay on the topic or check my bio.

Thank you, Ira Baxter. I was using Bison to parse C++, but it doesn't allow access to information about sub-nodes inside semantic predicates during parsing. ANTLR allows this, but my big question was about the LL(*) parsing algorithm. I read that it has infinite lookahead. I am thinking about trying ANTLR. — Cleverson Ledur, May 31 '16 at 17:14
Well, best of luck to you. Have you really considered the effort level needed to achieve success? — Ira Baxter, May 31 '16 at 17:26
I recognize that it will be hard work. However, if I succeed, it will enable a range of research and source-to-source transformations. My first objective is to save all C++ declarations and use this information in semantic predicates to avoid ambiguities. I already have an (almost) working parser for C++ (developed using Bison), but it doesn't deal with ambiguities (when the parser doesn't know whether an identifier is a type, a declarator, etc.). Thank you again for the answer and sorry for the bad English. — Cleverson Ledur, May 31 '16 at 18:02
I don't know if you've checked my bio. *I've already built a tool that can apply source-to-source transformations to C++*, one that handles all those ambiguities with full name resolution. Most would-be users don't like the fact that this tool is commercial [We have probably 10 linear years of engineering in this, and that engineering isn't cheap]. You won't fare better if you build another commercial tool; I'd welcome the competition because it shows the idea is acceptable to the community. You may be considering building a free tool; where will you get the resources to do that? — Ira Baxter, May 31 '16 at 19:26
