20

Is it theoretically up to the task?

Can it be done practically and would the resulting parser be used with sufficient performance and output (say, LLVM IR or GCC's gimple) to be integrated in a competing compiler?

Johannes Schaub - litb
  • 496,577
  • 130
  • 894
  • 1,212
rubenvb
  • 74,642
  • 33
  • 187
  • 332
  • 20
    Writing a C++ parser is never practical. – ildjarn May 19 '11 at 19:36
  • My guess is that Spirit can be used to parse anything you can describe in EBNF. Never saw a complete EBNF spec for C++ tho. –  May 19 '11 at 19:36
  • 3
    @Vlad: you've got me googling [link!](http://www.externsoft.ch/download/cpp-iso.html#syntax) – rubenvb May 19 '11 at 19:40

3 Answers3

16

I'm sorry. I talked to its author, and he said he won't make it parse C++ fully, but admits that he accepts it to parse certain constructs as ambiguous.

So this is not an answer anymore!!


I recommend you to have a look at scalpel. From its homepage

Scalpel stands for source code analysis, libre and portable library. This is a C++ library which aims to perform full syntax and semantic analysis of any given C++ program.

And

What makes me think Scalpel could be accepted into Boost

Scalpel uses itself several Boost libraries: Spirit, Wave, shared_ptr (now in C++0x's STL), Optional, Test, etc.. Actually, it exclusively uses Boost libraries and the C++ standard library, which is required by Boost.

Besides, Boost already provides a Spirit-based C++ source code preprocessing library: Wave. Including a C++ source code analysis library seems to be a natural evolution.

Johannes Schaub - litb
  • 496,577
  • 130
  • 894
  • 1,212
  • From the author of Scalpel: "I know there are some ambigous cases that a standalone syntax analyzer couldn't manage, but IMHO, they're way too rare and avoidable to be considered." So it won't be able to do all cases correctly. Source: near the end of http://groups.google.com/group/scalpel-users/browse_thread/thread/1231dcdd7d951009 – Sjoerd May 20 '11 at 23:43
  • @Sjoerd I have never tried to write a C++ analyzer either by hand or by support of a "standalone syntax analyzer", so I can't comment on that. Please note that scalpel is not a "standalone syntax analyzer", but it also does semantic analysis. So that statement doesn't necessarily apply to scalpel as a whole, but only to its syntax analyzer. I.e it could deem `id id;` a variable declaration first, and in semantic analysis, when it detects that `id` is a function type, change it to a function declaration. I don't know what he's doing behind the scenes. – Johannes Schaub - litb May 21 '11 at 10:03
  • 2
    UPDATE: In between I asked him what he's doing, and he said he's just ignoring all the ambiguities in the grammar :( So I guess his parser doesn't count as parsing C++ -.- – Johannes Schaub - litb Jun 10 '11 at 14:13
  • I wanted to use it too, but it does not process function bodies - so sad :( – Slaus Jun 28 '11 at 19:33
  • @rubenvb: You should really change the "accepted answer" away from this one. – Ira Baxter Mar 01 '15 at 09:23
5

No. C++ is too hard to parse for most automatic tools, and in practice usually is parsed by hand written parsers. [Edit 1-Mar-2015: Added 'most' and 'usually'.]

Among the hard problems are:

  • A * B; which could be either the definition of a variable B with type A* or just the multiplication of two variables A and B.
  • A < B > C > D Where does the template A<> end? The usual 'max-munch' rules for parsing expressions will not work here.
  • vector<shared_ptr<int>> where the >> ends two templates, which is hard to do with only one token (and a space in between is allowed). But in 1>>15 no space is allowed.

And I bet that this list is far from complete.

Addition: The grammar is available, but is ambiguous and thus not valid as input to tools like Spirit.

Update 1-Mar-2015: As Ira Baxter, a well known expert in this field, points out in the comments, there are some parser generators that can generate a parser that will generate the full parser forest. As far as I know, selecting the right parse still requires a semantic phase. I'm not aware of any non-commercial parser generators that can do so for C++'s grammar. For more information, see this answer.

Community
  • 1
  • 1
Sjoerd
  • 6,837
  • 31
  • 44
  • 3
    Parsing and interpreting are two different things. You can still parse `A * B` without knowing what A and B actually are. –  May 19 '11 at 19:44
  • 6
    Spirit is fully capable of contextual disambiguation via semantic actions. Who is voting up this nonsense? – ildjarn May 19 '11 at 19:57
  • 1
    @Vlad But then you would accept many invalid constructs as well. E.g. how do you avoid accepting `A*B*C;` when A is a typedef? – Sjoerd May 19 '11 at 19:57
  • 1
    @ildjarn GCC tried it for a long time (with bison), but switched to a hand-written recursive decent parser at some point. – Sjoerd May 19 '11 at 20:03
  • @Sjoerd : I'm not necessarily saying it's the ideal approach, but that's also not what the OP asked -- she asked if it was "theoretically up to the task", and the answer is quite clearly 'yes', especially considering Boost.Wave already has a complete C++03 Spirit.Classic grammar. – ildjarn May 19 '11 at 20:05
  • @ildjarn Why don't you turn that comment into an answer? – Sjoerd May 19 '11 at 20:08
  • @Sjoerd : Because the other half of the question relates to performance, which I don't know the answer to. :-[ The Spirit.Classic implementation is almost certainly *not* performant enough for use in a real compiler; I suspect a Spirit.Qi implementation would have sufficient performance, but I can't say that definitively. – ildjarn May 19 '11 at 20:11
  • 1
    @ildjarn Theoretically, one "any*" rule with enough semantic actions added would do. – Sjoerd May 19 '11 at 20:16
  • 1
    A bit late, but "C++ is too hard to parse with automatic tools" is just plain wrong. I'll agree that Spirit probably can't do it due to need to handle ambiguities, and arbitrary lookahead. But GLR parsers can do this just fine. See http://stackoverflow.com/a/1004737/120163 – Ira Baxter Feb 27 '15 at 15:11
  • @IraBaxter Thanks for your comment. I hope the edit has improved the answer. – Sjoerd Feb 28 '15 at 23:39
  • Yep. Agreed, you need a semantic pass to choose "the right" parse from the set of possible syntax parses. – Ira Baxter Mar 01 '15 at 01:36
2

For "any other language", I once tried creating a shell script parser with Spirit. It turned out to be theoretically possible (I believe it would work), but it was not compilable on a machine with 1 GB memory, so eventually I gave up.

jpalecek
  • 47,058
  • 7
  • 102
  • 144
  • You just needed to break up your grammar across multiple compilation units. Scalpel as mentioned above does so and can compile. – Raindog Nov 26 '13 at 07:14