I'm trying to parse C++ code and create an AST. What I want to do is extract some simple reflection information (class names, member variables and their types, etc.). I don't need to compile the code or generate binaries. I am looking for the simplest possible way to do this. Ideally, I would like a small parser, in a single static library, with no dependencies.

I've been looking around, and it appears that a Bison parser may be able to do this for me. I've tried to find an open source parser, but all Google will give me is C++ wrappers for Bison, and not a Bison parser for C++. Searching for "C++ parser" also fails: it returns parsers for everything else that happen to be written in C++.

Is there an open source project that will do what I need?

CuriousGeorge
  • The clang frontend. C++'s grammar is not context-free; parsing and semantic analysis are connected and you do need most of a compiler. – Ben Voigt Jul 24 '14 at 04:54
  • No such thing as a "small C++ parser". The language is enormous. – Ira Baxter Jul 25 '14 at 04:07
  • @BenVoigt: Actually, various C++ grammars exist which can be used to parse C++ in a context-free way (see our tools using GLR parsers). You get ambiguity nodes where there is more than one syntactic interpretation, true, but it means you can parse well-formed C++ files without necessarily having all the definitions. (The preprocessor still gives trouble, but we have a way to handle that, too.) – Ira Baxter Jul 25 '14 at 04:13
  • I'm sorry, I missed the fact that you asked only for open source. I am deleting my response describing a non-open source solution. – Ira Baxter Jul 27 '14 at 03:45
  • Bison isn't enough. It was possible for C++98 to make a parser with Bison and a lot of ugly hackery. C++11 is quite a bit more complicated, and it isn't clear that Bison can really help even with a lot of hackery (in particular, C++11 in some places seems to require huge lookaheads, and Bison cannot do that). You really don't want to do this by hand either. – Ira Baxter Jul 31 '14 at 03:24
  • @Ira Thanks for the info. I also read the "x * y ;" example in one of your other answers, which helped me understand Ben's comment about C++ grammar not being context-free. So I understand that I can't just pick up any one file and extract type information from it. Also, I was overlooking the obvious need for a preprocessor. – CuriousGeorge Jul 31 '14 at 17:42
  • The X*Y example isn't about non-context-free *grammars*; it is about ambiguities in a context-free grammar. As a practical matter, you can only build parsers for a context-free grammar and address the non-context-freeness outside the grammar (yes, there are real context-sensitive parser generators that exist, but they aren't used seriously.) Bottom line: parsing C++ is just plain hard. – Ira Baxter Jul 31 '14 at 18:29
  • I'm not sure I understand exactly what you mean, but I do agree with your last point :) – CuriousGeorge Jul 31 '14 at 18:51
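
The "x * y;" case discussed in the comments above can be sketched in a few lines. This is only an illustration, not code from the thread: the namespace names `as_declaration` and `as_expression` are made up for the example. The same three tokens read as a pointer declaration when `x` names a type and as a multiplication when `x` names a variable, so a parser either needs symbol information or has to keep both readings until later.

```cpp
#include <type_traits>

// Hypothetical namespaces, used only to give "x" two different meanings.
namespace as_declaration {
    struct x {};
    x * y;  // x names a type here, so this declares y as a pointer to x
    static_assert(std::is_same<decltype(y), x*>::value,
                  "y is declared with type x*");
}

namespace as_expression {
    constexpr int x = 6;
    constexpr int y = 7;
    // x names a variable here, so the same tokens "x * y" form a multiplication.
    constexpr int product = x * y;
    static_assert(product == 42, "x * y is evaluated as 6 * 7");
}
```

Both namespaces compile as written; the only difference between the two readings is what `x` was declared as earlier in the file.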

2 Answers

clang can do this:

clang -Xclang -ast-dump -fsyntax-only test.cc

Also see the docs.
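
To try the command, a small input file is enough. The following test.cc is a hypothetical example (the type names `Vec2` and `Player` are invented here, not taken from the question). Running the command on it should print a tree in which each class shows up as a `CXXRecordDecl` node and each member variable as a `FieldDecl` node with its type, which is the kind of reflection information being asked about. The exact layout of the dump varies between clang versions.

```cpp
// test.cc: hypothetical input for the -ast-dump command above.
// No headers are included, which keeps the dump short.

struct Vec2 {
    float x;
    float y;
};

class Player {
public:
    int health;        // built-in member type
    Vec2 position;     // user-defined member type
    const char* name;  // pointer member type
};
```

Because `-fsyntax-only` stops after parsing and semantic analysis, the file does not need a `main` function.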

perreal
  • Should probably mention this outputs to `stdout` and errors obviously to `stderr`. – Rapptz Jul 24 '14 at 05:02
  • Alternatively, try libclang python bindings - parsing the ast dump is not that easy. – SK-logic Jul 25 '14 at 15:45
  • At this point, clang seems like the most attractive option. I don't think it makes sense to try and parse the ast-dump as text when it's already stored in binary format inside the compiler, but I suppose the source of the ast-dump function is a good place to start. – CuriousGeorge Jul 31 '14 at 17:56

You can use GCC-XML to generate a fairly easy to parse XML representation of most (but not all) C++ code.

John Zwinck
  • I took a look at GCC-XML, and it seems like it may do what I want, but it appears to depend on having certain versions of other compilers installed on the target machine. As of right now, it only appears to support up to Visual Studio 2010, which is unacceptable. A quick glance at the website seems to indicate that it doesn't support clang either. – CuriousGeorge Jul 31 '14 at 17:49