62

Google didn't turn up anything that seemed relevant.

I have a bunch of existing, working C++ code, and I'd like to use Python to crawl through it and figure out the relationships between classes, etc.

EDIT: Just wanted to point out: I don't think I need or want to parse every bit of C++; I just need something smart enough to pick up on class, function and member variable declarations, and to skip over function definitions.

csbrooks
  • You pretty much can't do this without a full C++ parser. – Ira Baxter Apr 21 '10 at 07:12
  • If you're okay with it not catching the 0.1% edge cases, you might well be able to get away with regex parsing. I'm pretty sure a lot of text editors do this for their syntax highlighting / parsing. For example, Sublime Text comes with regex parsing files for a bunch of languages including C++ (see C++.tmLanguage). – Ben Hoyt Jul 27 '11 at 14:45
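A rough sketch of the regex approach suggested in the comment above, assuming reasonably conventional formatting. The names `CLASS_RE`, `strip_comments` and `find_classes` are made up for illustration. It only picks up class and struct definitions with their base lists, and it will misread template base classes containing commas, macros, and comment markers inside string literals.

```python
import re

# Matches "class Name : bases {" or "struct Name {". Forward declarations
# (ending in ';') are skipped because an opening brace is required.
# The lookbehind rejects "enum class" when separated by a single whitespace.
CLASS_RE = re.compile(
    r"(?<!enum\s)\b(?:class|struct)\s+(?P<name>\w+)\s*"
    r"(?::\s*(?P<bases>[^{;]+?))?\s*\{"
)
ACCESS_WORDS = {"public", "protected", "private", "virtual"}


def strip_comments(source):
    """Remove /* */ and // comments (naive: ignores string literals)."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)


def find_classes(source):
    """Return {class_name: [base_name, ...]} for class/struct definitions."""
    result = {}
    for match in CLASS_RE.finditer(strip_comments(source)):
        bases = []
        for part in (match.group("bases") or "").split(","):
            # Drop access specifiers and "virtual", keep the base name.
            words = [w for w in part.split() if w not in ACCESS_WORDS]
            if words:
                bases.append(" ".join(words))
        result[match.group("name")] = bases
    return result
```

Calling `find_classes(open("widget.h").read())` (the file name is hypothetical) yields a dictionary that can be turned into an inheritance graph. Anything beyond this level of detail is where the answers below recommend a real parser.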

13 Answers

49

Not an answer as such, but just to demonstrate how hard parsing C++ correctly actually is. My favorite demo:

template<bool> struct a_t;

template<> struct a_t<true> {
    template<int> struct b {};
};

template<> struct a_t<false> {
    enum { b };
};

typedef a_t<sizeof(void*)==sizeof(int)> a;

enum { c, d };
int main() {
    a::b<c>d; // declaration or expression?
}

This is perfectly valid, standard-compliant C++, but the exact meaning of the commented line depends on your implementation. If sizeof(void*)==sizeof(int) (typical on 32-bit platforms), it is a declaration of a local variable d of type a::b<c>. If the condition doesn't hold, it is a no-op expression ((a::b < c) > d). Adding a constructor for a::b will actually let you expose the difference via the presence or absence of side effects.

Pavel Minaev
40

C++ is notoriously hard to parse. Most people who try to do this properly end up taking apart a compiler. In fact this is (in part) why Clang, the C-family front end of LLVM, was started: Apple needed a way to parse C++ for use in Xcode that matched the way the compiler parsed it.

That's why there are projects like GCC-XML, which you could combine with a Python XML library.
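As a minimal sketch of that combination, the following reads GCC-XML style output with the standard library's `xml.etree.ElementTree`. It assumes the output lists `Class`/`Struct` elements with a `members` attribute of space-separated ids and `Base` child elements pointing at base classes; the exact attributes vary between GCC-XML versions, and the `SAMPLE` document is hand-written for illustration, not real tool output.

```python
import xml.etree.ElementTree as ET


def class_relationships(xml_text):
    """Map each class/struct name to its bases, fields and methods."""
    root = ET.fromstring(xml_text)
    # Every declaration carries an id that other elements refer to.
    by_id = {el.get("id"): el for el in root if el.get("id")}
    result = {}
    for el in root:
        if el.tag not in ("Class", "Struct"):
            continue
        bases = [
            by_id[b.get("type")].get("name")
            for b in el.findall("Base")
            if b.get("type") in by_id
        ]
        members = [by_id[m] for m in el.get("members", "").split() if m in by_id]
        result[el.get("name")] = {
            "bases": bases,
            "fields": [m.get("name") for m in members if m.tag == "Field"],
            "methods": [m.get("name") for m in members if m.tag == "Method"],
        }
    return result


# Hand-written sample in the assumed format (not real GCC-XML output).
SAMPLE = """<GCC_XML>
  <Class id="_1" name="Base" members="_3 "/>
  <Class id="_2" name="Derived" members="_4 _5 ">
    <Base type="_1" access="public"/>
  </Class>
  <Method id="_3" name="run" context="_1"/>
  <Field id="_4" name="m_count" context="_2"/>
  <Method id="_5" name="reset" context="_2"/>
</GCC_XML>"""

if __name__ == "__main__":
    for name, info in class_relationships(SAMPLE).items():
        print(name, info)
```

Because the compiler front end has already resolved templates and typedefs, the Python side only has to follow id references.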

Some non-compiler projects that seem to do a pretty good job at parsing C++ are:

  • Eclipse CDT
  • OpenGrok
  • Doxygen
Stef
  • +1 - gcc-xml is the way to go, unless you want a paid (and expensive) solution like the EDG front end. – Pavel Minaev Sep 18 '09 at 22:02
  • Note that gcc-xml does not parse everything. Specifically, function bodies are not parsed. – liori Jul 03 '11 at 11:36
  • You should also consider SWIG; I had a good time with the csv module of https://github.com/kamanashisroy/swig. – KRoy Jan 09 '20 at 21:25
7

For many years I've been using pygccxml, which is a very nice Python wrapper around GCC-XML. It's a very full-featured package that forms the basis of some well-used code-generation tools, such as Py++, which is by the same author.

Rod
jkp
5

You won't find a drop-in Python library to do this. Parsing C++ is fiddly, and few parsers have been written that aren't part of a compiler. You can find a good summary of the issues here.

The best bet might be clang, as its C++ support is well-established. Though this is not a Python solution, it sounds as though it would be amenable to re-use within a Python wrapper, given the emphasis on encapsulation and good design in its development.
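Clang does ship Python bindings for libclang (the `clang.cindex` module). Here is a hedged sketch of how they could be used to collect classes with their bases, fields and methods. The helper names `collect_classes` and `parse_file` are made up for illustration, the compiler arguments are assumptions, and the walker compares `cursor.kind.name` strings so it works with anything shaped like a cursor.

```python
CLASS_KINDS = {"CLASS_DECL", "STRUCT_DECL", "CLASS_TEMPLATE"}
MEMBER_KINDS = {
    "CXX_BASE_SPECIFIER": "bases",
    "FIELD_DECL": "fields",
    "CXX_METHOD": "methods",
}


def collect_classes(cursor, found=None):
    """Walk a cursor tree and record bases, fields and methods per class."""
    if found is None:
        found = {}
    for child in cursor.get_children():
        # is_definition() filters out forward declarations.
        if child.kind.name in CLASS_KINDS and child.is_definition():
            info = found.setdefault(
                child.spelling, {"bases": [], "fields": [], "methods": []}
            )
            for member in child.get_children():
                key = MEMBER_KINDS.get(member.kind.name)
                if key:
                    # The spelling of a base specifier differs between
                    # libclang versions ("Base" vs "class Base").
                    info[key].append(member.spelling)
        # Recurse so classes in namespaces and nested classes are found too.
        collect_classes(child, found)
    return found


def parse_file(path, args=("-x", "c++", "-std=c++17")):
    """Parse a real file; needs the third-party clang bindings and libclang."""
    import clang.cindex  # imported lazily so the walker works without it

    tu = clang.cindex.Index.create().parse(path, args=list(args))
    return collect_classes(tu.cursor)
```

Note that the walk also visits declarations pulled in from included headers; filtering on `child.location.file` is one way to restrict it to your own sources.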

jlarcombe
4

Pycparser is a complete and functional parser for ANSI C. Perhaps you can extend it to C++ :-)

user1741137
Eli Bendersky
  • "Perhaps you can extend it to c++" and just how much work would that be? – Kevin Kostlan Aug 06 '14 at 17:11
  • @KevinKostlan: Too much, hence the smiley... I can't honestly recommend this route of action today. I'd use Clang bindings instead. – Eli Bendersky Aug 06 '14 at 17:50
  • @EliBendersky I found this article to be helpful (http://eli.thegreenplace.net/2011/07/03/parsing-c-in-python-with-clang) ;) – Klik Feb 06 '17 at 07:52
2

If you've formatted your comments in a compatible way, Doxygen does a fantastic job. It'll even draw inheritance diagrams if you've got Graphviz installed.

For example, running doxygen over the following:

/// <summary>
/// A summary of my class
/// </summary>
class MyClass
{
protected:
    int m_numOfWidgets; ///< Keeps track of the number of widgets stored

public:
    /// <summary>
    /// Constructor for the class.
    /// </summary>
    /// <param name="numOfWidgets">Specifies how many widgets to start with</param>
    MyClass(int numOfWidgets)
    {
        m_numOfWidgets = numOfWidgets;
    }

    /// <summary>
    /// Increments the number of widgets stored by the amount supplied.
    /// </summary>
    /// <param name="numOfWidgetsToAdd">Specifies how many widgets to add</param>
    /// <returns>The number of widgets stored</returns>
    int IncreaseWidgets(int numOfWidgetsToAdd)
    {
        m_numOfWidgets += numOfWidgetsToAdd;
        return m_numOfWidgets;
    }
};

Will turn all those comments into entries in .html files. With more complicated designs, the result is even more beneficial - often much easier than trying to browse through the source.
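Doxygen can also write XML instead of (or alongside) HTML when `GENERATE_XML = YES` is set in the Doxyfile, which is easier to consume from Python. The sketch below reads the generated `index.xml`, assuming its usual layout of `compound` elements containing `member` elements. The `SAMPLE_INDEX` text and its `refid` values are hand-written for illustration, and `list_classes` is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def list_classes(index_xml_text):
    """Return {class_name: {member_kind: [member_name, ...]}}."""
    root = ET.fromstring(index_xml_text)
    classes = {}
    for compound in root.findall("compound"):
        # The index also lists files, namespaces, etc.; keep classes only.
        if compound.get("kind") not in ("class", "struct"):
            continue
        members = {}
        for member in compound.findall("member"):
            kind = member.get("kind")  # e.g. "function" or "variable"
            members.setdefault(kind, []).append(member.findtext("name"))
        classes[compound.findtext("name")] = members
    return classes


# Hand-written sample in the assumed index.xml layout.
SAMPLE_INDEX = """<doxygenindex>
  <compound refid="classMyClass" kind="class"><name>MyClass</name>
    <member refid="classMyClass_1a1" kind="variable"><name>m_numOfWidgets</name></member>
    <member refid="classMyClass_1a2" kind="function"><name>MyClass</name></member>
    <member refid="classMyClass_1a3" kind="function"><name>IncreaseWidgets</name></member>
  </compound>
  <compound refid="MyClass_8h" kind="file"><name>MyClass.h</name></compound>
</doxygenindex>"""

if __name__ == "__main__":
    print(list_classes(SAMPLE_INDEX))
```

Inheritance is not in the index itself; as far as I know it is recorded in the per-class XML files (in `basecompoundref` elements), which can be read the same way.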

albert
Jon Cage
1

This page shows a C++ grammar written in ANTLR, and you can generate Python code from it.

There also seems to be someone who was working on a C++ parser in pyparsing, but I was not able to find out who or its current status.

Kathy Van Stone
  • It's not possible to have a fully working C++ grammar in ANTLR, or, indeed, virtually any other grammar description language. C++ grammar is not context-free. Due to things like template metaprogramming, parsing C++ effectively requires writing an interpreter of a Turing-complete language just to be able to distinguish variable declarations from expressions. – Pavel Minaev Sep 18 '09 at 22:04
  • @Pavel: You can have a perfectly fine C++ parser using context-free grammar rules, if you have a decent parser. You don't have to resolve names and types during parsing; see the DMS Toolkit answer for a full C++ parser that does exactly what you say can't be done. – Ira Baxter Apr 21 '10 at 07:14
  • @Ira: in some contexts, if you don't resolve the type, you don't know what something is. For example, consider: `a::b<c>d`, where `a` is another class template with specializations, one of which defines `b` as another class template, and the second one defines `b` as an enum member. Depending on which specialization is picked (i.e. on the size of `int`), the whole thing is either a declaration of variable `d`: `a::b<c> d` - or it is an expression: `a::b < c > d`. So now we have perfectly conformant ISO C++ code which is also implementation-dependent. – Pavel Minaev Apr 22 '10 at 15:58
  • Now, if that code is inside a class template, and `sizeof(int)` is replaced by `sizeof(T)` - or some even more complex compile-time expression that is ultimately dependent on a template parameter - you'll have to completely evaluate that expression in order to produce unambiguous output. Since said expression can use all TMP tricks in the book, you'll have to write code to fully process C++ template instantiations, complete with specializations, function overloading rules (consider `a`), and so on. And if you just report ambiguity - well, that isn't a "full C++ parser"... – Pavel Minaev Apr 22 '10 at 16:01
  • @Pavel: A pure *parser* for C++ can build ASTs just fine without doing any name/type resolution. You're correct in that the ASTs have to capture the ambiguity where the pure syntax rules can't distinguish (and DMS does that). One can resolve the ambiguity in the parse trees to produce a final clean AST by a later pass (and that's how DMS does it). The advantage is that pure parsing and symbol table resolution are kept as separate, modular passes and that makes it far easier to build a working "full C++ parser". The ANTLR version tangles these together, making it too complex to be reliable. – Ira Baxter Apr 23 '10 at 05:14
  • @Pavel: This isn't a theory answer. DMS has been used to carry out complex transformations on two large-scale C++ systems, using its C++ parser. It really does work fine, even for the kinds of examples you suggest. – Ira Baxter Apr 23 '10 at 05:56
  • Ira: You make me very curious about DMS – Viet Mar 05 '13 at 00:29
1

There is no (free) good library to parse C++ in any language.
Your best choices are probably Dehydra g++ plugin, clang, or Elsa.

Employed Russian
1

Here's a SourceForge project that claims to parse C++ headers. As the other commenters have pointed out, there's no general solution, but this sounds like it will do enough for your needs. (I just ran across it for a similar need and haven't tried it myself yet.)

http://sourceforge.net/projects/cppheaderparser/

Bill
0

I would keep an eye on gcc.gnu.org/wiki/plugins, as plugins seem to be the way to go. The gcc-python-plugin also looks like a nice implementation.

stephenmm
0

The pyparsing wiki shows this example - all it does is parse struct declarations, so this might give you just a glimpse at the magnitude of the problem.

I suggest you (or even better, your employer) shell out $200 and buy Enterprise Architect from Sparx Systems. This software is amazingly powerful for the price, and includes pretty good code reverse-engineering features. You will spend far more than this in your own time to get only about 2% of the job done. In this case, "buy" wins over "make".

PaulMcG
0

The ctypes code generator (ctypeslib) uses gcc-xml for code generation. It's possible that cpptypes does also. Even if it doesn't, you could use gcc-xml to generate XML from your C++ file, then parse the XML with one of the built-in or third-party Python XML parsers.

Jason R. Coombs
0

The Clang project provides libraries for just parsing C++ code.

With either Clang or GCC you can generate an XML representation of the code.

If you prefer a more Pythonic solution, you could also search for a C++ yacc grammar and use PLY (a lex/yacc implementation for Python), but that seems to be the solution that needs the most work.

SystematicFrank