53

What are some good tools for getting a quick start for parsing and analyzing C/C++ code?

In particular, I'm looking for open source tools that handle the C/C++ preprocessor and language. Preferably, these tools would use lex/yacc (or flex/bison) for the grammar, and not be too complicated. They should handle the latest ANSI C/C++ definitions.

Here's what I've found so far, though I haven't looked at any of them in detail (thoughts?):

  • CScope - Old-school C analyzer. Doesn't seem to do a full parse, though. Described as a glorified 'grep' for finding C functions.
  • GCC - Everybody's favorite open source compiler. Very complicated, but seems to do it all. There's a related project for creating GCC extensions called GEM, but hasn't been updated since GCC 4.1 (2006).
  • PUMA - The PUre MAnipulator. (from the page: "The intention of this project is to provide a library of classes for the analysis and manipulation of C/C++ sources. For this purpose PUMA provides classes for scanning, parsing and of course manipulating C/C++ sources."). This looks promising, but hasn't been updated since 2001. Apparently PUMA has been incorporated into AspectC++, but even this project hasn't been updated since 2006.
  • Various C/C++ raw grammars. You can get c-c++-grammars-1.2.tar.gz, but this has been unmaintained since 1997. A little Google searching pulls up other basic lex/yacc grammars that could serve as a starting place.
  • Any others?

I'm hoping to use this as a starting point for translating C/C++ source into a new toy language.

Thanks! -Matt

(Added 2/9): Just a clarification: I want to extract semantic information from the preprocessor in addition to the C/C++ code itself. I don't want "#define foo 42" to disappear into the integer "42", but to remain attached to the name "foo". This, unfortunately, excludes several solutions that run the preprocessor first and only deliver the C/C++ parse tree.
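
One way to keep "#define foo 42" attached to the name "foo" is to record the directives in a separate pass before any preprocessor runs. The following is a minimal Python sketch of that idea, not a real preprocessor: `collect_defines` and `DEFINE_RE` are hypothetical names, and the sketch ignores comments, `#undef`, and conditional compilation.

```python
import re

# Hypothetical pattern: matches object-like and function-like #define
# directives on one logical line. A parameter list only counts if the
# "(" follows the macro name with no whitespace in between.
DEFINE_RE = re.compile(
    r'^[ \t]*#[ \t]*define[ \t]+([A-Za-z_]\w*)(\([^)]*\))?[ \t]*(.*)$'
)


def collect_defines(source):
    """Return {macro name: (parameter list or None, replacement text)}."""
    # Join backslash-continued lines first, as the preprocessor does.
    joined = source.replace("\\\n", "")
    macros = {}
    for line in joined.splitlines():
        m = DEFINE_RE.match(line)
        if m:
            name, params, body = m.group(1), m.group(2), m.group(3).strip()
            macros[name] = (params, body)
    return macros


sample = "\n".join([
    "#define foo 42",
    "#define MAX(a, b) ((a) > (b) ? (a) : (b))",
    "int x = foo;",
])

macros = collect_defines(sample)
for name, (params, body) in macros.items():
    print(name, params, body)
```

A later parsing stage could consult such a table, so that a use of `foo` stays linked to its definition instead of becoming a bare `42`.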

leppie
Matt Ball
  • Matt, I think that's kind of a forlorn hope then; the preprocessor by definition works on the source BEFORE it gets to the analysis. At least the old pipeline compilers had the preproc'd source in a pipe before parsing by the first pass. Maybe you could use the cpp embedded comments? – Charlie Martin Feb 09 '09 at 22:51
  • You could run your own processor on the source. It would output an annotated source. You would need to modify the C++ grammar your tool would use to read in these annotations. Hey C++ is involved, you know this wasn't going to be easy :) – Sean McCauliff Feb 10 '09 at 06:47
  • 2
    Viewed 42,000 times? I think this should be re-opened. If you, the reader, agree, then click "re-open" above. – Ira Baxter Sep 22 '16 at 17:09
  • I believe that this question should be re-opened. All "best practice related questions" are marked as off topic, but some might have technical dimension, objective reasons; this is not a subjective, personal problem. – tolga May 25 '19 at 08:10

14 Answers

37

Parsing C++ is extremely hard because the grammar is undecidable. To quote Yossi Kreinin:

Outstandingly complicated grammar

"Outstandingly" should be interpreted literally, because all popular languages have context-free (or "nearly" context-free) grammars, while C++ has undecidable grammar. If you like compilers and parsers, you probably know what this means. If you're not into this kind of thing, there's a simple example showing the problem with parsing C++: is AA BB(CC); an object definition or a function declaration? It turns out that the answer depends heavily on the code before the statement - the "context". This shows (on an intuitive level) that the C++ grammar is quite context-sensitive.

Adam Rosenfield
  • Nice link of Yossi. Very refreshing blog! – Ketan Feb 09 '09 at 04:07
  • 16
    Like so many of his quotes, this one is wrong too. "Undecidable" means you cannot tell *at all* what A B (C); means. In reality, with context you know if and how A, B and/or C are defined earlier. This makes it trivially decidable. You just need to know if C is a type or an expression. – MSalters Feb 09 '09 at 10:12
  • And this "with context you know if and how A, B and/or C are defined earlier" has been countered already: [http://yosefk.com/c%2B%2Bfqa/web-vs-c++.html#misfeature-2], [http://www.reddit.com/r/programming/comments/5z7jr/c_frequently_questioned_answers/c02bmog]; – Nietzche-jou Feb 09 '09 at 12:34
  • 5
    He should have used "AA * BB(CC)". This can be (1) a function declaration, (2) an object declaration or (3) a multiplication. – Richard Corden Feb 10 '09 at 10:12
  • 7
    Doesn't really matter. Yossi assumes (incorrectly) that to be _decidable_, any grammar must be parseable by a two-stage parser without feedback from the second stage into the first. For his second argument (infinite recursion) the same applies. "Undecidable" would mean no parser can detect recursion. – MSalters Feb 10 '09 at 14:22
  • This type of discussion is exactly what's motivating my question. I want a new language that is isomorphically equivalent to C++ but is easy to statically analyze and lex from an editor. – Matt Ball Feb 10 '09 at 14:26
  • I don't want to put words into Yossi's mouth, but what I think he's trying to say is that the problem of determining whether a given C++ program is _syntactically_ valid is impossible without using semantic information. This is not true of most other languages. – Adam Rosenfield Feb 10 '09 at 15:52
  • @Adam Rosenfield: Really? How do you define _syntactically valid_? What do you think about C and Perl in this regard? – jpalecek Aug 07 '10 at 23:21
  • 12
    Yossi seems to be severely confused about what _undecidable_, _context-sensitive_ and _ambiguous_ means. – jpalecek Aug 07 '10 at 23:22
  • @jpalecek: A program is syntactically valid if it is well-formed according to the language rules (where in this case, _well-formed_ is defined by the C/C++ ISO standards). One way to test for well-formedness is to compile the source file with a standards-conforming compiler (in standards mode). If compilation succeeds, it is well-formed; otherwise, it is not. The command `gcc -ansi -c source.c && echo "Well-formed" || echo "Not well-formed"` is such a test for C. Perl is the same way, just use the Perl standard instead of the C standard. – Adam Rosenfield Aug 11 '10 at 04:34
  • 1
    @Adam Rosenfield: I agree with this definition; however, this means that your proposition "determining whether a given XXX program is syntactically valid is impossible without using semantic information" is true for any language that requires variables to be declared, that is C, C++, Pascal, Perl, Java, etc. – jpalecek Aug 11 '10 at 14:05
  • @jpalecek: Let me take a step backwards and redefine _syntactically valid_. Let _well-formed_ mean the same thing as before. Define syntactically valid to mean that the program source can be uniquely parsed into an abstract syntax tree (AST) according to the language's rules. With this definition, being syntactically valid is a weaker condition than being well-formed, since the program `main() { return x; }` is syntactically valid but not well-formed, since `x` is undefined. If a language has a decidable syntax, you can parse it and compile it in two completely separate steps. – Adam Rosenfield Aug 12 '10 at 04:35
  • (continuing) C and C++ do not have decidable syntaxes in this respect (though C is very close). In C, the statement `a * b;` cannot be decidably parsed without any semantic information: it could be a variable declaration of a pointer (if `a` is a type name) or a multiplication (one that does not do anything useful, but it is valid nonetheless). I don't know for sure, but I believe that Scheme, Common Lisp, and other Lisp variants are decidable languages. – Adam Rosenfield Aug 12 '10 at 04:45
  • 1
    I'm a little late to this thread, but none of this matters. One can build parsers for C++ that produce abstract syntax trees possibly with some ambiguous subtrees using GLR or GLL parsers. And that takes care of the problem of "parsing C++". It is only hard if you use a weak parsing technology. True, you need to resolve any ambiguities somehow, e.g., with information about symbol types, but you can do that at a later time, or you can do it in the semantic check for the troublesome productions (this latter being essentially the hack that GCC used when its C++ parser was LALR based). – Ira Baxter Jun 01 '12 at 10:07
  • In practical terms, context-sensitivity of the *language spec* (e.g. a grammar) doesn't always imply context sensitivity of the particular implementation of a *parser*. A very simple example is that of a lexer that performs symbol table look-ups and returns tokens that are classified by what symbol table they appear in. You can then write a context-free parser, yet the end product has context-sensitive grammar. – Kuba hasn't forgotten Monica Jan 18 '16 at 05:08
  • This SO answer shows specifically that one can parse C++ without any type information: http://stackoverflow.com/a/37506227/120163 – Ira Baxter Jun 04 '16 at 21:05
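
The symbol-table lookup described in the comments above can be illustrated with a toy example. This is only a sketch under simplifying assumptions: `tokenize` and `classify` are hypothetical helpers, the set of type names is supplied by hand rather than built from earlier declarations, and only the single statement shape `a * b;` is recognized.

```python
import re

# Identifiers, or any single non-space character as punctuation.
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\S')


def tokenize(text, type_names):
    """Yield (kind, spelling) pairs. Identifiers found in type_names are
    reported as TYPE_NAME instead of IDENT; this is the lexer feedback."""
    for spelling in TOKEN_RE.findall(text):
        if re.match(r'[A-Za-z_]', spelling):
            kind = "TYPE_NAME" if spelling in type_names else "IDENT"
        else:
            kind = "PUNCT"
        yield kind, spelling


def classify(statement, type_names):
    """Decide how a statement of the shape `a * b;` should be read."""
    tokens = list(tokenize(statement, type_names))
    # Keep the spelling for punctuation and the kind for everything else.
    shape = [s if k == "PUNCT" else k for k, s in tokens]
    if shape == ["TYPE_NAME", "*", "IDENT", ";"]:
        return "declaration"   # pointer declaration: a *b;
    if shape == ["IDENT", "*", "IDENT", ";"]:
        return "expression"    # multiplication whose result is discarded
    return "unknown"


print(classify("a * b;", type_names={"a"}))  # prints: declaration
print(classify("a * b;", type_names=set()))  # prints: expression
```

The same token sequence gets two different readings depending only on the symbol table, which is why the parser proper can stay context-free once the lexer has classified the identifiers.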
21

You can look at Clang, the C/C++ front end of the LLVM project.

It now supports C++ fully.

epatel
  • 4
    Update: "Clang currently implements all of the ISO C++ 1998 standard (including the defects addressed in the ISO C++ 2003 standard) except for 'export' (which has been removed from the C++'0x draft) and is considered a production-quality C++ compiler" Date: 2011-07-27 http://clang.llvm.org/cxx_status.html – Grzegorz Wierzowiecki Jul 30 '11 at 13:08
17

The ANTLR parser generator has a grammar for C/C++ as well as the preprocessor. I've never used it so I can't say how complete its parsing of C++ is going to be. ANTLR itself has been a useful tool for me on a couple of occasions for parsing much simpler languages.

Sean McCauliff
  • Mod up for mentioning ANTLR. I had looked at this a little while back, but forgot about using it as a lex/yacc replacement. If the C/C++ grammar is good, this may be my favorite path... – Matt Ball Feb 09 '09 at 12:57
  • 12
    I don't know why this is the accepted answer now, or why it was accepted originally. The ANTLR grammar for C++ has never been used in practice, as far as I know and I keep track of stuff like this. The author of the grammar left footprints in the docs saying, "Its incomplete, I'm done with it, you can patch it up if you want". C++98 is a tough language, and C++11 is worse, and then there's a bunch of dialects (GCC, Microsoft, Sun, ...). If you don't have the parser right, what you have is just useless. Then you need full name and type resolution to do anything. Nothing here for that. – Ira Baxter May 08 '12 at 22:25
16

Depending on your problem, GCCXML might be your answer. Basically, it parses the source using GCC and then gives you easily digestible XML of the parse tree. With GCCXML you are done once and for all.

Łukasz Lew
  • 1
    Since it doesn't actually dump templates (only template instantiations) it's quite severely lacking, especially in the one area that's causing most parsing problems. See e.g. the keyword 'typename' inside templates. – MSalters Feb 09 '09 at 10:17
  • 1
    This is a very good link and suggestion, but in my particular case it doesn't quite work because I need to extract semantic information from the preprocessor. GCCXML operates on the resulting tree after the preprocessing magic is done. Also, it looks like this project hasn't been updated recently. – Matt Ball Feb 10 '09 at 14:21
  • 1
    gccxml is quite old now (2004!). Wish they'd update it! – Nick May 11 '09 at 13:37
14

pycparser is a complete parser for C (C99) written in Python. It has a fully configurable AST backend, so it can be used as a basis for any kind of language processing you might need.

It doesn't support C++, though. Granted, C++ is much harder to parse than C.


Update (2012): at this time the answer, without any doubt, would be Clang - it's modular, supports the full C++ (with many C++11 features) and has a relatively friendly code base. It also has a C API for bindings to high-level languages (e.g. for Python).

Eli Bendersky
8

Have a look at how doxygen works, full source code is available and it's flex-based.

A misleading candidate is GOLD which is a free Windows-based parser toolkit explicitly for creating translators. Their list of supported languages refers to the languages in which one can implement parsers, not the list of supported parse grammars.

They only have grammars for C and C#, no C++.

albert
Andy Dent
  • I'm hoping to use non-Windows platforms (Mac, Linux, or Solaris), but I do have a Windows system. I've used Doxygen before, and would like to take a closer look under the hood. – Matt Ball Feb 10 '09 at 14:28
  • My understanding is that Gold is an LALR parser generator. That won't parse C++. – Ira Baxter Jun 24 '09 at 03:49
  • Dammit, you're right. Their list of "supported languages" is about the languages which can call the parser, not the parseable languages. – Andy Dent Jun 25 '09 at 05:25
7

Parsing C++ is a very complex challenge.

There's the Boost/Spirit framework, and a couple of years ago they did play with the idea of implementing a C++ parser, but it's far from complete.

Fully and properly parsing ISO C++ is far from trivial, and there have in fact been many related efforts. But it is an inherently complex job that isn't easily accomplished without rewriting a full compiler frontend that understands all of C++ and the preprocessor. A preprocessor implementation called "wave" is available from the Spirit folks.

That said, you might want to have a look at pork/oink (elsa-based), a C++ parser toolkit specifically meant for source code transformation. It is being used by the Mozilla project to do large-scale static source code analysis and automated code rewriting. The most interesting part is that it supports not only most of C++, but also the preprocessor itself!

On the other hand there's indeed one single proprietary solution available: the EDG frontend, which can be used for pretty much all C++ related efforts.

Personally, I would check out the elsa-based pork/oink suite which is used at Mozilla. Apart from that, the FSF has now approved work on gcc plugins using the runtime library license, so I'd assume that things are going to change rapidly once people can easily leverage the gcc-based C++ parser for such purposes using binary plugins.

So, in a nutshell: if you have the bucks, EDG; if you need something free/open source now, elsa/oink are fairly promising; if you have some time, you might want to use gcc for your project.

Another option just for C code is cscout.

none
6

The grammar for C++ is sort of notoriously hairy. There's a good thread at Lambda about it, but the gist is that C++ grammar can require arbitrarily much lookahead.

For the kind of thing I imagine you might be doing, I'd think about hacking either Gnu CC, or Splint. Gnu CC in particular does separate out the language generation part pretty thoroughly, so you might be best off building a new g++ backend.

Charlie Martin
  • Good to hear from you Charlie, and in such a random place! My main motivation is that C++ is hairy, and is hard to statically analyze and provide code-sense while editing. I'd like a new language that is easy to analyze but is isomorphically equivalent to C++ – Matt Ball Feb 10 '09 at 14:25
  • 1
    You're not going to get "easy to analyze" and "isomorphic to C++". C++ regardless of its syntax is hard to analyze. The best you can hope for is some kind of analysis tools for C++ itself. – Ira Baxter Jul 04 '09 at 18:56
4

Actually, PUMA and AspectC++ are still both actively maintained and updated. I was looking into using AspectC++ and was wondering about the lack of updates myself. I e-mailed the author who said that both AspectC++ and PUMA are still being developed. You can get to source code through SVN https://svn.aspectc.org/repos/ or you can get regular binary builds at http://akut.aspectc.org. As with a lot of excellent c++ projects these days, the author doesn't have time to keep up with web page maintenance. Makes sense if you've got a full time job and a life.

Brett Rossier
3

See our C++ Front End for a full-featured C++ parser: builds ASTs, symbol tables, does name and type resolution. You can even parse and retain the preprocessor directives. The C++ front end is built on top of our DMS Software Reengineering Toolkit, which allows you to use that information to carry out arbitrary source code changes using source-to-source transformations.

DMS is the ideal engine for implementing such a translator.

Having said that, I don't see much point in your imagined task; I don't see much value in trying to replace C++, and you'll find building a complete translator an enormous amount of work, especially if your target is a "toy" language. And there is likely little point in parsing C++ using a robust parser, if its only purpose is to produce an isomorphic version of C++ that is easier to parse (wait, we postulated a robust C++ already!).

EDIT May 2012: DMS's C++ front end now handles GCC3/GCC4/C++11,Microsoft VisualC 2005/2010. Robustly.

EDIT Feb 2015: Now handles C++14 in GCC and MS dialects.

EDIT August 2015: Now parses and captures both the code and the preprocessor directives in a unified tree.

EDIT May 2020: Has been doing C++17 for the past few years. C++20 in process.

Ira Baxter
  • 5
    Repeat answer from another source of question http://stackoverflow.com/questions/792454/most-effective-way-to-parse-c-like-definition-strings...shameless plug for promotion of commercial software.... – t0mm13b Jan 24 '10 at 12:07
  • 1
    Repeated class of question: "How can I build a complicated language processor (easily)?" a) you can't, b) this engine is designed to make this as easy as practical (I didn't say easy). – Ira Baxter Jan 24 '10 at 16:41
  • 4
    Quoting the OP: `In particular, I'm looking for open source tools` - is DMS open source?. – Morten Jensen Nov 13 '15 at 13:46
  • DMS isn't open source, that is clearly indicated by the SO-approved phrasing "our XXXX" even if you don't happen to know that. I didn't see any point in addressing the open-source tools as others have already done here. Pretty much they are complete failures at OP's intended task, so he'd have to look elsewhere, e.g., commercial. We're the only commercial (or in fact any kind of) tool I know that could have satisfied his need, *including* the ability to capture preprocessor directives so the answer IMHO was relevant. – Ira Baxter Jan 13 '22 at 15:34
  • Second, while my answer pointed out one of the only real solutions for OP, my answer was focused more on his intentions and desired results. He isn't realistically going to build an alternative a) because he doesn't have the energy to do it, and b) because it has no chance in hell of displacing real C++. [Yes, this response is rather late, but Morten's comment bothered me.] – Ira Baxter Jan 13 '22 at 15:37
3

How about something easier to comprehend, like tiny-C or Small C?

Scott Evernden
3

Elsa beats everything else I know hands down for C++ parsing, even though it is not 100% compliant. I'm a fan. There's a module that prints out C++, so that may be a good starting point for your toy project.

user52875
  • When I tried it on my C++ files, it stopped, saying this is not implemented or something like that. – Aftershock Jun 17 '10 at 07:28
  • These tools appear based on dates at the site to have last been updated in 2005; the author claims "attempts to parse C++ (as) defined by the C++03 spec". It depends on something else to do preprocessing. – Ira Baxter Mar 01 '15 at 09:06
  • Ira, today I'd recommend clang without doubt. It wasn't good enough in 2009 yet. – user52875 Mar 03 '15 at 08:54
1

What about using a tool like GNU's cflow, which can analyse the code and produce call graphs? Here's what the Open Group man page has to say about cflow. The GNU version of cflow comes with source and is open source too.

Hope this helps, Best regards, Tom.

t0mm13b
1

A while back I attempted to write a tool that would automatically generate unit tests for C files.

For preprocessing I put the files through GCC. The output is ugly, but you can easily trace where each line of the preprocessed file came from in the original code. For your needs, though, you might need something else.
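
GCC's preprocessed output (`gcc -E`) contains linemarkers of the form `# linenum "filename" flags`, and these are what make the tracing possible. Here is a minimal Python sketch of reading them back, assuming only that marker format; `map_lines` is a hypothetical helper and the sample text is made up for illustration.

```python
import re

# A linemarker says: the next line came from line <linenum> of <filename>.
LINEMARKER_RE = re.compile(r'^#\s+(\d+)\s+"([^"]*)"')


def map_lines(preprocessed):
    """Yield (file, original_line, text) for each non-marker line."""
    current_file, current_line = None, 0
    for text in preprocessed.splitlines():
        m = LINEMARKER_RE.match(text)
        if m:
            # Reset the position; the marker itself is not source text.
            current_line = int(m.group(1))
            current_file = m.group(2)
            continue
        yield current_file, current_line, text
        current_line += 1


# Made-up sample resembling `gcc -E` output for a file that includes a header.
sample = "\n".join([
    '# 1 "demo.c"',
    '# 1 "util.h" 1',
    'int helper(void);',
    '# 2 "demo.c" 2',
    'int main(void) {',
    '    return helper();',
    '}',
])

for file, line, text in map_lines(sample):
    print(f"{file}:{line}: {text}")
```

Note that this only recovers file and line positions. Macro names are already expanded at this point, so it does not help when the macro definitions themselves must be preserved.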

I used Metre as the base for a C parser. It is open source and uses lex and yacc. This made it easy to get up and running in a short time without fully understanding lex & yacc.

I also wrote a C app since the lex & yacc solution could not help me trace functionality across functions and parse the structure of the entire function in one pass. It became unmaintainable in a short time and was abandoned.

Gerhard