1

I want to have a converter between this

#include <ololo.h>
#ifdef HAVE_QQQ
  #include <qqq.h>
#endif

char* ololize(char* s) {
   #ifdef HAVE_QQQ
      return qqq(s);
   #else
      return ololo(s);
   #endif
}

and something like this

(include_angular "ololo.h")
(p_ifdef "HAVE_QQQ"
 (include_angular "qqq.h"))
(define_function "ololize" [(ptr char) "s"] (ptr char)
 (p_ifdef "HAVE_QQQ"
  (return (qqq s))
  :else
  (return (ololo s)))))

I.e. representation of a source code as a easily manageable tree, not from compiler's point of view, but from programmer's point of view.

I don't expect 100% correct work, but it should work for most existing source files. Bonus points if I can "round-trip" the code to tree and back.

Are there any existing tools or libraries for that?

Agnius Vasiliauskas
  • 10,935
  • 5
  • 50
  • 70
Vi.
  • 37,014
  • 18
  • 93
  • 148
  • Looks like you want a [Recursive descent parser](http://en.wikipedia.org/wiki/Recursive_descent_parser), have fun with it! – Kninnug Apr 12 '13 at 11:30
  • 2
    Clang may have tools you can use: [CLang](http://clang.llvm.org) – Steve Fallows Apr 12 '13 at 11:50
  • Flex (http://flex.sourceforge.net/manual/) and Bison (http://www.gnu.org/software/bison/manual/bison.html)! – Josh Apr 12 '13 at 13:58
  • @Kninnug, Josh: Building a C++ parser that works on real code is extremely difficult, due to the sheer size of the langauge, and the variations among the vendors of such compilers. Recursive descent and Bison (in its LALR form) will have trouble with the ambiguous grammar just for starters. Kids: don't do this at home. – Ira Baxter Apr 13 '13 at 20:37
  • Possible duplicate - [AST from C code](http://stackoverflow.com/questions/239722/ast-from-c-code) – Agnius Vasiliauskas Apr 13 '13 at 20:46
  • @0x69: OP asked for C/C++ in his subject line, not just C. – Ira Baxter Apr 13 '13 at 21:01

1 Answers1

1

Our DMS Software Reengineering Toolkit and its C++ front end can do this. DMS provides language-precise parsing (including handling GCC and MS dialects as well as C++11), and builds ASTs. Depending on how it is configured, it can also build full symbol tables, and presently can do control flow analysis for C++ (but not quite yet for C++11).

From the internal AST, DMS can regenerate legal source that will produce the same compiled result, either prettyprinted or preserving space ("fidelity mode") almost exactly. We can also ask the AST be exported as XML.

For OP's small program, here is the AST rendered as XML directly from our GCC4 dialect parser (there's a "PrintASTasXML" function in the DMS libraries). Note the AST contains the INCLUDE and preprocessor conditionals.

<?xml version="1.1" encoding="UTF-8"?>
<!-- Using DMS PrintASTasXML (v.1.00) -->
<!-- XML generated on 2013/04/13 15:24:44 -->
<DMSForest>
  <tree node="translation_unit" type="2" domain="1" id="1iity" parents="0" line="1" column="1" file="1">
<tree node="declaration_seq" type="994" domain="1" id="1iitt" line="1" column="1" file="1">
  <tree node="declaration_seq" type="994" domain="1" id="1iepb" line="1" column="1" file="1">
    <tree node="control_line" type="2133" domain="1" id="1ieos" line="1" column="1" file="1">
      <tree node="'#'" type="2908" domain="1" id="1ieoi" literal="0" line="1" column="1" file="1"/>
      <tree node="'include'" type="2759" domain="1" id="1ieok" literal="0" line="1" column="2" file="1"/>
      <tree node="ANGLED_HEADER_NAME" type="2951" domain="1" id="1ieom" line="1" column="10" file="1">
    <literal>ololo.h</literal>
      </tree>
      <tree node="new_line" type="2946" domain="1" id="1ieoo" literal="0" line="1" column="19" file="1"/>
    </tree>
    <tree node="pp_declaration_seq" type="997" domain="1" id="1ieph" line="2" column="1" file="1">
      <tree node="if_directive" type="2113" domain="1" id="1iep4" line="2" column="1" file="1">
    <tree node="'#'" type="2908" domain="1" id="1iep1" literal="0" line="2" column="1" file="1"/>
    <tree node="'ifdef'" type="2756" domain="1" id="1ieov" literal="0" line="2" column="2" file="1"/>
    <tree node="IDENTIFIER" type="2646" domain="1" id="1ieol" line="2" column="8" file="1">
      <literal>HAVE_QQQ</literal>
    </tree>
    <tree node="new_line" type="2946" domain="1" id="1ieoz" literal="0" line="2" column="16" file="1"/>
      </tree>
      <tree node="control_line" type="2133" domain="1" id="1iepi" line="3" column="3" file="1">
    <tree node="'#'" type="2908" domain="1" id="1iep6" literal="0" line="3" column="3" file="1"/>
    <tree node="'include'" type="2759" domain="1" id="1iep8" literal="0" line="3" column="4" file="1"/>
    <tree node="ANGLED_HEADER_NAME" type="2951" domain="1" id="1iepa" line="3" column="12" file="1">
      <literal>qqq.h</literal>
    </tree>
    <tree node="new_line" type="2946" domain="1" id="1iepe" literal="0" line="3" column="19" file="1"/>
      </tree>
      <tree node="endif_directive" type="2117" domain="1" id="1ieoy" line="4" column="1" file="1">
    <tree node="'#'" type="2908" domain="1" id="1iepl" literal="0" line="4" column="1" file="1"/>
    <tree node="'endif'" type="2743" domain="1" id="1iepk" literal="0" line="4" column="2" file="1"/>
    <tree node="new_line" type="2946" domain="1" id="1iepn" literal="0" line="4" column="7" file="1"/>
      </tree>
    </tree>
  </tree>
  <tree node="function_definition" type="1616" domain="1" id="1iito" line="6" column="1" file="1">
    <tree node="function_head" type="1628" domain="1" id="1iiow" line="6" column="1" file="1">
      <tree node="simple_type_specifier" type="1104" domain="1" id="1iep9" line="6" column="1" file="1">
    <tree node="'char'" type="2723" domain="1" id="1iepd" literal="0" line="6" column="1" file="1"/>
      </tree>
      <tree node="ptr_declarator" type="1398" domain="1" id="1iio3" line="6" column="5" file="1">
    <tree node="ptr_operator" type="1436" domain="1" id="1iepq" line="6" column="5" file="1">
      <tree node="'*'" type="2903" domain="1" id="1iep7" literal="0" line="6" column="5" file="1"/>
    </tree>
    <tree node="noptr_declarator" type="1402" domain="1" id="1iioc" line="6" column="7" file="1">
      <tree node="IDENTIFIER" type="2646" domain="1" id="1iepm" line="6" column="7" file="1">
        <literal>ololize</literal>
      </tree>
      <tree node="'('" type="2887" domain="1" id="1iepr" literal="0" line="6" column="14" file="1"/>
      <tree node="parameter_declaration" type="1591" domain="1" id="1iion" line="6" column="15" file="1">
        <tree node="simple_type_specifier" type="1104" domain="1" id="1iioe" line="6" column="15" file="1">
          <tree node="'char'" type="2723" domain="1" id="1iio0" literal="0" line="6" column="15" file="1"/>
        </tree>
        <tree node="ptr_declarator" type="1398" domain="1" id="1iiom" line="6" column="19" file="1">
          <tree node="ptr_operator" type="1436" domain="1" id="1iiof" line="6" column="19" file="1">
        <tree node="'*'" type="2903" domain="1" id="1iio1" literal="0" line="6" column="19" file="1"/>
          </tree>
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iio8" line="6" column="21" file="1">
        <literal>s</literal>
          </tree>
        </tree>
      </tree>
      <tree node="')'" type="2888" domain="1" id="1iiol" literal="0" line="6" column="22" file="1"/>
      <tree node="function_qualifiers" type="1418" domain="1" id="1iio2" line="6" column="24" file="1"/>
    </tree>
      </tree>
    </tree>
    <tree node="compound_statement" type="873" domain="1" id="1iitn" line="6" column="24" file="1">
      <tree node="'{'" type="2940" domain="1" id="1iiov" literal="0" line="6" column="24" file="1"/>
      <tree node="statement" type="853" domain="1" id="1iitw" line="7" column="4" file="1">
    <tree node="if_directive" type="2113" domain="1" id="1iipc" line="7" column="4" file="1">
      <tree node="'#'" type="2908" domain="1" id="1iip4" literal="0" line="7" column="4" file="1"/>
      <tree node="'ifdef'" type="2756" domain="1" id="1iip2" literal="0" line="7" column="5" file="1"/>
      <tree node="IDENTIFIER" type="2646" domain="1" id="1iip5" line="7" column="11" file="1">
        <literal>HAVE_QQQ</literal>
      </tree>
      <tree node="new_line" type="2946" domain="1" id="1iip0" literal="0" line="7" column="19" file="1"/>
    </tree>
    <tree node="jump_statement" type="984" domain="1" id="1iisg" line="8" column="7" file="1">
      <tree node="'return'" type="2780" domain="1" id="1iiox" literal="0" line="8" column="7" file="1"/>
      <tree node="$NONTERMINALAMBIGUITY" type="2999" nonterminalname="postfix_expression" nonterminaltype="402" domain="1" id="1iiou" children="2" line="8" column="14" file="1">
        <tree node="postfix_expression" type="380" domain="1" id="1iipi" line="8" column="14" file="1">
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iip8" parents="2" line="8" column="14" file="1">
        <literal>qqq</literal>
          </tree>
          <tree node="'('" type="2887" domain="1" id="1iip6" parents="2" literal="0" line="8" column="17" file="1"/>
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iipb" parents="2" line="8" column="18" file="1">
        <literal>s</literal>
          </tree>
          <tree node="')'" type="2888" domain="1" id="1iip1" parents="2" literal="0" line="8" column="19" file="1"/>
        </tree>
        <tree node="postfix_expression" type="368" domain="1" id="1iipk" line="8" column="14" file="1">
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iip8" parents="2" alreadyprinted="true"/>
          <tree node="'('" type="2887" domain="1" id="1iip6" parents="2" alreadyprinted="true"/>
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iipb" parents="2" alreadyprinted="true"/>
          <tree node="')'" type="2888" domain="1" id="1iip1" parents="2" alreadyprinted="true"/>
        </tree>
      </tree>
      <tree node="';'" type="2939" domain="1" id="1iisb" literal="0" line="8" column="20" file="1"/>
    </tree>
    <tree node="else_directive" type="2116" domain="1" id="1iisi" line="9" column="4" file="1">
      <tree node="'#'" type="2908" domain="1" id="1iism" literal="0" line="9" column="4" file="1"/>
      <tree node="'else'" type="2742" domain="1" id="1iisp" literal="0" line="9" column="5" file="1"/>
      <tree node="new_line" type="2946" domain="1" id="1iiso" literal="0" line="9" column="9" file="1"/>
    </tree>
    <tree node="jump_statement" type="984" domain="1" id="1iit5" line="10" column="7" file="1">
      <tree node="'return'" type="2780" domain="1" id="1iish" literal="0" line="10" column="7" file="1"/>
      <tree node="$NONTERMINALAMBIGUITY" type="2999" nonterminalname="postfix_expression" nonterminaltype="402" domain="1" id="1iio5" children="2" line="10" column="14" file="1">
        <tree node="postfix_expression" type="380" domain="1" id="1iit6" line="10" column="14" file="1">
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iisk" parents="2" line="10" column="14" file="1">
        <literal>ololo</literal>
          </tree>
          <tree node="'('" type="2887" domain="1" id="1iisu" parents="2" literal="0" line="10" column="19" file="1"/>
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iisv" parents="2" line="10" column="20" file="1">
        <literal>s</literal>
          </tree>
          <tree node="')'" type="2888" domain="1" id="1iit2" parents="2" literal="0" line="10" column="21" file="1"/>
        </tree>
        <tree node="postfix_expression" type="368" domain="1" id="1iiti" line="10" column="14" file="1">
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iisk" parents="2" alreadyprinted="true"/>
          <tree node="'('" type="2887" domain="1" id="1iisu" parents="2" alreadyprinted="true"/>
          <tree node="IDENTIFIER" type="2646" domain="1" id="1iisv" parents="2" alreadyprinted="true"/>
          <tree node="')'" type="2888" domain="1" id="1iit2" parents="2" alreadyprinted="true"/>
        </tree>
      </tree>
      <tree node="';'" type="2939" domain="1" id="1iit4" literal="0" line="10" column="22" file="1"/>
    </tree>
    <tree node="endif_directive" type="2117" domain="1" id="1iitp" line="11" column="4" file="1">
      <tree node="'#'" type="2908" domain="1" id="1iits" literal="0" line="11" column="4" file="1"/>
      <tree node="'endif'" type="2743" domain="1" id="1iitr" literal="0" line="11" column="5" file="1"/>
      <tree node="new_line" type="2946" domain="1" id="1iitq" literal="0" line="11" column="10" file="1"/>
    </tree>
      </tree>
      <tree node="'}'" type="2941" domain="1" id="1iitm" literal="0" line="12" column="1" file="1"/>
    </tree>
  </tree>
</tree>
  </tree>
  <FileIndex>
<File index="1">C:/temp/small.cpp</File>
  </FileIndex>
  <DomainIndex>
<Domain index="1">Cpp~GCC4</Domain>
  </DomainIndex>
</DMSForest>

It won't quite round-trip from the XML; there's no XML reader precofigured to build an AST. However, DMS is highly customizable and has an XML parser as option; it would be straightforward to read an XML tree, regenerate the C++ AST, and then invoke the prettyprinter.

I'm not quite sure what you mean "manageable from a programmer's point of view". This is a precise tree. If it contains too much detail, you are welcome to apply XSLT transforms as you see fit to simplify it, but you will likely lose semantic accuracy doing so. And you will likely lose the ability to round-trip, too.

We don't see much need for such XML exports; the DMS ecosystem by design provides a huge amount of infrastructure for analyzing/transforming programs (including C++ programs); we've done massive C++ source code parsing/transformation with DMS. So the need to do the XML export to do something useful isn't very high. We offer it anyway, because people always ask for it. To our surprise, we have some clients that actually use it.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • The main idea is to for programmer to be able to set up his own system of comlicated macros, keeping them private (i.e. not committing them to actual code/explaining everyone how to use them), and using sort of macro-expand and _macro-unexpand_ for bi-directional exchange between "official long boring" and "funny complicated private short" codes. – Vi. Apr 13 '13 at 20:53
  • 2
    Looks like "DMS Software Reengineering Toolkit" is 1. big, 2 proprietary, 3. complex. I'm looking for little simple things (comparable for just using Perl with a bunch of regexes); I expect that if trade correctness, you can buy simplicity. There can be "unparsed" pieces of code which should be just preserved while round-tripping. – Vi. Apr 13 '13 at 21:04
  • @Vi: DMS is all the things you say. What you can't get is "simple", "C++" and "accurate". And if you think regexes will help you with this, you are in for a very big surprise; you're in (bad?) good company; many other people think this is true too. It is patently false. If you trade "correctness" away, your tool won't work on any real piece of code. DMS is the way it is because the complexity is necessary and the engineering is correspondingly not cheap; I've get 18 years personally invested. And, it works. – Ira Baxter Apr 13 '13 at 21:40
  • If I can read (mentally build AST) most C++ sources then my program can also read (build sort of AST) C++ sources. It's not a CAPTCHA... With `(unparsed-code "here-is-some-tricky-C++-construct")` I except some simple thing to be able to round-trip C++ (and in general arbitrary text files) while also enabling editing some things. One more concern is preservance and factoring out whitespace/codestyle. I want to split a source into "essence" (code and comments) and "fluff" (whitespace, style, sometimes unimportand order of operations) parts with the second just for proper 1 to 1 reconstruction... – Vi. Apr 13 '13 at 22:11
  • 1
    Whie DMS may be The Right Tool For The Job, it seems too far on the left on "Enterprise -- Community" scale; I probably can't just download&compile, run some demos, see docs, depend on it in my FOSS tool and put that tool everywhere I feel it is OK. – Vi. Apr 13 '13 at 22:16
  • That mental AST just doesn't work on a program with any size or complexity. I've been doing this a long time and never seen anybody succeed with "I can build a simple model of my code" and not make a serious error as a consequence; you're welcome to try but I've seen enough lemmings. Regarding Whitespace: DMS doesn't keep that, but it does keep line/column positions (see the XML above) and can regenerate the code very close to the spacing of the original. ... – Ira Baxter Apr 13 '13 at 22:22
  • ... Yes, DMS is on the "Enterprise" side, true, because of its breadth. Whether you can put it everywhere you feel is OK is a commercial deal. People do it with Oracle DB, and that's clearly enterprise level too. – Ira Baxter Apr 13 '13 at 22:23
  • `you're welcome to try but I've seen enough lemmings` -> Can I read somewhere an overview of the failed attempts (what they wanted, why they failed, what they decided in the end)? – Vi. Apr 13 '13 at 22:31
  • Try Island Grammars http://www.program-transformation.org/Transform/IslandGrammars for a formal attempt to do this; I know of no practical systems based on this idea. The Transformers project http://www.lrde.epita.fr/cgi-bin/twiki/view/Transformers/Transformers , a well funded group, tried to build a C++ parser; after 10 years they seem to have given up. SrcML http://en.wikipedia.org/wiki/SrcML attempts to capture ASTs in XML but doesn't use industrial strength parsers; the SRCML site talks about but does not exhibit a collection of applications.... – Ira Baxter Apr 14 '13 at 03:07
  • There's an ANTLR C++ grammar, but the author of it explicitly turns it over to the community because he's tired of working on it; it has never been used in anger. The only working C++ manipulation systems I know about are Clang (mentioned in another answer) and DMS. Both are big efforts because of the complexity of C++. Clang is stronger at compilation than DMS. DMS is better at code manipulation and scale than Clang (personal opinion). There are other C++ manipulation systems (Elsa, OpenC++) but these were never really complete and have essentially given up the ghost. – Ira Baxter Apr 14 '13 at 03:12
  • None of these will tell you why they failed; authors don't get paid and don't like to write about failures. Mostly its because of not enough effort. Either you have a huge open source effort like Clang, or a commercial project like DMS, or it dies of starvation. You can try to avoid all this by trying to avoid parsing some code (island grammars) but you invariably find that the meaning of the code you care about is determined by the code you very carefully didn't parse... so you can't reason mechanically about it. – Ira Baxter Apr 14 '13 at 03:15
  • ... I forgot to mention Rose (from Dan Quinlan et al) as a working C++ code manipulation system. – Ira Baxter Apr 14 '13 at 20:55
  • Thanks for the discussion. As I currently don't have the real need of such system (I just like to plan/design various things), I'll probably consider doing similar thing for Java first. Java have simpler syntax, so "just for fun" can go further... – Vi. Apr 14 '13 at 22:09
  • It probably won't surprise you that DMS handles Java, too, in essentially the same way. – Ira Baxter Apr 14 '13 at 22:25