
In Short

I need some kind of AST representation of the GCC and Clang source code. Because of their size and complexity, I cannot find an easy way to get one.

The Details

For a project that compares large programs by their ASTs (similar to how DECKARD does it), I need AST representations of the GCC and Clang source code. (Note that I do not necessarily need a single AST. I am completely content with receiving one AST per translation unit, and I don't need a symbol table or headers.)

After some research I found a few possibilities on how to get the AST. However, all of those seem to have their own issues:

  1. Using the frontend of Clang with clang -emit-ast foo.c. - This seems to work well for small projects but managing all include paths for the GCC source code has proved to be difficult, resulting in many "undeclared type/identifier" errors.
  2. Using the frontend of Clang with clang -Xclang -ast-dump foo.c >> a.xml. - Same issue as above, but some XML output is still produced, so the XML would have to be parsed. (Also: Is this output incomplete/erroneous?)
  3. Writing a (F)LEX + YACC/BISON parser for C++ along the lines of FOG. - This sounds like a lot of effort and prone to errors.
  4. Using the frontend of GCC: gcc -fdump-tree-all-graph foo.c. - The generated .dot file(s) would have to be parsed, so I would again have to write a (F)LEX + YACC/BISON parser. Also, I suppose the same "undeclared symbols" issue as with option 1 might arise.
  5. Using the DMS software suggested by this answer. - This software is proprietary.

My Questions

  • Does anyone have a comparatively simple idea on how to proceed?
  • Are the XML files of option 2 erroneous or missing AST nodes?
  • Is there a clang flag that suppresses the "undeclared symbol/identifier" issues?
  • Is there an easier way to find all required include paths than going through each file individually or trying to understand the 31k lines of the corresponding autogenerated GCC Makefile?
  • Is the FOG parser of option 3 hard to adapt to output some kind of AST representation?
  • Do other (reliable) sources for C++ LEX and YACC files exist somewhere? (I know a C version exists here.)
  • Are there other options that I do not see to get AST representations of GCC and Clang?

Thanks a lot in advance.

Marc
  • So your main issue is to extract include paths and defines by translation unit(TU)? Clang provides [JSONCompilationDatabase](https://clang.llvm.org/docs/JSONCompilationDatabase.html)... – Jarod42 Jun 17 '21 at 13:17
  • @Jarod42 so, if I understand that correctly, that would give me a better way of specifying the include paths (rather than passing them with -I). However, I would still have to specify them by hand, right? – Marc Jun 17 '21 at 13:42
  • If your big project uses CMake, it can be generated. So it mostly depends on your toolchain (a potential hack, if your toolchain doesn't provide a clean way, might be to "replace" the CXX command by a custom tool that saves the flags instead of compiling). – Jarod42 Jun 17 '21 at 13:51
  • @Jarod42 thanks for the idea. I'll try to figure out how to auto-generate the JSON file. Also the hack you propose sounds interesting. – Marc Jun 18 '21 at 07:55
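
The compilation-database idea from the comments could be sketched roughly as follows. This is only an illustration, not a tested recipe for GCC or Clang: it assumes a compile_commands.json already exists (for a CMake project, via CMAKE_EXPORT_COMPILE_COMMANDS) and that the recorded flags are ones clang accepts, which may not hold for every GCC-specific option. The helper names entry_args and ast_dump_command are made up for this example. Each recorded command is rewritten to drop -c and -o <file> and to add -fsyntax-only -Xclang -ast-dump=json, so the include paths and defines do not have to be collected by hand.

```python
import json
import shlex
import subprocess
from pathlib import Path


def entry_args(entry):
    """Return the argument list of one compilation database entry.

    An entry carries either an "arguments" list or a "command" string.
    """
    if "arguments" in entry:
        return list(entry["arguments"])
    return shlex.split(entry["command"])


def ast_dump_command(entry, clang="clang"):
    """Rewrite a recorded compile command into a Clang AST-dump command.

    Hypothetical helper: keeps -I/-D and all other flags, drops "-c" and
    "-o <file>", and replaces the original compiler with `clang`.
    """
    kept = []
    skip_next = False
    for arg in entry_args(entry)[1:]:  # [1:] drops the original compiler
        if skip_next:  # this is the file name that followed "-o"
            skip_next = False
            continue
        if arg == "-o":
            skip_next = True
            continue
        if arg == "-c":
            continue
        kept.append(arg)
    return [clang, "-fsyntax-only", "-Xclang", "-ast-dump=json", *kept]


def dump_all(db_path, out_dir):
    """Write one JSON AST file per translation unit listed in the database."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = json.loads(Path(db_path).read_text())
    for index, entry in enumerate(entries):
        target = out_dir / f"{index}_{Path(entry['file']).name}.ast.json"
        with target.open("w") as out:
            # Run in the recorded directory so relative include paths resolve.
            subprocess.run(ast_dump_command(entry), cwd=entry["directory"],
                           stdout=out, check=False)
```

Calling dump_all("compile_commands.json", "asts") would then produce one file per translation unit. The run uses check=False so that a single translation unit that fails to parse does not abort the whole pass; whether the partial output of such a unit is still usable would have to be checked separately.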
