How is code represented in Abstract Syntax Tree?

Question

I am working on my custom gcc frontend, and have few, yet unanswered questions on AST structure.

In the context of parsing code of a program into an AST, is AST a tree or a forest?

How would simple example like this appear in AST? Say we have a var declaration, a function declaration and a main function with a couple of assignments. It's just an example.

Would it be something like this?

                  root

       /           |           \
      /            |            \
    var           func         main
   / | \       /   |   \       /   \
 int x  5    args foo  int   asgn asgn
             / \             / \   /  \
           var var          a   x x    2 
           ... ...

So, if I read tree in infix order, I'll get the actual code sequence?

Updated.

Are you asking a question in general about ASTs, or are you asking what a specific version of GNU does? — Ira Baxter, May 24 '14 at 17:01
Yes, I am asking in general. I need to implement it on my own, so I am trying to understand the structure. If I understand it the right way, AST should keep all the information value needed for further processing (ast + symbol table), as I need to build GENERIC tree (gcc) from it. — notnull, May 24 '14 at 18:23
"All the information value needed for further processing" ... an AST can't do that. It represents the *syntactic* structure of the program and nothing else. "(information) needed for further processing" includes symbol tables, control flow sequencing, how data flows between program elements, interference of data lifetimes, ... none of these things are represented by ASTs. *Every* representation is good at representing specific things, and terrible and representing others. Consequently there are many information/program representations. — Ira Baxter, May 24 '14 at 19:44
Can you please provide me with some code (simple program, more than just a single expression or assignment - because I know how to build such a simple AST) and it's AST representation example? I'd appreciate it. — notnull, May 25 '14 at 20:04
See an AST for a simple Java program at http://stackoverflow.com/questions/6376662/how-a-ast-for-an-object-oriented-programming-language-would-look-like/6378997#6378997 — Ira Baxter, May 25 '14 at 22:22

Ira Baxter · Answer 1 · 2014-07-03T19:36:08.650

In general, an "AST" doesn't contain "the actual code sequence" no matter how you rearrange it. That is why it is called "abstract": detail has been dropped. It contains enough information to represent essentially what the program text said.

A concrete syntax tree arguably contains enough information to regenerate the source; it takes effort to store enough to regenerate the original program, especially if you want literal formatting information such as number radix and number of trailing zeros after a fraction.

Whether the information is stored for access in-order, pre-order or post-order or different for every node is a matter of implementation. (Most AST and CST trees tend to match the program text order when traversed in-order).

[EDIT 7/3/2014 to answer question about "single tree vs. forest" as a result]

As a general rule, one expects a single source file to be parsed to a single AST. However, real programs have lots of interesting complications; for instance, #include statements in C reference another file (abstractly expanded in place), and package references in Java reference another file. So, if a single source file references many other source files, and the single source file is parsed, should one AST for the single source file be produced, with no ASTs for the other files?

The answer to this question depends on the nature of the tool you are building. C compilers will expand the #include in place, and parsing a C file tends to produce one AST. Java compilers do not expand package references in place, but may need to parse the package source file to understand what it contains; in this case, you tend to get one many ASTs, one for the "main" program, and one for each package it references, and one for each (unique) package they reference, etc. If your tool isn't compiling, but wants to modify the C source code, you probabaly don't want to expand the #include in place, and so such a tool would have one source file for the main C program, and one for each #include encountered. So depending on your purpose, your (complex) parser may produce just one AST, or a set (which is typically called a "forest").

score 1 · Answer 2 · answered May 25 '14 at 18:01

You can dispel any doubt with a useful Eclipse plug-in that allows you to view the ast from anywhere inside the code: - install plug-in - make position (cusror) anywhere in your source code - go to in "Ast view" - click on "hierarchic" button of view - view the complete Ast

Plugin link from google query search "eclipse plugin ast view" P.s.: ast not is a forest but a "tree"

bye

How is code represented in Abstract Syntax Tree?

2 Answers2