Abstract Syntax Tree representation in C++

Question

I already have the tokenizer interface which creates a list of tokens. I have the working mechanism for the parser. It's really unique and works like a charm. The only thing that i miss is the basic structure of the AST. How the tree, the nodes and the statements should be represented on abstraction level. I don't need any implementation only a quick idea how should it look like in class hierarchy? I'm working on an object-oriented language. Yeah, i already realized that i will need two types of statements. Some value returning "expression" type statement and a non-returning, instruction flow controlling type statement. Thank you very much.

I recommend looking into the CLANG/LLVM project. Very solid, opensource compiler, with a lot going for it. It will help you get started, and digging through their forums/code will provide a better answer than anyone could in a stack overflow post. http://clang.llvm.org/docs/InternalsManual.html — MobA11y, Aug 12 '13 at 13:48
@ChrisCM: One issue with Clang, what they call `AST` is in fact an `ABT`, that is the *bindings* (this identifier was declared here, the type of this variable is X) are already resolved. It's easier to actually do this in a two-phases algorithm 1/ generate AST 2/ resolve bindings. — Matthieu M., Aug 12 '13 at 15:55
A correct clarification, but one I think unnecessary within the realm of using it as a starting point for understanding syntax trees. The types of nodes you need, structure, statements, etc... are similar in both schemes, and it is this that the OP is curious about. — MobA11y, Aug 12 '13 at 16:01
@HelloGoodbye: It's the best name I know of for Abstract Bindings Tree. That is, an AST enriched with semantic information: names are resolved, expressions and variables are typed, ... — Matthieu M., Dec 13 '18 at 07:42

Lake · Accepted Answer · 2013-08-12T14:27:06.883

22

If your language is imperative/c-like, the common scenario begins with top-hierarchy being split in 2 supertypes:

Expression
Statement

The program is a list of statement, which is a statement itself.

You will probably want to have one class for type of statement that extends the statement base class.

A typical scenario looks like:

statement block ( a list of statements )
ite (if then else)
for (a for loop with its initialization statements list, check expression, increment statements, and block
while (similar, but only check expression
variable declaration
assignment (including += -= ++ --, you can wrap all in one class with an operator field, a lval and an rval)
Function call (void one)

For the expressions:

Bop (binary operation, anything that has 2 operands and 1 operator i.e. + - * / % | & && || == <
Uop (unary operation, anything that has 1 operand and 1 operator i.e. ~ !)
Function call (not-void ones)
Conditional expression ( exp ? true val : false val )

The good thing of having this 2 abstractions (expressions and statements) is that inside all your classes, you will have abstract types, and will be able to visit the AST with the visitor pattern, for example.

For example, some classes would look like this (pseudocode):

class Ite extends Statement {
   Expression condition;
   Statement ifBranch;
   Statement elseBranch;
}


class Bop extends Expression {
   BOperator operator;  // +, -. * or whatever
   Expression left;     // Left operand
   Expression right;    // Right operand
}


class StatementBlock extends Statement {
   List<Statement> statements;
}


class Assignment extends Statement {
   AOperator assignOp;  // = += -= etc.
   LVal lvalue;         // The lvalue cannot be an arbitrary expression, you will usually have a specific type for it
   Expression rvalue;   // Right value
}

Also, you will need some way to represent types (for the AST, just static types are enough, if you project to implement some back-end as well, you will need some dynamic types too).

Static types can usually be specified with some enumerations, if you don't plan to support fixed-size arrays which need a size information. If you want fixed-size arrays with, size, you can implement one class for type and have the array type hold additional size information.

enum Type {
   CHAR,
   SHORT,
   INT,
   LONG,
   FLOAT,
   DOUBLE,
   ARRAY
}

class Float extends StaticType {
    final Type type = Type.FLOAT;
}

class Array extends StaticArray {
    final Type type = Type.ARRAY;

    int size;
}

You will then instantiate one StaticType instance for every type in the AST, for example when the user declares a variable. You will be able to use the same hierarchy if you plan to do static type-checking in the future, also.

As for running/interpreting the code in the AST form, you will need a Memory which will hold a stack/heap containing information about runtime memory. At that point you will need to store values together with their type information.

edited Aug 12 '13 at 14:27

answered Aug 12 '13 at 14:04

Lake

4,072
26
36

And how should i arrange these in a tree hierarchy? I need some kind of 'AST' and 'ASTNode' classes to walk thru the whole tree. – Daniel Adamko Aug 12 '13 at 14:20
In my solution, you don't actually need all nodes to be of the same type. You will have a root which is a StatementBlock, and it will contain a list of Statement s. Those statements will contain the subtree that their class defines. As you can see i defined the AST hierarchy in a recursive way, so that you can browse the whole tree with pattern visitor very simply. It is a very object-oriented approach, if you prefer not to use inheritance, you can have all ASTNode nodes with their type represented by an enumeration instead of having a different class for each node type. – Lake Aug 12 '13 at 14:23
The problem is i will dont know the type of the derived if i require a Statement type for example. I will only know it's a Statement nothing more. – Daniel Adamko Aug 12 '13 at 14:34
1

For that, you will have to use a very powerful design pattern called **pattern visitor**! See my other answer here, if it is the first time you see it you will need some time to understand: http://stackoverflow.com/questions/16165640/how-to-write-visitor-pattern-for-a-abstract-syntax-tree-in-c/18036236#18036236 – Lake Aug 12 '13 at 14:35
I said you don't actually need all nodes to be of the same type before, actually they will have to inherit from the same base class defined by the pattern visitor. I admit that was a bit unclear, sorry ^^ Its all written in the link! – Lake Aug 12 '13 at 14:41
Oh, don't mind about the language, if you use inheritance, they will all be the same ;) Take it as pseudocode. – Lake Aug 12 '13 at 14:42
Actually in c++ we don't have such magic like 'Type' as a return value. Life would be more easier with it, i agree. – Daniel Adamko Aug 12 '13 at 14:51
1

Actually i didn't suggest to return objects of type Class, just to wrap the *type* you will use in your language with an abstract class named Type. I am not using reflection at all, everything will work perfectly with C++! – Lake Aug 12 '13 at 15:09
Oh my god, it is working. Finally an easy solution i havent thought until now:) – Daniel Adamko Aug 12 '13 at 17:44
@Lake sorry to bother you, i'm making an interpreter in c++ with a similar concept behind, but i had to stop because i've no idea how to handle gotos, and without gotos you can't implement any break for cycles for example. My current structure has statements as you said, with blocks which hold a list of statements and memorizes which variables have been allocated in that block; so block destructor frees the memory. But i've no idea how to put a jump in that kind of structure – Barnack Oct 03 '18 at 17:50
@Barnack Syntactically I would handle it by having a "Label" node with a string value (the label you jump to), and then a "Goto" node with a string value (the goto that jumps to some label). Both node classes would inherit from Statement; the Goto node would only have a string "name" field. The Label node would have a string "name" field, and a List of Statements (just like the BlockStatement), so that all subsequent statements belong to that Label (until the next label). You can then implement the semantics by having a labels hashmap and executing the proper Label when a Goto is executed. – Lake Oct 04 '18 at 09:31
@Barnack As for "break" like statements, you can implement the behaviour of it directly in your semantics without changing anything about the syntax. In the semantic implementation of your for loop, you will have a loop that executes all the statements inside that loop. If you encounter a "break" statement, you will save that result somewhere, and you will break the loop so that all subsequent statements are not executed. – Lake Oct 04 '18 at 09:37

Abstract Syntax Tree representation in C++

1 Answers1