Abstract Syntax Tree for Source Code including Expressions

Question

I am building a new simple programming language (just to learn how compilers work in my free time).

I have already built a lexer which can tokenize my source code into lexemes.

However, I am now stuck on how to form an Abstract Syntax Tree from the tokens, where the source code might contain an expression (with operator precedence).

For simplicity, I shall include only 4 basic operators: +, -, /, and * in addition to brackets (). Operator precedence will follow BODMAS rule.

I realize I might be able to convert the expression from infix to prefix/postfix, form the tree and substitute it.

However, I am not sure if that is possible. Even if it is possible, I am not sure how efficient it might be or how difficult it might be to implement.

Is there some trivial way to form the tree in-place without having to convert to prefix/postfix first?

I came across the Shunting Yard algorithm which seems to do this. However, I found it to be quite a complicated algorithm. Is there something simpler, or should I go ahead with implementing the Shunting Yard algorithm?

Currently, the following program is tokenized by my lexer as follows:

I am demonstrating using a Java program for syntax familiarity.

Source Program:

public class Hello
{
    public static void main(String[] args)
    {
        int a = 5;
        int b = 6;
        int c = 7;

        int r = a + b * c;

        System.out.println(r);
    }
}

Lexer output:

public
class
Hello
{
public
static
void
main
(
String
[
]
args
)
{
int
a
=
5
;
int
b
=
6
;
int
c
=
7
;
int
r
=
a
+
b
*
c
;
System
.
out
.
println
(
r
)
;
}
}

The shunting yard algorithm is really pretty simple if you are using it for expressions; if it doesn't seem simple, you're probably doing it wrong. :-) If your language is simple enough, you could use it for the entire parse, but that's not very common any more. — rici, Dec 19 '18 at 17:58
@rici Ok. I think I should give the Shunting Yard algorithm another try. Thanks. — Pratanu Mandal, Dec 19 '18 at 18:27
@rici Also, by this, do you mean to say that the Shunting Yard algorithm is the best and most efficient method of parsing expressions in my scenario? — Pratanu Mandal, Dec 19 '18 at 18:30
Of interest: Rosetta Code [Arithmetic evaluation](http://rosettacode.org/wiki/Arithmetic_evaluation) and [Parsing/Shunting-yard algorithm](http://rosettacode.org/wiki/Parsing/Shunting-yard_algorithm) — Guy Coder, Dec 19 '18 at 19:31
@pratanu: I'm not saying that, no. It's not necessarily the fastest or most efficient; that depends on how you implement it. But it's easy to write, so it's efficient in programmer time. A parser generator like bison is almost certainly faster and probably easier to code once you understand the tool, but there's more to learn. — rici, Dec 19 '18 at 19:45
See my SO answer on how to build simple parsers easily. The answer shows how to build ASTs that way, too: https://stackoverflow.com/a/2336769/120163 — Ira Baxter, Jan 20 '19 at 05:19

exebook · Accepted Answer · 2018-12-26T17:27:24.987

    // I know this might look ugly that I use a global variable ret to return parsed subtrees
    // but please bear with it, I got used to this for various performance/usability reasons
    
    var ret, tokens
    
    function get_precedence(op) {
     // this is an essential part, cannot parse an expression without the precedence checker
     if (op == '*' || op == '/' || op == '%') return 14
     if (op == '+' || op == '-') return 13
     if (op == '<=' || op == '>=' || op == '<' || op == '>') return 11
     if (op == '==' || op == '!=') return 10
     if (op == '^') return 8
     if (op == '&&') return 6
     if (op == '||') return 5
     return 0
    }
    
    function parse_primary(pos) {
     // in the real language primary is almost everything that can be on the sides of +
     // but here we only handle numbers detected with the JavaScript 'typeof' keyword
     if (typeof tokens[pos] == 'number') {
      ret = {
       type: 'number',
       value: tokens[pos],
      }
      return pos + 1
     }
     else {
      return undefined
     }
    }
    
    function parse_operator(pos) {
     // let's just reuse the function we already wrote insted of creating another huge 'if'
     if (get_precedence(tokens[pos]) != 0) {
      ret = {
       type: 'operator',
       operator: tokens[pos],
      }
      return pos + 1
     }
     else {
      return undefined
     }
    }
    
    function parse_expr(pos) {
     var stack = [], code = [], n, op, next, precedence
     
     pos = parse_primary(pos)
     if (pos == undefined) {
      // error, an expression can only start with a primary
      return undefined
     }
     stack.push(ret)
    
     while (true) {
      n = pos
      pos = parse_operator(pos)
      if (pos == undefined) break
      op = ret
      pos = parse_primary(pos)
      if (pos == undefined) break
      next = ret
      precedence = get_precedence(op.operator)
      while (stack.length > 0 && get_precedence(stack[stack.length - 1].operator) >= precedence) {
       code.push(stack.pop())
      }
      stack.push(op)
      code.push(next)
     } 
    
     while(stack.length > 0) {
      code.push(stack.pop())
     }
     if (code.length == 1) ret = code[0]
     else ret = {
      type: 'expr',
      stack: code,
     }
     return n
    }
    
    function main() {
     tokens = [1, '+', 2, '*', 3]
     var pos = parse_expr(0)
     if (pos) {
      console.log('parsed expression AST')
      console.log(ret)
     }
     else {
      console.log('unable to parse anything')
     }
    }
    
    main()

Here is your bare-bones implementation of shunting yard expression parsing. This is written in JavaScript. This is as minimalistic and simple as you can get. Tokenizing is left off for brevity, you give the parse the array of tokens (you call them lexemes).

The actual Shunting Yard is the parse_expr function. This is the "classic" implementation that uses the stack, this is my preference, some people prefer functional recursion.

Functions that parse various syntax elements are usually called "parselets". here we have three of them, one for expression, others are for primary and operator. If a parselet detects the corresponding syntax construction at the position pos it will return the next position right after the construct, and the construct itself in AST form is returned via the global variable ret. If the parselet does not find what it expects it returns undefined.

It is now trivially simple to add support for parens grouping (, just extend parse_primary with if (parse_group())... else if (parse_number())... etc. In the meantime your parse_primary will grow real big supporting various things, prefix operators, function calls, etc.

Abstract Syntax Tree for Source Code including Expressions

1 Answers1