2

Hello I am building my first programming language with sly module in python. I was following some tutorials and I have a question about how the parser behaves. This is the code:

from sly import Lexer
from sly import Parser

class BasicLexer(Lexer):
    tokens = { NAME, NUMBER, STRING, IF, THEN, ELSE, FOR, FUN, TO, ARROW, EQEQ }
    ignore = '\t '

    literals = { '=', '+', '-', '/', '*', '(', ')', ',', ';' }

    # Define tokens
    IF = r'IF'
    THEN = r'THEN'
    ELSE = r'ELSE'
    FOR = r'FOR'
    FUN = r'FUN'
    TO = r'TO'
    ARROW = r'->'
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    STRING = r'\".*?\"'

    EQEQ = r'=='

    @_(r'\d+')
    def NUMBER(self, t):
        t.value = int(t.value)
        return t

    @_(r'#.*')
    def COMMENT(self, t):
        pass

    @_(r'\n+')
    def newline(self,t ):
        self.lineno = t.value.count('\n')

class BasicParser(Parser):
    tokens = BasicLexer.tokens

    precedence = (
        ('left', '+', '-'),
        ('left', '*', '/'),
        ('right', 'UMINUS'),
        )

    def __init__(self):
        self.env = { }
    @_('')
    def statement(self, p):
        pass

    @_('FOR var_assign TO expr THEN statement')
    def statement(self, p):
        return ('for_loop', ('for_loop_setup', p.var_assign, p.expr), p.statement)

    @_('IF condition THEN statement ELSE statement')
    def statement(self, p):
        return ('if_stmt', p.condition, ('branch', p.statement0, p.statement1))

    @_('FUN NAME "(" ")" ARROW statement')
    def statement(self, p):
        return ('fun_def', p.NAME, p.statement)

    @_('NAME "(" ")"')
    def statement(self, p):
        return ('fun_call', p.NAME)

    @_('expr EQEQ expr')
    def condition(self, p):
        return ('condition_eqeq', p.expr0, p.expr1)

    @_('var_assign')
    def statement(self, p):
        return p.var_assign

    @_('NAME "=" expr')
    def var_assign(self, p):
        return ('var_assign', p.NAME, p.expr)

    @_('NAME "=" STRING')
    def var_assign(self, p):
        return ('var_assign', p.NAME, p.STRING)

    @_('expr')
    def statement(self, p):
        return (p.expr)

    @_('expr "+" expr')
    def expr(self, p):
        return ('add', p.expr0, p.expr1)

    @_('expr "-" expr')
    def expr(self, p):
        return ('sub', p.expr0, p.expr1)

    @_('expr "*" expr')
    def expr(self, p):
        return ('mul', p.expr0, p.expr1)

    @_('expr "/" expr')
    def expr(self, p):
        return ('div', p.expr0, p.expr1)

    @_('"-" expr %prec UMINUS')
    def expr(self, p):
        return ('neg', p.expr)

    @_('NAME')
    def expr(self, p):
        return ('var', p.NAME)

    @_('NUMBER')
    def expr(self, p):
        return ('num', p.NUMBER)

if __name__ == '__main__':
    lexer = BasicLexer()
    parser = BasicParser()
    env = {}
    while True:
        try:
            text = input('basic > ')
        except EOFError:
            break
        if text:
            tree = parser.parse(lexer.tokenize(text))
            print(tree)

From what I understand, the condition rule is non terminal and the BNF representation is: condition : expr EQEQ expr. If I give as input "a == b" I get the error sly: Syntax error at line 1, token=EQEQ ('var', 'b'). I know this is the wanted behavior as there's no meaning in giving this input outside a statement, but I want to understand why the condition method is not recalled. Thanks!

Mazzola
  • 75
  • 7

2 Answers2

1

Although we call this technique bottom-up parsing, it's really still top-down in the sense that the grammar is applied starting with what will be the root of the parse, which is the start symbol, on this case statement.

What makes the parse "bottom-up" is the construction of the parse tree. Unlike top-down parsing, in which the next tree node is immediately predicted (and added to the parse tree) before the child nodes are parsed, the bottom-up parser first parses the child nodes and only once all the children have been reduced does it add the parent node. Thus, the root node is the last thing added to the tree; the tree has been built from the bottom up. [Note 1]

This gives the parser a lot more flexibility because it doesn't need to know for certain which production to predict. It can predict a number of possibilities and explore them on parallel. (This can be done in linear time using a state machine.)

In effect, condition only becomes relevant when it might be the next thing to parse. In your grammar, that only happens after an IF token, so it doesn't apply to your sample input.

Here's a quick summary of the parsing process. (If you haven't been exposed to finite state machines, this might not be so easy to follow.) It's useful to know that:

  • An item is a single production from the grammar with a position marker (written as •) which indicates the current input position.
  • An itemset is a set of items. The parser has a finite state machine whose states are itemsets. Since there are a finite number of productions, there are a finite number of possible items and thus a finite number of itemsets. That makes it possible to precompute the finite state machine.
  • If an item has the position mark at the end, then the item can be reduced; the fact that the position mark is at the end of the item indicates that the entire production has been recognised. Reducing a production is what actually creates a parse-tree node (or moral equivalent).
  • Since this parser has lookahead (of one symbol), each reducible item also has a lookahead set, which is just a set of terminals. The reduction can only be performed if the next input symbol is in this set. I'm not going to get into how this set is computed. This set is usually written after the item surrounded in [...].
  • If the position mark is anywhere else in the item, either a shift or goto transition is possible. A shift is performed when a position mark in the current state is just before a terminal and that terminal is the next input token. A goto transition is performed when a position mark is just before a non-terminal and that non-terminal has just been reduced. The goto transition is taken from the state which was active when recognition of the non-terminal started. (The parser uses stack to keep track of state history.)
  • Both shifts and gotos are handled by selecting the items from the itemset in which the position mark precedes the terminal or non-terminal which has been recognised, and advancing the position mark in those items. The selected items are the itemset for the next state (but see the following point).
  • If the position mark is at the beginning of an item which starts with a non-terminal, all of the productions for that non-terminal are added to the itemset, also with their position marks at the beginning. (This may need to be done repetitively.)
  • The initial state for the state machine is an item representing the grammar's start symbol followed by an end-of-input mark (usually written as $), with the position mark at the beginning. This itemset is expanded as per the step above.

With your grammar, the state machine's starting state is the itemset →•:

START      → • statement $
statement  → •                 [ $ ]
statement  → • FOR var_assign TO expr THEN statement
statement  → • IF condition THEN statement ELSE statement
statement  → • FUN NAME "(" ")" ARROW statement
statement  → • NAME "(" ")"
statement  → • expr
statement  → • var_assign
expr       → • expr "+" expr
expr       → • expr "-" expr
expr       → • expr "*" expr
expr       → • expr "/" expr
expr       → • "-" expr
expr       → • NAME
expr       → • NUMBER
var_assign → • NAME "=" expr
var_assign → • NAME "=" STRING

Note that no production for condition is in this itemset, because that cannot be at the beginning of a sentence generated from the start symbol.

The first input token is a, which is a NAME. That disallows the reduction of statement → •, so the next state consists of the items with NAME following the marker:

statement  → NAME • "(" ")"
expr       → NAME •            [ "+", "-", "*", "/", ELSE, $ ]
var_assign → NAME • "=" expr
var_assign → NAME • "=" STRING

Now the next token is EQEQ. At this point, the state machine is stuck. No shift action is possible and the only reduce action does not have == in its lookahead set. So a syntax error is produced.


Notes

  1. Confusingly, mathematical trees, unlike the trees outside your window, have the root at the top and the leaves on the bottom. Also, it's possible that the tree isn't really built; that's up to you. But there is always an implicit tree defined by the order the action functions are executed: children first, finally the parent.
rici
  • 234,347
  • 28
  • 237
  • 341
  • That's a bit confusing. The parser is top down but it's construction is bottom-up. This should be the reason why I get the parsing error. Can you provide an example of a step by step tree construction when the parser finds the input `a==b`? I think it would clarify my doubts. Thanks a lot. – Mazzola Aug 10 '21 at 09:10
  • 1
    @Mazzola: Added a description of the parse up to the syntax error, which was longer than I would have preferred. Hope it's of some use. – rici Aug 11 '21 at 17:23
0

If you make the condition an expr it will work:

def expr(self, p):
        return ('if_stmt', p.expr, ('branch', p.statement0, p.statement1))

and

def expr(self, p):  # was: condition
        return ('condition_eqeq', p.expr0, p.expr1)