
I have a lexer:

import re
from sly import Lexer

class BasicLexer(Lexer):
    tokens = {OBJECT, FUNCTION}
    ignore = '.'

    OBJECT = r'object\(\"(.*?)\"\)'
    FUNCTION = r'function\(\"(.*?)\"\)'

    def OBJECT(self, t):
        match = re.search(r'object\("(.*?)"\)', t.value)
        t.value = match.group(1)
        return t

    def FUNCTION(self, t):
        match = re.search(r'function\("(.*?)"\)', t.value)
        t.value = match.group(1)
        return t


When I run it, it returns two tokens:

if __name__ == '__main__':
    data = '''object("cars").function("work")'''
    lexer = BasicLexer()
    for tok in lexer.tokenize(data):
        print('type=%r, value=%r' % (tok.type, tok.value))

type='OBJECT', value='cars'
type='FUNCTION', value='work'

Now, creating the parser:

from sly import Parser

class BasicParser(Parser):
    tokens = BasicLexer.tokens

    def __init__(self):
        self.env = { }

    @_('')
    def statement(self, p):
        pass

    @_('OBJECT')
    def statement(self, p):
        return ('object', p.OBJECT)

    @_('FUNCTION')
    def statement(self, p):
        return ('function', p.FUNCTION)

if __name__ == '__main__':
    lexer = BasicLexer()
    parser = BasicParser()
    text = '''object("cars").function("work")'''
    result = parser.parse(lexer.tokenize(text))
    print(result)

This returns the following error:

sly: Syntax error at line 1, token=FUNCTION
None

For some reason, it can't parse when lexer.tokenize(text) returns a generator that yields multiple tokens. Any idea why?

  • Which of your productions recognizes two `statements`? – rici Oct 21 '22 at 02:16
  • Also, your tokenizer is doing too much parsing. – rici Oct 21 '22 at 02:20
  • Do I need one? @rici Can you show me how? – sshussain270 Oct 21 '22 at 02:21
  • Your grammar needs to describe your language. If your language is "a sequence of statements", then you need to write a grammar which says that. If your language is "zero or one statement" then your current grammar is fine, but I don't think that's really what you want. – rici Oct 21 '22 at 02:26
  • So I need to add something like `@_('OBJECT . FUNCTION')`? – sshussain270 Oct 21 '22 at 02:32
  • If that's what you want to parse. (Except that you are ignoring `.`, which is probably a bad idea.) I think you should try to describe what your language looks like, instead of trying to find grammar snippets to copy. Try to describe it as simply as possible, but also as accurately as possible. The formal grammar should be very similar to the way you would describe your language to another programmer, or the way that languages have been described to you. – rici Oct 21 '22 at 03:00
  • And try to get a better grasp on the concept of "token"; basically, a token is something without internal structure, or whose internal structure doesn't contribute to the syntax of the language. (Numbers have internal structure, in the sense that each digit is interpreted according to where it is in the number. But that's not relevant to the parse. On the other hand, `function("argument")` clearly has important internal structure, since you use a regular expression to pick it apart. It would be better to treat that as four tokens: `function`, `(`, `"argument"`, and `)`.) – rici Oct 21 '22 at 03:02
  • @sshussain270 Did you ever get a chance to look into this again? – Sean Duggan Nov 14 '22 at 03:07

1 Answer


To answer the "why" first: your grammar derives at most one statement, and a statement is at most one token, so once the parser has reduced OBJECT to a statement it expects end-of-input, and the FUNCTION token that follows is a syntax error. Beyond that, as noted above in rici's comments, you probably need to take a step or two back and decompose your object and function tokens into smaller pieces that can be processed. Something like the following may make more sense:

from sly import Lexer

class BasicLexer(Lexer):
    tokens = {OBJECT, FUNCTION, LPAREN, RPAREN, SCONST, DOT}

    # String literal
    SCONST = r'\"([^\\\n]|(\\.))*?\"'

    def SCONST(self, t):
        t.value = t.value[1:-1]  # strip the quotation marks
        return t

    DOT = r'\.'
    LPAREN = r'\('
    RPAREN = r'\)'
    
    OBJECT = r'object'
    FUNCTION = r'function'
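
One caveat worth flagging (my note, not part of the original answer): this lexer defines no ignore characters, so any whitespace in the input would raise a lexer error. The sample input happens to contain none, but if yours might, a single line in the class body should be enough:

    ignore = ' \t'  # assumption: spaces and tabs carry no meaning in this language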

Then, for your parser, you'd be building something more like the following:

from sly import Parser

class BasicParser(Parser):
    tokens = BasicLexer.tokens

    def __init__(self):
        self.env = { }

    @_('object DOT function')
    def statement(self, p):
        return ('statement', p.object, p.function)

    @_('OBJECT LPAREN SCONST RPAREN')
    def object(self, p):
        return ('object', p.SCONST)

    @_('FUNCTION LPAREN SCONST RPAREN')
    def function(self, p):
        return ('function', p.SCONST)

if __name__ == '__main__':
    lexer = BasicLexer()
    parser = BasicParser()
    text = '''object("cars").function("work")'''

    # tokenize() returns a generator, so this loop exhausts it; the input
    # is tokenized a second time below for the actual parse.
    for tok in lexer.tokenize(text):
        print('type=%r, value=%r, lineno=%r, index=%r, end=%r' % (tok.type, tok.value, tok.lineno, tok.index, tok.end))

    result = parser.parse(lexer.tokenize(text))
    print(result)
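
Run on the sample input, that should print something close to the following (untested here; the index/end values are simply the character offsets implied by the input string), ending with the parse result:

type='OBJECT', value='object', lineno=1, index=0, end=6
type='LPAREN', value='(', lineno=1, index=6, end=7
type='SCONST', value='cars', lineno=1, index=7, end=13
type='RPAREN', value=')', lineno=1, index=13, end=14
type='DOT', value='.', lineno=1, index=14, end=15
type='FUNCTION', value='function', lineno=1, index=15, end=23
type='LPAREN', value='(', lineno=1, index=23, end=24
type='SCONST', value='work', lineno=1, index=24, end=30
type='RPAREN', value=')', lineno=1, index=30, end=31
('statement', ('object', 'cars'), ('function', 'work'))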

Note that this only allows a single statement.
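
If you do eventually want a sequence of statements, as rici's comments suggest, one way is to add a left-recursive list rule and make it the start symbol. This is only a sketch on my part (the statements rule name and the start attribute are my additions, not something from the question):

from sly import Parser

class BasicParser(Parser):
    tokens = BasicLexer.tokens
    start = 'statements'  # parse a list of statements rather than a single one

    # Left-recursive list: a lone statement, or a list followed by one more.
    @_('statements statement')
    def statements(self, p):
        return p.statements + [p.statement]

    @_('statement')
    def statements(self, p):
        return [p.statement]

    @_('object DOT function')
    def statement(self, p):
        return ('statement', p.object, p.function)

    @_('OBJECT LPAREN SCONST RPAREN')
    def object(self, p):
        return ('object', p.SCONST)

    @_('FUNCTION LPAREN SCONST RPAREN')
    def function(self, p):
        return ('function', p.SCONST)

Statements here are simply juxtaposed; if you want an explicit separator (a newline, a semicolon), you would add a token for it and put it between statements and statement in the first rule. You would also want the lexer to ignore whitespace, as mentioned above.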
