
I have a lexer:

import re
from sly import Lexer

class BasicLexer(Lexer):
    tokens = {OBJECT, FUNCTION}
    ignore = '.'

    OBJECT = r'object\(\"(.*?)\"\)'
    FUNCTION = r'function\(\"(.*?)\"\)'

    def OBJECT(self, t):
        match = re.search(r'object\("(.*?)"\)', t.value)
        t.value = match.group(1)
        return t

    def FUNCTION(self, t):
        match = re.search(r'function\("(.*?)"\)', t.value)
        t.value = match.group(1)
        return t


When I run it, it returns two tokens:

if __name__ == '__main__':
    data = '''object("cars").function("work")'''
    lexer = BasicLexer()
    for tok in lexer.tokenize(data):
        print('type=%r, value=%r' % (tok.type, tok.value))

type='OBJECT', value='cars'
type='FUNCTION', value='work'

Now, creating the parser:

from sly import Parser

class BasicParser(Parser):
    tokens = BasicLexer.tokens

    def __init__(self):
        self.env = { }

    @_('')
    def statement(self, p):
        pass

    @_('OBJECT')
    def statement(self, p):
        return ('object', p.OBJECT)

    @_('FUNCTION')
    def statement(self, p):
        return ('function', p.FUNCTION)

if __name__ == '__main__':
    lexer = BasicLexer()
    parser = BasicParser()
    text = '''object("cars").function("work")'''
    result = parser.parse(lexer.tokenize(text))
    print(result)

This returns the following error:

sly: Syntax error at line 1, token=FUNCTION
None

For some reason, it can't parse when lexer.tokenize(text) returns a generator that yields multiple tokens. Any idea why?

  • Which of your productions recognizes two `statements`? – rici Oct 21 '22 at 02:16
  • Also, your tokenizer is doing too much parsing. – rici Oct 21 '22 at 02:20
  • Do I need one? @rici Can you show me how? – sshussain270 Oct 21 '22 at 02:21
  • Your grammar needs to describe your language. If your language is "a sequence of statements", then you need to write a grammar which says that. If your language is "zero or one statement" then your current grammar is fine, but I don't think that's really what you want. – rici Oct 21 '22 at 02:26
  • So I need to add something like `@_('OBJECT . FUNCTION')`? – sshussain270 Oct 21 '22 at 02:32
  • If that's what you want to parse. (Except that you are ignoring `.`, which is probably a bad idea.) I think you should try to describe what your language looks like, instead of trying to find grammar snippets to copy. Try to describe it as simply as possible, but also as accurately as possible. The formal grammar should be very similar to the way you would describe your language to another programmer, or the way that languages have been described to you. – rici Oct 21 '22 at 03:00
  • And try to get a better grasp on the concept of "token"; basically, a token is something without internal structure, or whose internal structure doesn't contribute to the syntax of the language. (Numbers have internal structure, in the sense that each digit is interpreted according to where it is in the number. But that's not relevant to the parse. On the other hand, `function("argument")` clearly has important internal structure, since you use a regular expression to pick it apart. It would be better to treat that as four tokens: `function`, `(`, `"argument"`, and `)`.) – rici Oct 21 '22 at 03:02
  • @sshussain270 Did you ever get a chance to look into this again? – Sean Duggan Nov 14 '22 at 03:07

1 Answer


To answer the "why" first: your grammar derives at most one statement, and a statement is at most one token, so once the parser has reduced OBJECT to a statement it expects end-of-input, and the FUNCTION token that follows is a syntax error. Beyond that, as noted above in rici's comments, you probably need to take a step or two back and decompose your object and function tokens into smaller pieces that can be processed. Something like the following may make more sense:

from sly import Lexer

class BasicLexer(Lexer):
    tokens = {OBJECT, FUNCTION, LPAREN, RPAREN, SCONST, DOT}

    # String literal
    SCONST = r'\"([^\\\n]|(\\.))*?\"'

    def SCONST(self, t):
        t.value = t.value[1:-1]  # strip the quotation marks
        return t

    DOT = r'\.'
    LPAREN = r'\('
    RPAREN = r'\)'
    
    OBJECT = r'object'
    FUNCTION = r'function'
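
One caveat worth flagging (my note, not part of the original answer): this lexer defines no ignore characters, so any whitespace in the input would raise a lexer error. The sample input happens to contain none, but if yours might, a single line in the class body should be enough:

    ignore = ' \t'  # assumption: spaces and tabs carry no meaning in this language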

Then, for your parser, you'd be building something more like the following:

from sly import Parser

class BasicParser(Parser):
    tokens = BasicLexer.tokens

    def __init__(self):
        self.env = { }

    @_('object DOT function')
    def statement(self, p):
        return ('statement', p.object, p.function)

    @_('OBJECT LPAREN SCONST RPAREN')
    def object(self, p):
        return ('object', p.SCONST)

    @_('FUNCTION LPAREN SCONST RPAREN')
    def function(self, p):
        return ('function', p.SCONST)

if __name__ == '__main__':
    lexer = BasicLexer()
    parser = BasicParser()
    text = '''object("cars").function("work")'''

    # tokenize() returns a generator, so this loop exhausts it; the input
    # is tokenized a second time below for the actual parse.
    for tok in lexer.tokenize(text):
        print('type=%r, value=%r, lineno=%r, index=%r, end=%r' % (tok.type, tok.value, tok.lineno, tok.index, tok.end))

    result = parser.parse(lexer.tokenize(text))
    print(result)
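
Run on the sample input, that should print something close to the following (untested here; the index/end values are simply the character offsets implied by the input string), ending with the parse result:

type='OBJECT', value='object', lineno=1, index=0, end=6
type='LPAREN', value='(', lineno=1, index=6, end=7
type='SCONST', value='cars', lineno=1, index=7, end=13
type='RPAREN', value=')', lineno=1, index=13, end=14
type='DOT', value='.', lineno=1, index=14, end=15
type='FUNCTION', value='function', lineno=1, index=15, end=23
type='LPAREN', value='(', lineno=1, index=23, end=24
type='SCONST', value='work', lineno=1, index=24, end=30
type='RPAREN', value=')', lineno=1, index=30, end=31
('statement', ('object', 'cars'), ('function', 'work'))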

Note that this only allows a single statement.
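
If you do eventually want a sequence of statements, as rici's comments suggest, one way is to add a left-recursive list rule and make it the start symbol. This is only a sketch on my part (the statements rule name and the start attribute are my additions, not something from the question):

from sly import Parser

class BasicParser(Parser):
    tokens = BasicLexer.tokens
    start = 'statements'  # parse a list of statements rather than a single one

    # Left-recursive list: a lone statement, or a list followed by one more.
    @_('statements statement')
    def statements(self, p):
        return p.statements + [p.statement]

    @_('statement')
    def statements(self, p):
        return [p.statement]

    @_('object DOT function')
    def statement(self, p):
        return ('statement', p.object, p.function)

    @_('OBJECT LPAREN SCONST RPAREN')
    def object(self, p):
        return ('object', p.SCONST)

    @_('FUNCTION LPAREN SCONST RPAREN')
    def function(self, p):
        return ('function', p.SCONST)

Statements here are simply juxtaposed; if you want an explicit separator (a newline, a semicolon), you would add a token for it and put it between statements and statement in the first rule. You would also want the lexer to ignore whitespace, as mentioned above.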
