How to prevent lark from recognizing parts of an identifier as a keyword?

Question

I've been experimenting with lark and I came across a little problem. Suppose I have the following grammar.

from lark import Lark

parser = Lark(r'''
    ?start: value 
            | start "or" value -> or
    ?value: DIGIT -> digit 
            | ID -> id

    DIGIT: /[1-9]\d*/

    %import common.CNAME -> ID

    %import common.WS
    %ignore WS
    ''', parser='lalr')

Let's say I want to parse 1orfoo:

print(parser.parse("1orfoo").pretty())

I would expect lark to see it as the digit 1 followed by the identifier orfoo (thus throwing an error because the grammar does not accept this kind of expression).

However, the parser runs without error and outputs this:

or
  digit 1
  id    foo

As you can see, lark splits the identifier and sees the expression as an or statement.

Why is this happening? Am I missing something? How can I prevent this kind of behavior?

Thank you in advance.

f9c69e9781fa194211448473495534 · Accepted Answer · 2020-05-20T20:55:38.813

Lark can use different lexers to structure the input text into tokens. The default is "auto", which chooses a lexer based on the parser. For LALR, the "contextual" lexer is chosen (reference). The contextual lexer uses the LALR look-ahead to discard token choices that do not fit with the grammar (reference):

The contextual lexer communicates with the parser, and uses the parser's lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the subgroup of terminals that are legal at that parser state, instead of all of the terminals. It’s surprisingly effective at resolving common terminal collisions, and allows to parse languages that LALR(1) was previously incapable of parsing.

In your code, since you use the lalr parser, the contextual lexer is used. The lexer first creates a DIGIT token for 1. Next, the lexer has to decide whether to create a token for the or literal or an ID token. After a complete value, the parser only accepts the or literal or the end of input, so an ID token is not legal in that state. The lexer therefore eliminates the ID choice, tokenizes or, and then lexes the remaining foo as an ID.

To change this behavior, you can select the standard lexer instead (in Lark 1.0 and later this lexer is named "basic", so pass lexer='basic' there):

parser = Lark('''...''', parser='lalr', lexer='standard')

On your example, it will generate:

lark.exceptions.UnexpectedToken: Unexpected token Token(ID, 'orfoo') at line 1, column 2.
Expected one of: 
    * OR
    * $END
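
To make the difference between the two lexers concrete, here is a small pure-Python sketch of the idea. It is a toy model, not Lark's actual implementation: the names TERMINALS, KEYWORDS, ACCEPTED_AFTER and tokenize are made up for illustration, the table of accepted terminals is written by hand instead of being derived from LALR states, and "longest match, keyword wins a tie" is a simplification of how Lark resolves terminal collisions.

```python
import re

# Toy terminals mirroring the grammar in the question (illustrative only).
TERMINALS = {
    "DIGIT": re.compile(r"[1-9]\d*"),
    "OR": re.compile(r"or"),
    "ID": re.compile(r"[a-zA-Z_]\w*"),
}
KEYWORDS = {"OR"}

# Hand-written stand-in for the parser's lookahead: which terminals are
# legal after the previously emitted token (None means start of input).
ACCEPTED_AFTER = {
    None: {"DIGIT", "ID"},
    "DIGIT": {"OR"},
    "ID": {"OR"},
    "OR": {"DIGIT", "ID"},
}


def tokenize(text, contextual):
    """Return (terminal, value) pairs; this toy only lexes, it does not parse."""
    tokens, pos, prev = [], 0, None
    while pos < len(text):
        if text[pos].isspace():  # plays the role of %ignore WS
            pos += 1
            continue
        # Contextual mode only tries terminals that are legal right now;
        # the other mode always tries every terminal.
        allowed = ACCEPTED_AFTER[prev] if contextual else set(TERMINALS)
        candidates = []
        for name in sorted(allowed):
            match = TERMINALS[name].match(text, pos)
            if match:
                # Longest match wins; on a tie the keyword is preferred.
                candidates.append((len(match.group()), name in KEYWORDS, name))
        if not candidates:
            raise ValueError(f"no terminal matches at position {pos}")
        length, _, name = max(candidates)
        tokens.append((name, text[pos:pos + length]))
        pos += length
        prev = name
    return tokens


print(tokenize("1orfoo", contextual=True))
print(tokenize("1orfoo", contextual=False))
```

In contextual mode the ID terminal is never tried after the DIGIT, so or is split off from orfoo. With all terminals in play, ID consumes the whole of orfoo, and it is this ID token that the real parser then rejects with the UnexpectedToken error shown above.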
