1

Question

TLDR: I want to match anything but /.+?/ doesnt' seem to work, why?


I have the following super simple grammar and code:

from lark import Lark, Tree

parser: Lark = Lark(r"""
    rterm: "(___hole 0" anything  ")"
    
    anything: /.+?/
    
    %import common.ESCAPED_STRING 
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS

    """, start='rterm')

test_strings: list[str] = ["(___hole 0 (fun n : nat => ___hole 1 (___hole 2 eq_refl : 0 + n = n)))"]

for test_string in test_strings:
    print(f'{test_string=}')
    tree: Tree = parser.parse(test_string)
    print(tree.pretty())

when I try to parse the only test string I have it gives me an error:

Traceback (most recent call last):
  File "/Users/brandomiranda/miniconda/envs/iit_term_synthesis/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-18-352bf581b4ee>", line 19, in <cell line: 17>
    tree: Tree = parser.parse(test_string)
  File "/Users/brandomiranda/miniconda/envs/iit_term_synthesis/lib/python3.9/site-packages/lark/lark.py", line 581, in parse
    return self.parser.parse(text, start=start, on_error=on_error)
  File "/Users/brandomiranda/miniconda/envs/iit_term_synthesis/lib/python3.9/site-packages/lark/parser_frontends.py", line 106, in parse
    return self.parser.parse(stream, chosen_start, **kw)
  File "/Users/brandomiranda/miniconda/envs/iit_term_synthesis/lib/python3.9/site-packages/lark/parsers/earley.py", line 297, in parse
    to_scan = self._parse(lexer, columns, to_scan, start_symbol)
  File "/Users/brandomiranda/miniconda/envs/iit_term_synthesis/lib/python3.9/site-packages/lark/parsers/xearley.py", line 144, in _parse
    to_scan = scan(i, to_scan)
  File "/Users/brandomiranda/miniconda/envs/iit_term_synthesis/lib/python3.9/site-packages/lark/parsers/xearley.py", line 118, in scan
    raise UnexpectedCharacters(stream, i, text_line, text_column, {item.expect.name for item in to_scan},
lark.exceptions.UnexpectedCharacters: No terminal matches 'f' in the current parser context, at line 1 col 13
(___hole 0 (fun n : nat => ___hole 1 (___hole 2 eq_r
            ^
Expected one of: 
    * RPAR

focus on the last line:

lark.exceptions.UnexpectedCharacters: No terminal matches 'f' in the current parser context, at line 1 col 13
(___hole 0 (fun n : nat => ___hole 1 (___hole 2 eq_r
            ^
Expected one of: 
    * RPAR

which surprises me because I would have expected .+? to match any characgter but it claims that it can't match the f. Does anyone know why?


Research

I've search and saw these two relevant questions but their contents didn't help:

(nearly) cross posted here: https://github.com/lark-parser/lark/discussions/1163

Charlie Parker
  • 5,884
  • 57
  • 198
  • 323
  • this seems useful: https://stackoverflow.com/questions/61372172/lark-grammar-how-does-the-escaped-string-regex-work but not answered my question fully yet. – Charlie Parker Jul 07 '22 at 17:08

1 Answers1

1

This is happening because you are using a non-greedy regex operator, +?, which means Lark will match only one character. You should be using +, which is greedy.

But that's not enough, because now there's another problem. Our greedy operator is also catching the last ")", which we need for the last token.

There are 2 ways forward:

  1. Use the parser to match the brackets. But that means you'll have to write the grammar for everything inside your "anything".

  2. Use the "dynamic_complete" lexer. It's a lot easier, but will run slower. See here for an explanation: https://lark-parser.readthedocs.io/en/latest/parsers.html#earley

Here's the new code with the second approach:

from lark import Lark, Tree

parser: Lark = Lark(r"""
    rterm: "(___hole 0" ANYTHING  ")"
    
    ANYTHING: /.+/
    
    %import common.ESCAPED_STRING 
    %import common.SIGNED_NUMBER
    %import common.WS
    %ignore WS

    """, start='rterm', lexer="dynamic_complete")
Charlie Parker
  • 5,884
  • 57
  • 198
  • 323
Erez
  • 1,287
  • 12
  • 18
  • does `+?` regex have any special meaning in lark or is it a regular regex thing? – Charlie Parker Jul 07 '22 at 19:27
  • Lark delegates all regex behavior to Python, so no. – Erez Jul 07 '22 at 19:43
  • Apologies for not being smart enough to understand the link you shared (and I thank you for the link). But how exactly does `dynamic_complete` work differently? I assume the crux of my answer in this sentence: `Setting lexer="dynamic_complete" instructs the lexer to consider every possible regexp match. This ensures that the parser will consider and resolve every ambiguity, even inside the terminals themselves.` which I do not understand. – Charlie Parker Jul 07 '22 at 19:52
  • I think I understand that the ambiguity is that there is a match with the terminal `/.+/` and with the text inside the `anything` and the concatenation with the open paren `)`. But how does it know how to resolve the ambiguity? – Charlie Parker Jul 07 '22 at 19:53
  • Technically speaking, in this specific situation there is no ambiguity - the last character, `)`, has to match the last token, otherwise there is no valid parse. More complex expressions might give rise to ambiguity, in which case the best derivation is chosen based on priority. Or, if there is none, chosen based on the default heuristics, like minimizing the number of tokens. – Erez Jul 07 '22 at 19:59