3

The following is an EBNF-format (mostly - the actual syntax is documented here) grammar that I am attempting to generate a parser for:

expr = lambda_expr_list $;

lambda_expr_list = [ lambda_expr_list "," ] lambda_expr;

lambda_expr = conditional_expr [ "->" lambda_expr ];

conditional_expr = boolean_or_expr [ "if" conditional_expr "else" conditional_expr ];

boolean_or_expr = [ boolean_or_expr "or" ] boolean_xor_expr;

boolean_xor_expr = [ boolean_xor_expr "xor" ] boolean_and_expr;

boolean_and_expr = [ boolean_and_expr "and" ] boolean_not_expr;

boolean_not_expr = [ "not" ] relation;

relation = [ relation ( "=="
                      | "!="
                      | ">"
                      | "<="
                      | "<"
                      | ">="
                      | [ "not" ] "in"
                      | "is" [ "not" ] ) ] bitwise_or_expr;

bitwise_or_expr = [ bitwise_or_expr "|" ] bitwise_xor_expr;

bitwise_xor_expr = [ bitwise_xor_expr "^" ] bitwise_and_expr;

bitwise_and_expr = [ bitwise_and_expr "&" ] bitwise_shift_expr;

bitwise_shift_expr = [ bitwise_shift_expr ( "<<"
                                          | ">>" ) ] subtraction_expr;

subtraction_expr = [ subtraction_expr "-" ] addition_expr;

addition_expr = [ addition_expr "+" ] division_expr;

division_expr = [ division_expr ( "/"
                                | "\\" ) ] multiplication_expr;

multiplication_expr = [ multiplication_expr ( "*"
                                            | "%" ) ] negative_expr;

negative_expr = [ "-" ] positive_expr;

positive_expr = [ "+" ] bitwise_not_expr;

bitwise_not_expr = [ "~" ] power_expr;

power_expr = slice_expr [ "**" power_expr ];

slice_expr = member_access_expr { subscript };

subscript = "[" slice_defn_list "]";

slice_defn_list = [ slice_defn_list "," ] slice_defn;

slice_defn = lambda_expr
           | [ lambda_expr ] ":" [ [ lambda_expr ] ":" [ lambda_expr ] ];

member_access_expr = [ member_access_expr "." ] function_call_expr;

function_call_expr = atom { parameter_list };

parameter_list = "(" [ lambda_expr_list ] ")";

atom = identifier
     | scalar_literal
     | nary_literal;

identifier = /[_A-Za-z][_A-Za-z0-9]*/;

scalar_literal = float_literal
               | integer_literal
               | boolean_literal;

float_literal = point_float_literal
              | exponent_float_literal;

point_float_literal = /[0-9]+?\.[0-9]+|[0-9]+\./;

exponent_float_literal = /([0-9]+|[0-9]+?\.[0-9]+|[0-9]+\.)[eE][+-]?[0-9]+/;

integer_literal = dec_integer_literal
                | oct_integer_literal
                | hex_integer_literal
                | bin_integer_literal;

dec_integer_literal = /[1-9][0-9]*|0+/;

oct_integer_literal = /0[oO][0-7]+/;

hex_integer_literal = /0[xX][0-9a-fA-F]+/;

bin_integer_literal = /0[bB][01]+/;

boolean_literal = "true"
                | "false";

nary_literal = tuple_literal
             | list_literal
             | dict_literal
             | string_literal
             | byte_string_literal;

tuple_literal = "(" [ lambda_expr_list ] ")";

list_literal = "[" [ ( lambda_expr_list
                     | list_comprehension ) ] "]";

list_comprehension = lambda_expr "for" lambda_expr_list "in" lambda_expr [ "if" lambda_expr ];

dict_literal = "{" [ ( dict_element_list
                     | dict_comprehension ) ] "}";

dict_element_list = [ dict_element_list "," ] dict_element;

dict_element = lambda_expr ":" lambda_expr;

dict_comprehension = dict_element "for" lambda_expr_list "in" lambda_expr [ "if" lambda_expr ];

string_literal = /[uU]?[rR]?(\u0027(\\.|[^\\\r\n\u0027])*\u0027|\u0022(\\.|[^\\\r\n\u0022])*\u0022)/;

byte_string_literal = /[bB][rR]?(\u0027(\\[\u0000-\u007F]|[\u0000-\u0009\u000B-\u000C\u000E-\u0026\u0028-\u005B\u005D-\u007F])*\u0027|\u0022(\\[\u0000-\u007F]|[\u0000-\u0009\u000B-\u000C\u000E-\u0021\u0023-\u005B\u005D-\u007F])*\u0022)/;

The tool I'm using to generate the parser is Grako, which generates a modified Packrat parser that claims to support both direct and indirect left recursion.

When I run the generated parser on this string:

input.filter(e -> e[0] in ['t', 'T']).map(e -> (e.len().str(), e)).map(e -> '(Line length: ' + e[0] + ') ' + e[1]).list()

I get the following error:

grako.exceptions.FailedParse: (1:13) Expecting end of text. :
input.filter(e -> e[0] in ['t', 'T']).map(e -> (e.len().str(), e)).map(e -> '(Line length: ' + e[0] + ') ' + e[1]).list()
            ^
expr

Debugging has shown that the parser seems to get to the end of the first e[0], then never backtracks to/reaches a point where it will try to match the in token.

Is there some issue with my grammar such that a left recursion-supporting Packrat parser would fail on it? Or should I file an issue on the Grako issue tracker?

1 Answers1

4

It may be a bug in the grammar, but the error message is not telling you where it actually happens. What I always do after finishing a grammar is to embed cut (~) elements throughout it (after keywords like if, operators, opening parenthesis, everywhere it seems reasonable).

The cut element makes the Grako-generated parser commit to the option taken in the closest choice in the parse tree. That way, instead of having the parser fail at the start on an if, it will report failure at the expression it actually couldn't parse.

Some bugs in grammars are difficult to spot, and for that I just go through the parse trace to find out how far in the input the parser went, and why it decided it couldn't go further.

I will not use left-recursion on a PEG parser for professional work, though it may be fine for simpler, academic work.

boolean_or_expr = boolean_xor_expr {"or" boolean_xor_expr};

The associativity can then be handled in a semantic action.

Also see the discussion under issue 49 against Grako. It says that the algorithm used to support left recursion will not always produce the expected associativity in the resulting AST.

Apalala
  • 9,017
  • 3
  • 30
  • 48
  • 1
    Thanks for the tip regarding the cut elements. I suspect that will probably speed parsing up a bit as well, because right now it's pretty slow. – Trey Brisbane Mar 17 '15 at 02:25
  • 1
    That said, it's looking like I'm probably after a bottom-up parser if I want to keep the grammatical rules left-recursive. Regarding your point about the associativity - I can't help but wonder what the point is of having Packrat support left-recursion if the associativity of the rules is not preserved. Wouldn't it be simpler to just say "You need a bottom-up parser for this"? – Trey Brisbane Mar 17 '15 at 02:30
  • 1
    @Trey Left recursion makes it easier to write grammars for some languages. The lack of guarantees regarding associativity are not designed in, but a consequence of the algorithm used, and something the subject of research. Also note that the associativity you want will likely be preserved for your particular grammar. – Apalala Mar 17 '15 at 11:39
  • 1
    Apologies for the delay on this; I've been busy with Uni. Thanks for the clarification. Good to know. :) I've refactored the grammar, adding the cut elements as you suggested (the new grammar is available at http://pastebin.com/eykU3V9K). – Trey Brisbane Mar 24 '15 at 12:15
  • Given the same input string, the problem seems to be occurring within the list literal just after the first 'in' keyword. The generated parser will parse the first element (the 't'), then the comma successfully. In an attempt to parse the second element (the 'T'), the parser descends down to the power_expr rule, where it invokes the first slice_expr rule and successfully parses the 'T'. It then fails to parse the '**' token, and backtracks to take power_expr's second option, invoking the slice_expr rule a second time. – Trey Brisbane Mar 24 '15 at 12:15
  • The issue then manifests when the slice_expr rule attempts to invoke the member_access_expr rule, because for some reason there is already a FailedLeftRecursion entry in the memoization cache for the member_access_expr rule, preventing it from being invoked. Can you shed any light on this? – Trey Brisbane Mar 24 '15 at 12:15
  • Additionally, I've noticed that changing `power_expr = slice_expr "**" ~ power_expr | slice_expr;` to `power_expr = slice_expr [ "**" ~ power_expr ];` causes the parser to take an age to run (possibly due to catastrophic backtracking?). I would have thought the two were equivalent, since in the first form if the parser failed to match `"**"`, it would backtrack everything it's seen since the start of the `power_expr`, then try the second alternative? Am I missing something fundamental here? – Trey Brisbane Mar 25 '15 at 00:13
  • 1
    @Tray could you post this as an issue in [Bitbucket](https://bitbucket.org/apalala/grako). I'll take a look on the next pass through Grako. – Apalala Apr 05 '15 at 13:57
  • @TreyBrisbane Did you post it there? You have a link? – Robert Siemer Apr 11 '16 at 22:30
  • The link is where it says "Bitbucket". https://bitbucket.org/apalala/grako/issues – Apalala Apr 12 '16 at 01:58