Parsing of optionals with PEG (Grako) falling short?

Question

My colleague PaulS asked me the following:

I'm writing a parser for an existing language (SystemVerilog - an IEEE standard), and the specification has a rule in it that is similar in structure to this:

cover_point 
    = 
    [[data_type] identifier ':' ] 'coverpoint' identifier ';' 
    ;

data_type 
    = 
    'int' | 'float' | identifier 
    ;

identifier 
    = 
    ?/\w+/? 
    ;

The problem is that when parsing the following legal string:

anIdentifier: coverpoint another_identifier;

anIdentifier matches data_type (via its identifier option), so Grako then looks for another identifier after it, finds ':' instead, and fails. It doesn't go back and try to parse without the data_type part.

I can re-write the rule as follows,

cover_point_rewrite  
    = 
    [data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';' 
    ;

but I wonder whether this is intentional, and whether there's a better syntax.

Is this a PEG-in-general issue, or a tool (Grako) one?

My own take on it is that, yeah, one must tweak grammars to force PEG parsers to choose the longest possible option first. — Apalala, Jul 06 '14 at 22:06

score 2 · Answer 1 · edited Sep 19 '16 at 00:47

In PEGs the choice operator is ordered: the parser takes the first alternative that matches, which is how PEGs avoid the ambiguities of CFGs.

In your first example

[data_type]

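succeeds by consuming the first identifier, so the rule then fails when it finds ':' instead of another identifier. That is because [data_type] behaves like (data_type | ε): data_type matches the first identifier, and once that alternative has succeeded the parser never comes back to try the empty one.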
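The behaviour described above can be reproduced with a few hand-rolled PEG combinators. This is only a minimal sketch in Python, not Grako's actual implementation: the helper names (token, lit, seq, choice, opt) are made up for illustration, and a parser is assumed to be a function that returns the new position, or None on failure.

```python
import re

WS = re.compile(r"\s*")

def token(pattern):
    """Match a regex after skipping leading whitespace."""
    rx = re.compile(pattern)
    def parse(text, pos):
        pos = WS.match(text, pos).end()
        m = rx.match(text, pos)
        return m.end() if m else None
    return parse

def lit(s):
    """Match a literal string (hypothetical helper, not a Grako API)."""
    return token(re.escape(s))

def seq(*parsers):
    """Sequence: every part must match, one after the other."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    """PEG ordered choice: commit to the first alternative that succeeds."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def opt(p):
    """[e] behaves like (e | empty): empty is tried only if e fails."""
    return choice(p, lambda text, pos: pos)

identifier = token(r"\w+")
data_type = choice(lit("int"), lit("float"), identifier)

# cover_point = [[data_type] identifier ':'] 'coverpoint' identifier ';'
cover_point = seq(
    opt(seq(opt(data_type), identifier, lit(":"))),
    lit("coverpoint"), identifier, lit(";"),
)

text = "anIdentifier: coverpoint another_identifier;"
# opt(data_type) consumes 'anIdentifier', identifier then sees ':' and fails;
# the outer optional falls back to empty and 'coverpoint' fails at position 0.
print(cover_point(text, 0))
```

In this sketch the inner optional never revisits its decision: once data_type has matched, the failure of the following identifier only propagates outward.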

In

[data_type identifier ':' | identifier ':' ]

the first choice fails when there is no second identifier, so the parser backtracks to where the choice started and tries the second choice.
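The backtracking step can be made explicit in a small hand-written recognizer for the rewritten rule. Again, this is a sketch under assumptions: the tokenizer and the function names (tokenize, match_label, cover_point_rewrite) are hypothetical and only mimic what a PEG parser does; they are not generated by Grako.

```python
import re

TOKEN = re.compile(r"\s*(\w+|\S)")

def tokenize(text):
    """Split the input into word tokens and single punctuation characters."""
    return TOKEN.findall(text)

def is_identifier(tok):
    # 'int' and 'float' also match \w+, so this covers data_type as well.
    return re.fullmatch(r"\w+", tok) is not None

def match_label(toks, pos):
    """[data_type identifier ':' | identifier ':'], returns the new position."""
    # Alternative 1: data_type identifier ':'
    first = toks[pos:pos + 3]
    if (len(first) == 3 and is_identifier(first[0])
            and is_identifier(first[1]) and first[2] == ":"):
        return pos + 3
    # Alternative 1 failed as a whole: restart from the saved position `pos`.
    # Alternative 2: identifier ':'
    second = toks[pos:pos + 2]
    if len(second) == 2 and is_identifier(second[0]) and second[1] == ":":
        return pos + 2
    # Neither alternative matched: the optional group matches nothing.
    return pos

def cover_point_rewrite(text):
    """Recognize: [label] 'coverpoint' identifier ';'"""
    toks = tokenize(text)
    rest = toks[match_label(toks, 0):]
    return (len(rest) == 3 and rest[0] == "coverpoint"
            and is_identifier(rest[1]) and rest[2] == ";")

print(cover_point_rewrite("anIdentifier: coverpoint another_identifier;"))
print(cover_point_rewrite("int x: coverpoint y;"))
```

The point of the sketch is that the failing alternative is a whole sequence, so its failure is visible to the choice, which then retries from the same starting position.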
