
My colleague PaulS asked me the following:


I'm writing a parser for an existing language (SystemVerilog - an IEEE standard), and the specification has a rule in it that is similar in structure to this:

cover_point 
    = 
    [[data_type] identifier ':' ] 'coverpoint' identifier ';' 
    ;

data_type 
    = 
    'int' | 'float' | identifier 
    ;

identifier 
    = 
    ?/\w+/? 
    ;

The problem is that when parsing the following legal string:

anIdentifier: coverpoint another_identifier;

anIdentifier successfully matches data_type (via its identifier option), so Grako then looks for another identifier after it and fails. It doesn't go back and try to parse without the data_type part.

I can re-write the rule as follows,

cover_point_rewrite  
    = 
    [data_type identifier ':' | identifier ':' ] 'coverpoint' identifier ';' 
    ;

but I wonder:

  1. whether this is intentional, and
  2. whether there's a better syntax.

Is this a PEG-in-general issue, or a tool (Grako) one?

Apalala
    My own take on it is that, yeah, one must tweak grammars to force PEG parsers to choose the longest possible option first. – Apalala Jul 06 '14 at 22:06

1 Answer


In PEGs the choice operator is ordered: alternatives are tried in sequence and the first one that matches is used, which avoids the ambiguities of CFGs.

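In your first example

[data_type]
succeeds by consuming the first identifier as a data_type, so the enclosing sequence then fails when it finds ':' where it expects another identifier. That is because [data_type] behaves like (data_type | ε): the ε alternative is tried only if data_type fails, so the first identifier is always parsed as a data_type.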

In

[data_type identifier ':' | identifier ':' ]
the first choice fails when there is no second identifier, so the parser backtracks to the start of the option and tries the second choice.
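To make the difference concrete, here is a minimal sketch of PEG-style combinators in Python. It is not how Grako is implemented; the helper names (tok, seq, choice, opt) and the variables original, rewrite and text are made up for this illustration, and keywords are matched with a trailing word boundary as a simplifying assumption.

```python
import re

# A parser is a function (text, pos) -> new position, or None on failure.
# All helper names below are hypothetical, not part of Grako.

def tok(pattern):
    """Match a regex after skipping leading whitespace."""
    rx = re.compile(r"\s*(?:" + pattern + r")")
    def parse(text, pos):
        m = rx.match(text, pos)
        return m.end() if m else None
    return parse

def seq(*parsers):
    """Match every parser in order; fail as a whole if any one fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def choice(*parsers):
    """PEG ordered choice: commit to the first alternative that succeeds."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse

def opt(parser):
    """PEG optional, i.e. (parser | empty): empty is used only if parser fails."""
    def parse(text, pos):
        end = parser(text, pos)
        return pos if end is None else end
    return parse

identifier = tok(r"\w+")
data_type = choice(tok(r"int\b"), tok(r"float\b"), identifier)
coverpoint = tok(r"coverpoint\b")

# [[data_type] identifier ':'] 'coverpoint' identifier ';'
original = seq(opt(seq(opt(data_type), identifier, tok(":"))),
               coverpoint, identifier, tok(";"))

# [data_type identifier ':' | identifier ':'] 'coverpoint' identifier ';'
rewrite = seq(opt(choice(seq(data_type, identifier, tok(":")),
                         seq(identifier, tok(":")))),
              coverpoint, identifier, tok(";"))

text = "anIdentifier: coverpoint another_identifier;"
print(original(text, 0))              # None: the parse fails
print(rewrite(text, 0) == len(text))  # True: the whole input is consumed
```

In the original rule the inner opt(data_type) has already succeeded and returned by the time the following identifier fails, so nothing ever goes back to try the empty alternative. In the rewrite the failure happens inside the first alternative of choice, which is still active and moves on to the second one.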
jalanb
1010