parse code blocks starting and ending with specific keywords

Question

Following this question, I want to parse blocks of code that start with specific keywords (e.g., <firstKeyword>, <secondKeyword>, <thirdKeyword>, ...) and end with the keyword End. In between, statements should end with either a semicolon or a new line. What I have done so far can be seen in this repository, but in short:

grammar <garammarName>;

// Parser Rules

statement: EndOfStatment;

statement_list: statement+;

section:
    '<firstKeyword>' statement_list End
    | '<secondKeyword>' statement_list End
    | '<thirdKeyword>' statement_list End;

sections: section+ EOF;

// Lexer Rules

End: 'End';

NewLine: ('\r'? '\n' | '\n' | '\r') -> skip;

WhiteSpace: [ \t\r\n]+ -> skip;

EndOfStatment: ';' | NewLine;

However, the issue is that the TestRig / grun tool (instructions) doesn't report an error when a code block is not terminated with the End keyword. For example, the example file <exampleFile>:

<firstKeyword>
End

<secondKeyword>

<thirdKeyword>
End

doesn't return any error with

grun <garammarName> sections -tree < <exampleFile>

I would appreciate it if you could help me understand the problem and how to solve it.

I am not familiar with ANTLR's grammar notation, but isn't it possible that the pipe symbol | applies only to the adjacent symbols and not to whole sequences? Maybe you need some kind of brackets, like this: ('' statement_list End) | (...) | (...) — Sara Bean, Apr 06 '21 at 20:44

score 1 · Accepted Answer · answered Apr 06 '21 at 20:48

When I run input similar to what you've given here, I get:

➜ grun ElmerSolver sections -tree  < examples/ex001.sif
line 6:0 missing 'End' at 'Equation'
(sections (section Simulation statement_list End) (section Constants statement_list <missing 'End'>) (section Equation 1 statement_list End) <EOF>)

There's specifically an error for a missing 'End' on line 6. (line 6:0 missing 'End' at 'Equation')

ANTLR error recovery does supply the missing 'End' to recover and continue parsing, but it calls out the error.

For reference, this is the full grammar I'm using:

grammar ElmerSolver;

// Parser Rules

// eostmt: ';' | CR;

statement: EndOfStatment;

statement_list: statement*;

sections: section+ EOF;
// section: SectionName /* statement_list */ End;

// Lexer Rules

fragment DIGIT: [0-9];
Integer: DIGIT+;

Float:
    [+-]? (DIGIT+ ([.]DIGIT*)? | [.]DIGIT+) ([Ee][+-]? DIGIT+)?;

section:
    'Header' statement_list End                         # headerSection
    | 'Simulation' statement_list End                   # simulatorSection
    | 'Constants' statement_list End                    # constantsSection
    | 'Body' Integer statement_list End                 # bodySection
    | 'Material' Integer statement_list End             # materialSection
    | 'Body Force' Integer statement_list End           # bodyForceSection
    | 'Equation' Integer statement_list End             # equationSection
    | 'Solver' Integer statement_list End               # solverSection
    | 'Boundary Condition' Integer statement_list End   # boundaryConditionSection
    | 'Initial Condition' Integer statement_list End    # initialConditionSection
    | 'Component' Integer statement_list End            # componentSection;

End: 'End';

// statementEnd: ';' NewLine*;

NewLine: ('\r'? '\n' | '\n' | '\r') -> skip;

LineJoining:
    '\\' WhiteSpace? ('\r'? '\n' | '\r' | '\f') -> skip;

WhiteSpace: [ \t\r\n]+ -> skip;

LineComment: '#' ~( '\r' | '\n')* -> skip;

EndOfStatment: ';' | NewLine;

(I made your change to the EndOfStatment lexer rule.)

And this was the input file I used:

Simulation
End

Constants 

Equation 1
End

Here's the graphical view I get with the -gui grun option;

Re: your change to the EndOfStatment rule.

EndOfStatment should probably be a parser rule (lowercase).

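Also, as your grammar stands, '\n' will always be recognized as a NewLine token, whose -> skip action leaves it out of the token stream. When several lexer rules match the same text, ANTLR picks the rule defined first, and NewLine comes before EndOfStatment, so a line break never becomes an EndOfStatment token.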

Run grun with the -tokens option and you'll see no EndOfStatment tokens (unless you put a ';' in your source file):

➜ grun ElmerSolver sections -tree -tokens < examples/ex001.sif
[@0,0:9='Simulation',<'Simulation'>,1:0]
[@1,11:13='End',<'End'>,2:0]
[@2,16:24='Constants',<'Constants'>,4:0]
[@3,28:35='Equation',<'Equation'>,6:0]
[@4,37:37='1',<Integer>,6:9]
[@5,39:41='End',<'End'>,7:0]
[@6,42:41='<EOF>',<EOF>,7:3]
line 6:0 missing 'End' at 'Equation'
(sections (section Simulation statement_list End) (section Constants statement_list <missing 'End'>) (section Equation 1 statement_list End) <EOF>)

If you want the NewLine to be syntactically significant (i.e. you can use it in your grammar), you'll need to remove the -> skip.

However, once you do that, you'll have to be specific about all the places where a NewLine is valid. I see your LineJoining token, so it looks like this is supposed to have a bit of a Python feel, and that might be what you're going for (the same comment about -> skip applies to that token). If you're going down the "Python-like" route, understand that Python's EOL and indentation handling is notoriously hard for parsers ("The Definitive ANTLR 4 Reference" has a section dedicated to what has to be done to handle it). You could also reference the Python grammar at ANTLR Python grammar.
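
As a rough illustration of the two suggestions above (a lowercase parser rule for the statement terminator, and a NewLine token without -> skip), a cut-down grammar might look like the sketch below. The grammar name SectionsSketch and the rule name endOfStatement are made up for this example, only two of the section keywords are kept, and the sections rule is changed to allow blank lines between sections. None of this comes from the original repository, so treat it as a starting point rather than a finished grammar.

```antlr
grammar SectionsSketch; // hypothetical name, for illustration only

// Parser rules

// Blank lines may appear before, between, and after sections.
sections: NewLine* (section NewLine*)+ EOF;

section:
    'Simulation' statement_list End
    | 'Constants' statement_list End;

statement_list: statement*;

// Still an "empty" statement, as in the original grammar.
statement: endOfStatement;

// Parser rule (lowercase): a terminator is ';' or a line break.
endOfStatement: ';' | NewLine;

// Lexer rules

End: 'End';

// No "-> skip", so line breaks reach the parser as tokens.
NewLine: '\r'? '\n' | '\r';

// Spaces and tabs only; including \r or \n here would let this
// rule win on runs of line breaks and skip them again.
WhiteSpace: [ \t]+ -> skip;
```

With a grammar along these lines, the -tokens output should list NewLine tokens, and every parser rule has to state explicitly where a line break is allowed, which is the trade-off described above.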
