I am working on an ANTLR4 grammar for parsing Python DSL scripts (basically a subset of Python), with Python 3 as the target language. I am having difficulty handling line feeds.
In my grammar, the lexer::members and NEWLINE embedded code is based on Bart Kiers's Python3 grammar for ANTLR4, ported to Python so that it can be used with the Python 3 runtime for ANTLR instead of Java. My grammar differs from the one provided by Bart (which is almost the same as the one used in the Python 3 spec), since my DSL only needs to target certain elements of Python. Based on extensive testing, I do not think the Python part of the grammar is itself the source of the problem, so I won't post it here in full for now.
The input for the grammar is a file, matched by the file_input rule:
file_input: (NEWLINE | statement)* EOF;
The grammar performs rather well on my DSL and produces correct ASTs. The only problem is that my lexer rule NEWLINE clutters the AST with \r\n nodes, which proves troublesome when I try to extend the generated MyGrammarListener with my own ExtendedListener, which inherits from it.
Here is my NEWLINE lexer rule:
NEWLINE
 : ( {self.at_start_of_input()}? SPACES
   | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
   )
   {
import re
from MyParser import MyParser
new_line = re.sub(r"[^\r\n\f]+", "", self._interp.getText(self._input))
spaces = re.sub(r"[\r\n\f]+", "", self._interp.getText(self._input))
next = self._input.LA(1)
if self.opened > 0 or next == '\r' or next == '\n' or next == '\f' or next == '#':
    self.skip()
else:
    self.emit_token(self.common_token(self.NEWLINE, new_line))
    indent = self.get_indentation_count(spaces)
    if len(self.indents) == 0:
        previous = 0
    else:
        previous = self.indents[-1]
    if indent == previous:
        self.skip()
    elif indent > previous:
        self.indents.append(indent)
        self.emit_token(self.common_token(MyParser.INDENT, spaces))
    else:
        while len(self.indents) > 0 and self.indents[-1] > indent:
            self.emit_token(self.create_dedent())
            del self.indents[-1]
   }
 ;
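As a rough standalone illustration of the INDENT/DEDENT bookkeeping this action is meant to perform (not the lexer code itself), here is a minimal sketch in plain Python. The helper get_indentation_count is not shown in this post, so the version below is an assumption: it advances tabs to the next multiple of 8. The function layout_tokens and the string token labels are hypothetical names used only for this example.

```python
def get_indentation_count(spaces: str) -> int:
    """Assumed behaviour: width of the leading whitespace, where a tab
    advances to the next multiple of 8 and any other character counts as 1."""
    count = 0
    for ch in spaces:
        if ch == "\t":
            count += 8 - (count % 8)
        else:
            count += 1
    return count


def layout_tokens(indent_widths):
    """Return the layout tokens produced for a sequence of line breaks.

    indent_widths holds the indentation width that follows each line break
    (blank lines and lines inside brackets are assumed to be filtered out
    already, which is what the skip() branch is supposed to do).
    """
    indents = []   # stack of currently open indentation widths
    tokens = []
    for indent in indent_widths:
        tokens.append("NEWLINE")
        previous = indents[-1] if indents else 0
        if indent > previous:
            # Deeper than before: open one new block.
            indents.append(indent)
            tokens.append("INDENT")
        else:
            # Same or shallower: close every block deeper than this line.
            while indents and indents[-1] > indent:
                tokens.append("DEDENT")
                indents.pop()
    return tokens
```

For example, layout_tokens([4, 4, 0]) describes a line break followed by a 4-wide indent, then a second line at the same depth, then a return to column 0.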
The SPACES lexer rule fragment that NEWLINE uses is here:
fragment SPACES
: [ \t]+
;
I should also add that both SPACES and COMMENT are ultimately skipped by the grammar, but the rule that skips them is declared only after the NEWLINE lexer rule. As far as I know, this should have no adverse effects, but I wanted to include it just in case.
SKIP_
: ( SPACES | COMMENT ) -> skip
;
When the input file has no empty lines between statements, everything runs as it should. However, if there are empty lines in my file (such as between import statements and variable assignments), I get the following errors:
line 15:4 extraneous input '\r\n ' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
line 15:0 extraneous input '\r\n' expecting {<EOF>, 'from', 'import', NEWLINE, NAME}
As I said before, when empty lines are omitted from my input file, the grammar and my ExtendedListener perform as they should, so the problem seems to be that \r\n is not being matched by the NEWLINE lexer rule; even the error message I get says that the input does not match the NEWLINE alternative.
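One detail that may be worth checking, offered as an assumption to verify rather than a confirmed diagnosis: in the ANTLR4 Python 3 runtime, LA(1) on the input stream returns an integer code point (or -1 at end of input), not a one-character string. If so, a comparison such as next == '\r' is always false in Python, whereas the Java original compares an int with a char and works. The sketch below is plain Python and does not call the ANTLR API; is_blank_line_lookahead and the EOF constant are hypothetical names for illustration.

```python
EOF = -1  # assumption: mirrors the runtime's end-of-input marker


def is_blank_line_lookahead(la: int) -> bool:
    """Return True if the look-ahead code point starts a line break or a comment."""
    if la == EOF:
        # chr() would raise ValueError on a negative value, so handle EOF first.
        return False
    return chr(la) in "\r\n\f#"


la = ord("\n")                         # what an integer look-ahead would hold
naive = la == "\n"                     # int compared with str: never equal
fixed = is_blank_line_lookahead(la)    # converts to a character before comparing
```

If the assumption holds, the skip() branch for blank lines would never be taken on look-ahead alone, which would be consistent with extra NEWLINE tokens reaching the parser.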
The AST produced by my grammar looks like this:
I would really appreciate any help with this, since I cannot see why my NEWLINE lexer rule would fail to match \r\n as it should, and I would like to allow empty lines in my DSL.