Antlr4 parser ends prematurely on misplaced token in Python 3.7

Question

I'm having an issue where, if my parser finds a token that it cannot place in any rule, it ends without reporting an error, even though more tokens follow. To be exact, the token itself is recognized (I have a rule that is almost a catch-all), but it is misplaced and no rule can match it at that position. In this case, my parser finishes successfully without reporting any errors (at least none that I can see).

This is the case I'm seeing. The code to parse:

.class public final Ld;
.super Ljava/lang/Object;
.source "java-style lambda group"

# interfaces
.implements Landroid/content/DialogInterface$OnClickListener;
<misplaced-tokens>

# static fields
.field public static final f:Ld;

.field public static final g:Ld;

...

(Note the <misplaced-tokens> line, which is actually five tokens; see below. I expect parsing to fail with an error here.)

Parsed tokens:

[@0,0:5='.class',<'.class'>,1:0]
[@1,7:12='public',<'public'>,1:7]
[@2,14:18='final',<'final'>,1:14]
[@3,20:22='Ld;',<QUALIFIED_TYPE_NAME>,1:20]
[@4,24:29='.super',<'.super'>,2:0]
[@5,31:48='Ljava/lang/Object;',<QUALIFIED_TYPE_NAME>,2:7]
[@6,50:56='.source',<'.source'>,3:0]
[@7,58:82='"java-style lambda group"',<STRING_LITERAL>,3:8]
[@8,85:96='# interfaces',<LINE_COMMENT>,channel=1,5:0]
[@9,98:108='.implements',<'.implements'>,6:0]
[@10,110:158='Landroid/content/DialogInterface$OnClickListener;',<QUALIFIED_TYPE_NAME>,6:12]
[@11,160:160='<',<'<'>,7:0]
[@12,161:169='misplaced',<IDENTIFIER>,7:1]
[@13,170:170='-',<'-'>,7:10]
[@14,171:176='tokens',<IDENTIFIER>,7:11]
[@15,177:177='>',<'>'>,7:17]
[@16,180:194='# static fields',<LINE_COMMENT>,channel=1,9:0]
[@17,196:201='.field',<'.field'>,10:0]
...

Parsing progress:

enter   parse, LT(1)=.class
enter   statement, LT(1)=.class
enter   classDirective, LT(1)=.class
consume [@0,0:5='.class',<30>,1:0] rule classDirective
enter   classModifier, LT(1)=public
consume [@1,7:12='public',<53>,1:7] rule classModifier
exit    classModifier, LT(1)=final
enter   classModifier, LT(1)=final
consume [@2,14:18='final',<56>,1:14] rule classModifier
exit    classModifier, LT(1)=Ld;
enter   className, LT(1)=Ld;
enter   referenceType, LT(1)=Ld;
consume [@3,20:22='Ld;',<1>,1:20] rule referenceType
exit    referenceType, LT(1)=.super
exit    className, LT(1)=.super
exit    classDirective, LT(1)=.super
exit    statement, LT(1)=.super
enter   statement, LT(1)=.super
enter   superDirective, LT(1)=.super
consume [@4,24:29='.super',<33>,2:0] rule superDirective
enter   superName, LT(1)=Ljava/lang/Object;
enter   referenceType, LT(1)=Ljava/lang/Object;
consume [@5,31:48='Ljava/lang/Object;',<1>,2:7] rule referenceType
exit    referenceType, LT(1)=.source
exit    superName, LT(1)=.source
exit    superDirective, LT(1)=.source
exit    statement, LT(1)=.source
enter   statement, LT(1)=.source
enter   sourceDirective, LT(1)=.source
consume [@6,50:56='.source',<32>,3:0] rule sourceDirective
enter   sourceName, LT(1)="java-style lambda group"
enter   stringLiteral, LT(1)="java-style lambda group"
consume [@7,58:82='"java-style lambda group"',<304>,3:8] rule stringLiteral
exit    stringLiteral, LT(1)=.implements
exit    sourceName, LT(1)=.implements
exit    sourceDirective, LT(1)=.implements
exit    statement, LT(1)=.implements
enter   statement, LT(1)=.implements
enter   implementsDirective, LT(1)=.implements
consume [@9,98:108='.implements',<31>,6:0] rule implementsDirective
enter   implementsName, LT(1)=Landroid/content/DialogInterface$OnClickListener;
enter   referenceType, LT(1)=Landroid/content/DialogInterface$OnClickListener;
consume [@10,110:158='Landroid/content/DialogInterface$OnClickListener;',<1>,6:12] rule referenceType
exit    referenceType, LT(1)=<
exit    implementsName, LT(1)=<
exit    implementsDirective, LT(1)=<
exit    statement, LT(1)=<
exit    parse, LT(1)=<

(Observe that parse is the start rule and is exited here, even though many more tokens remain in the stream.)

What I tried:

I tried subclassing the default error strategy and the error listener and added both to the lexer and the parser, just to see whether any breakpoint would be hit. No breakpoint in any of the overridden methods is ever hit (except, occasionally, reportAttemptingFullContext).

This is how I added the overrides:

def parseFile(self, filePath):
    errorListener = MyErrorListener()
    strategy = MyErrorStrategy()
    file = FileStream("file.smali")
    lexer = SmaliLexer(file)
    lexer.removeErrorListeners()
    lexer.addErrorListener(errorListener)
    lexer.addErrorListener(strategy)
    stream = CommonTokenStream(lexer)
    parser = SmaliParser(stream)
    parser.removeErrorListeners()
    parser.addErrorListener(errorListener)
    parser.addErrorListener(strategy)
    tree = parser.parse()
    ...
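
A note on the snippet above: an error strategy is not an error listener, so passing it to addErrorListener does not install it as the recovery strategy, and a lexer has no error strategy at all. Below is a minimal sketch of the wiring, assuming the Python runtime keeps the parser's strategy in the `_errHandler` attribute and uses -1 as the EOF token type (worth checking against your installed version). The helper names `attach_error_handling` and `has_unconsumed_input` are hypothetical, and the functions are duck-typed so the sketch runs without ANTLR installed:

```python
def attach_error_handling(parser, listener, strategy):
    """Install a custom error listener and a custom error strategy on a parser.

    Assumption: `parser` behaves like an antlr4 Python-runtime Parser, i.e. it
    has removeErrorListeners(), addErrorListener() and an `_errHandler` attribute.
    """
    parser.removeErrorListeners()        # drop the default console listener
    parser.addErrorListener(listener)    # listeners only *report* errors
    parser._errHandler = strategy        # the strategy decides how to *recover*
    return parser


def has_unconsumed_input(token_stream, eof_type=-1):
    """Return True if the next token in the stream is not EOF.

    Assumption: `token_stream.LA(1)` returns the type of the next token and
    the EOF token type is -1, as in the antlr4 Python runtime.
    """
    return token_stream.LA(1) != eof_type
```

With a real parser this would be called as attach_error_handling(parser, MyErrorListener(), MyErrorStrategy()) before parser.parse(), and has_unconsumed_input(stream) afterwards. Neither hook fires if the start rule simply returns early, which is consistent with the breakpoints never being hit.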

My setup is as follows:

Windows 10 OS
Python 3.7
Antlr4 v4.8 - antlr-4.8-complete.jar
pip-installed runtime: antlr4_python3_runtime-4.8-py3-none-any.whl

I would really appreciate any help on how to make Antlr4 actually take the overridden listener and strategy into account, so that I can both report the errors for debugging and handle them differently. Thanks!

What does your `parse` rule look like? Does it have an `EOF` at the end? I guess it doesn't, and in that case I suggest you add it. — Bart Kiers, Aug 31 '20 at 19:39
An ANTLR-generated parser tries to recover from errors by default, in a way that you might not like. So grab an ANTLR book to study error recovery, and you should then understand how to move on. The internet might have previous discussions, but it is not easy to find the exact one for you. (I participated in an ANTLR 3 thread https://stackoverflow.com/a/9389785/11182) — Lex Li, Aug 31 '20 at 20:09
Thank you so much for both tips, but @BartKiers's answer actually worked! I could have sworn I had already tried it unsuccessfully, but I retried it anyway. Thinking back, I think I made the stupid mistake of putting an 'or' before the EOF (like `... | EOF`, which of course didn't work). Thanks to both of you! — Andrej Mohar, Aug 31 '20 at 20:23
Cool, glad to hear that solved it. I will add a quick answer in case others stumble upon this Q&A and don't read the comments. — Bart Kiers, Sep 01 '20 at 08:04

score 1 · Accepted Answer · answered Sep 01 '20 at 08:07

Antlr4 parser ends prematurely

This could happen when the rule you invoke, parse in your case, is not "anchored" by the built-in EOF token:

parse
 : expression
 ;

expression
 : expression '+' expression
 | NUMBER
 ;

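In the case above, when the input is 1+2 3, the generated parser will happily parse 1+2 and then return from parse without reporting the trailing 3, because nothing in the start rule requires the input to end there.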

If you want to force the parser to consume all tokens from the input stream, add EOF to your start rule:

parse
 : expression EOF
 ;
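
To make the effect of the EOF anchor concrete, here is a small hand-written analogy in Python. It is not what ANTLR generates; it is a sketch of the same idea, and it replaces the left-recursive expression rule with the equivalent loop NUMBER ('+' NUMBER)*. All function names are made up for illustration:

```python
import re

TOKEN_RE = re.compile(r"\s*(\d+|\+)")


def tokenize(text):
    """Split text into NUMBER and '+' tokens, followed by an 'EOF' marker."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    tokens.append("EOF")
    return tokens


def parse_expression(tokens, pos):
    """expression : NUMBER ('+' NUMBER)* ; returns the index after the match."""
    if not tokens[pos].isdigit():
        raise SyntaxError(f"expected NUMBER, got {tokens[pos]!r}")
    pos += 1
    while tokens[pos] == "+":
        if not tokens[pos + 1].isdigit():
            raise SyntaxError(f"expected NUMBER, got {tokens[pos + 1]!r}")
        pos += 2
    return pos


def parse_unanchored(text):
    """parse : expression ;  returns quietly at the first token it cannot use."""
    tokens = tokenize(text)
    return parse_expression(tokens, 0)  # number of tokens consumed


def parse_anchored(text):
    """parse : expression EOF ;  leftover tokens become an error."""
    tokens = tokenize(text)
    pos = parse_expression(tokens, 0)
    if tokens[pos] != "EOF":
        raise SyntaxError(f"extraneous input {tokens[pos]!r}, expected EOF")
    return pos
```

For the input 1+2 3, parse_unanchored consumes three tokens and returns without complaint, while parse_anchored raises a SyntaxError on the trailing 3. Note that EOF is required after the expression as part of a sequence, not offered as an alternative with |.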
