I think you're using the unbuffered streams correctly, and what you see is the expected, desired result of using those streams. But you may have expectations of them that they aren't obligated to meet.
Below is test code for us to poke with sticks. I'm using System.in for the input, so I modified the grammar to account for the newline characters between the word tokens.
Streaming.g
grammar Streaming;
fox : 'quick' NL 'brown' NL 'fox' NL DONE NL;
DONE : 'done';
NL : '\r'? '\n';
StreamingTest.java
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;
import org.antlr.v4.runtime.tree.TerminalNode;

public class StreamingTest {
    public static void main(String[] args) throws Exception {
        lex();
        parse();
    }

    private static void lex() {
        System.out.println("-> Reading from lexer:");
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));

        // Read each token until hitting the input "done".
        Token t;
        while ((t = lexer.nextToken()).getType() != StreamingLexer.DONE) {
            if (t.getText().trim().length() == 0) {
                System.out.println("-> " + StreamingLexer.tokenNames[t.getType()]);
            } else {
                System.out.println("-> " + t.getText());
            }
        }
    }

    private static void parse() {
        System.out.println("-> Reading from parser:");
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));
        StreamingParser parser = new StreamingParser(new UnbufferedTokenStream<CommonToken>(lexer));

        // Print each terminal as the parser matches it.
        parser.addParseListener(new StreamingBaseListener() {
            @Override
            public void visitTerminal(TerminalNode t) {
                if (t.getText().trim().length() == 0) {
                    System.out.println("-> " + StreamingLexer.tokenNames[t.getSymbol().getType()]);
                } else {
                    System.out.println("-> " + t.getText());
                }
            }
        });
        parser.fox();
    }
}
Below is a mix of the input and output as they're provided to and received from the lexer and parser in the program above. Each line of output is prefixed with ->. I'll explain why things are the way they are after that.
Input & Output
-> Reading from lexer:
quick
-> quick
brown
-> NL
-> brown
fox
-> NL
-> fox
done
-> NL
-> Reading from parser:
quick
brown
-> quick
-> NL
fox
-> brown
-> NL
done
-> fox
-> NL
-> done
-> NL
The first thing I notice is that the lexer immediately received quick followed by a newline as input, but only provided a token for quick. The reason for this discrepancy is that UnbufferedCharStream reads ahead one more character (even though it has a perfectly good NL token ready for me!) because it won't sit on an empty look-ahead character buffer. Alas, the unbuffered stream is buffered. According to the Javadoc comment in the class itself:
"Unbuffered" here refers to fact that it doesn't buffer all data, not that's it's on demand loading of char.
This extra read translates into waiting on the stream for more input, which explains why the lexer is one token behind for the rest of the input.
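If you want to see exactly where the wait happens, here's a minimal sketch using nothing but the public lexer API (the class name LexerLagDemo is made up, and it assumes the StreamingLexer generated from the grammar above). Type quick and hit Enter: the first nextToken() returns immediately, but the second one won't hand over the NL token until the next line starts arriving.

import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;

public class LexerLagDemo {
    public static void main(String[] args) {
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));

        // Type "quick" + Enter. Everything needed for both the 'quick' and
        // NL tokens is now available, and this call returns promptly.
        Token quick = lexer.nextToken();
        System.out.println("-> " + quick.getText());

        // This call, however, blocks: after consuming the newline, the char
        // stream insists on reading one more look-ahead character, so the NL
        // token isn't delivered until the next line starts arriving.
        Token nl = lexer.nextToken();
        System.out.println("-> NL (type " + nl.getType() + ")");
    }
}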
Now on to the parser. Why does it lag two tokens behind while the lexer lagged only one? Simple: because UnbufferedTokenStream won't sit on an empty look-ahead buffer either. But it can't fetch that next token until (a) it has a spare token from the lexer and (b) the lexer's UnbufferedCharStream has read its own look-ahead character. In effect, it's the lexer's one-character "lag" plus a one-token "lag" of the token stream's own.
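The same effect is visible one level up if you drive the UnbufferedTokenStream by hand. Again, this is just a sketch (TokenStreamLagDemo is a made-up name, and the generated StreamingLexer is assumed):

import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;

public class TokenStreamLagDemo {
    public static void main(String[] args) {
        UnbufferedCharStream chars = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(chars);
        lexer.setTokenFactory(new CommonTokenFactory(true));
        UnbufferedTokenStream<CommonToken> tokens =
                new UnbufferedTokenStream<CommonToken>(lexer);

        // Getting the first token means the lexer had to finish 'quick', and
        // finishing 'quick' means its char stream already read one character
        // past it (the newline). With "quick" + Enter typed, this call has
        // what it needs and returns right away.
        System.out.println("-> LT(1) = " + tokens.LT(1).getText());

        // consume() steps past 'quick' and then refills its own look-ahead:
        // it asks the lexer for the NL token, and the lexer can't deliver NL
        // until its char stream reads the first character of the *next*
        // line. So this call blocks until "brown" starts arriving.
        tokens.consume();
        System.out.println("-> LT(1) is now token type " + tokens.LT(1).getType());
    }
}

Note that it's the consume() that blocks, not the LT(1): consuming is what empties the look-ahead buffer the stream refuses to leave empty.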
It appears that getting "lag-free," data-on-demand streams in ANTLR v4 means writing your own. But it seems to me that the existing streams work as expected.
As for the follow-up question, "Is ANTLR suitable for parsing data from streams that don't have EOF right after the text to parse?"
I can't answer that with confidence for ANTLR 4 yet. It seems easy enough to write a token stream that doesn't buffer ahead until it's needed (override UnbufferedTokenStream's consume to skip calling sync), but the character stream gets called by classes that do their own reading ahead regardless of anyone's buffering. Or so it seems. I'll keep digging into this as best I can, but it may require learning the official way to do this.
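For what it's worth, here is a rough, untested sketch of that consume-without-sync idea. It pokes at protected members of UnbufferedTokenStream (tokens, p, n, numMarkers, lastToken, lastTokenBufferStart, currentTokenIndex); those names come from my reading of the runtime source and may not match every ANTLR 4 version, so treat it as an illustration of the approach rather than a drop-in class.

import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.TokenSource;
import org.antlr.v4.runtime.UnbufferedTokenStream;

public class LazyUnbufferedTokenStream extends UnbufferedTokenStream<CommonToken> {
    public LazyUnbufferedTokenStream(TokenSource tokenSource) {
        super(tokenSource);
    }

    @Override
    public void consume() {
        if (LA(1) == Token.EOF) {
            throw new IllegalStateException("cannot consume EOF");
        }
        // Mirror the stock bookkeeping so LT(-1) and buffer reuse still work.
        lastToken = tokens[p];
        if (p == n - 1 && numMarkers == 0) {
            n = 0;
            p = -1; // p++ below leaves this at 0
            lastTokenBufferStart = lastToken;
        }
        p++;
        currentTokenIndex++;
        // The stock implementation calls sync(1) here, which is what forces
        // the stream to pull the next token (and block) immediately. Leaving
        // it out defers that work to the next LA()/LT() call, i.e. to the
        // moment the parser actually needs the next token.
    }
}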