I think you're using the unbuffered streams correctly, and what you see is the expected, desired result of using those streams. But you may have expectations of them that they aren't obligated to meet.
Below is test code for us to poke with sticks. I'm using System.in for the input, so I modified the grammar to account for the newline characters between the word tokens.
Streaming.g
grammar Streaming;
fox : 'quick' NL 'brown' NL 'fox' NL DONE NL;
DONE : 'done';
NL : '\r'? '\n';
StreamingTest.java
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;
import org.antlr.v4.runtime.tree.TerminalNode;

public class StreamingTest {
    public static void main(String[] args) throws Exception {
        lex();
        parse();
    }

    private static void lex() {
        System.out.println("-> Reading from lexer:");
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));

        // Read each token until hitting the input "done".
        Token t;
        while ((t = lexer.nextToken()).getType() != StreamingLexer.DONE) {
            if (t.getText().trim().length() == 0) {
                System.out.println("-> " + StreamingLexer.tokenNames[t.getType()]);
            } else {
                System.out.println("-> " + t.getText());
            }
        }
    }

    private static void parse() {
        System.out.println("-> Reading from parser:");
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));
        StreamingParser parser = new StreamingParser(new UnbufferedTokenStream<CommonToken>(lexer));

        // Print each terminal as the parser matches it.
        parser.addParseListener(new StreamingBaseListener() {
            @Override
            public void visitTerminal(TerminalNode t) {
                if (t.getText().trim().length() == 0) {
                    System.out.println("-> " + StreamingLexer.tokenNames[t.getSymbol().getType()]);
                } else {
                    System.out.println("-> " + t.getText());
                }
            }
        });
        parser.fox();
    }
}
Below is a mix of the input and output as they're provided to and received from the lexer and parser in the program above. Each line of output is prefixed with ->. I'll explain why things are the way they are after that.
Input & Output
-> Reading from lexer:
quick
-> quick
brown
-> NL
-> brown
fox
-> NL
-> fox
done
-> NL
-> Reading from parser:
quick
brown
-> quick
-> NL
fox
-> brown
-> NL
done
-> fox
-> NL
-> done
-> NL
The first thing I notice is that the lexer immediately received quick followed by a newline as input, but only provided a token for quick. The reason for this discrepancy is that UnbufferedCharStream reads ahead one more character (even though it has a perfectly good NL token ready for me!) because it won't sit on an empty look-ahead character buffer. Alas, the unbuffered stream is buffered. According to the Javadoc comment in the class itself:
"Unbuffered" here refers to fact that it doesn't buffer all data, not that's it's on demand loading of char.
This extra read translates into waiting on the stream for more input, which explains why the lexer is one token behind for the rest of the input.
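If you want to see exactly where the wait happens, here's a minimal sketch using nothing but the public lexer API (the class name LexerLagDemo is made up, and it assumes the StreamingLexer generated from the grammar above). Type quick and hit Enter: the first nextToken() returns immediately, but the second one won't hand over the NL token until the next line starts arriving.

import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;

public class LexerLagDemo {
    public static void main(String[] args) {
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));

        // Type "quick" + Enter. Everything needed for both the 'quick' and
        // NL tokens is now available, and this call returns promptly.
        Token quick = lexer.nextToken();
        System.out.println("-> " + quick.getText());

        // This call, however, blocks: after consuming the newline, the char
        // stream insists on reading one more look-ahead character, so the NL
        // token isn't delivered until the next line starts arriving.
        Token nl = lexer.nextToken();
        System.out.println("-> NL (type " + nl.getType() + ")");
    }
}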
Now on to the parser. Why does it lag two tokens behind while the lexer lagged only one? Simple: because UnbufferedTokenStream won't sit on an empty look-ahead buffer either. But it can't fetch that next token until (a) it has a spare token from the lexer and (b) the lexer's UnbufferedCharStream has read its own look-ahead character. In effect, it's the lexer's one-character "lag" plus a one-token "lag" of the token stream's own.
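The same effect is visible one level up if you drive the UnbufferedTokenStream by hand. Again, this is just a sketch (TokenStreamLagDemo is a made-up name, and the generated StreamingLexer is assumed):

import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;

public class TokenStreamLagDemo {
    public static void main(String[] args) {
        UnbufferedCharStream chars = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(chars);
        lexer.setTokenFactory(new CommonTokenFactory(true));
        UnbufferedTokenStream<CommonToken> tokens =
                new UnbufferedTokenStream<CommonToken>(lexer);

        // Getting the first token means the lexer had to finish 'quick', and
        // finishing 'quick' means its char stream already read one character
        // past it (the newline). With "quick" + Enter typed, this call has
        // what it needs and returns right away.
        System.out.println("-> LT(1) = " + tokens.LT(1).getText());

        // consume() steps past 'quick' and then refills its own look-ahead:
        // it asks the lexer for the NL token, and the lexer can't deliver NL
        // until its char stream reads the first character of the *next*
        // line. So this call blocks until "brown" starts arriving.
        tokens.consume();
        System.out.println("-> LT(1) is now token type " + tokens.LT(1).getType());
    }
}

Note that it's the consume() that blocks, not the LT(1): consuming is what empties the look-ahead buffer the stream refuses to leave empty.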
It appears that getting "lag-free," data-on-demand streams in ANTLR v4 means writing your own. But it seems to me that the existing streams work as expected.
As for the follow-up question, "Is ANTLR suitable for parsing data from streams that don't have EOF right after the text to parse?"
I can't answer that with confidence for ANTLR 4 yet. It seems easy enough to write a token stream that doesn't buffer ahead until it's needed (override UnbufferedTokenStream's consume to skip calling sync), but the character stream gets called by classes that do their own reading ahead regardless of anyone's buffering. Or so it seems. I'll keep digging into this as best I can, but it may require learning the official way to do this.
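For what it's worth, here is a rough, untested sketch of that consume-without-sync idea. It pokes at protected members of UnbufferedTokenStream (tokens, p, n, numMarkers, lastToken, lastTokenBufferStart, currentTokenIndex); those names come from my reading of the runtime source and may not match every ANTLR 4 version, so treat it as an illustration of the approach rather than a drop-in class.

import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.TokenSource;
import org.antlr.v4.runtime.UnbufferedTokenStream;

public class LazyUnbufferedTokenStream extends UnbufferedTokenStream<CommonToken> {
    public LazyUnbufferedTokenStream(TokenSource tokenSource) {
        super(tokenSource);
    }

    @Override
    public void consume() {
        if (LA(1) == Token.EOF) {
            throw new IllegalStateException("cannot consume EOF");
        }
        // Mirror the stock bookkeeping so LT(-1) and buffer reuse still work.
        lastToken = tokens[p];
        if (p == n - 1 && numMarkers == 0) {
            n = 0;
            p = -1; // p++ below leaves this at 0
            lastTokenBufferStart = lastToken;
        }
        p++;
        currentTokenIndex++;
        // The stock implementation calls sync(1) here, which is what forces
        // the stream to pull the next token (and block) immediately. Leaving
        // it out defers that work to the next LA()/LT() call, i.e. to the
        // moment the parser actually needs the next token.
    }
}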