
I am trying to parse a very large (10+ GB) gzip-compressed file in Python 3. Instead of creating a parse tree, I used embedded actions, based on the suggestions in this answer.

However, looking at the FileStream code, it reads the entire file into memory and then parses it. That will not work for big files.

So, this is a two part question.

  • Can ANTLR4 use a file stream, probably custom, that allows it to read chunks of the file at a time? What should the class interface look like?
  • Assuming the answer to the above is "yes", would that class need to handle seek operations, which would be a problem if the underlying file is gzip-compressed?
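For concreteness, here is a minimal sketch of what such a chunked stream could look like. The method names (`index`, `consume`, `LA`, `seek`, `getText`) mirror those exposed by ANTLR's Python `InputStream`, but this class is illustrative only, not a drop-in replacement. Note that because backward seeks must always succeed, it never discards decoded data, so it only bounds how much is decompressed ahead, not total memory:

```python
import gzip
import io


class ChunkedGzipStream:
    """Illustrative sketch (not a drop-in ANTLR stream): decodes a gzip
    file in chunks, but keeps everything decoded so far in memory,
    because the parser may seek back to any earlier index."""

    def __init__(self, fileobj, chunk_size=1 << 20, encoding="utf-8"):
        self._gz = gzip.open(fileobj, "rt", encoding=encoding)
        self._chunk_size = chunk_size
        self._buf = ""        # everything decoded so far
        self._index = 0       # current position, as the parser sees it
        self._eof = False

    def _fill_to(self, pos):
        # Decompress forward until the buffer covers `pos` (or EOF).
        while not self._eof and len(self._buf) <= pos:
            chunk = self._gz.read(self._chunk_size)
            if not chunk:
                self._eof = True
            else:
                self._buf += chunk

    @property
    def index(self):
        return self._index

    def consume(self):
        self._fill_to(self._index)
        if self._index < len(self._buf):
            self._index += 1

    def LA(self, offset):
        # 1-based lookahead; returns -1 at EOF, like ANTLR's streams.
        pos = self._index + offset - 1
        self._fill_to(pos)
        if pos >= len(self._buf):
            return -1
        return ord(self._buf[pos])

    def seek(self, index):
        # Backward seeks are free only because nothing is discarded;
        # this is exactly what makes gzip input awkward to trim.
        self._fill_to(index)
        self._index = min(index, len(self._buf))

    def getText(self, start, stop):
        # Inclusive bounds, matching ANTLR's convention.
        self._fill_to(stop)
        return self._buf[start:stop + 1]
```

Decompression happens lazily, one chunk at a time, so forward progress is bounded, but the growing `_buf` is the memory problem the comments below discuss.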
David R.

1 Answer


Short answer: no, not possible.

Long(er) answer: ANTLR4 can potentially use unlimited lookahead, so it relies on the stream being able to seek to any position with no delay; otherwise parsing speed would grind nearly to a halt. For that reason all runtimes use a normal file stream that reads in the entire file at once.
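The access pattern behind this is easy to reproduce with a toy backtracking parser (hypothetical code, not ANTLR's actual prediction machinery): every failed alternative rewinds the stream, so input consumed once may be re-read many times, and a forward-only stream cannot support that:

```python
class SeekCountingStream:
    """Toy stream that counts seeks, to illustrate the access pattern
    a lookahead/backtracking parser produces."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.seeks = 0

    def LA(self, i):
        # 1-based lookahead without consuming.
        p = self.pos + i - 1
        return self.text[p] if p < len(self.text) else None

    def consume(self):
        self.pos += 1

    def seek(self, pos):
        self.seeks += 1
        self.pos = pos


def try_literal(stream, literal):
    # Match `literal` at the current position, consuming on success.
    for i, ch in enumerate(literal):
        if stream.LA(i + 1) != ch:
            return False
    for _ in literal:
        stream.consume()
    return True


def parse_first_of(stream, alternatives):
    # Try each alternative in order; every attempt rewinds via seek(),
    # which is why the stream must allow jumping back cheaply.
    start = stream.pos
    for alt in alternatives:
        stream.seek(start)
        if try_literal(stream, alt):
            return alt
    return None
```

ANTLR4's adaptive prediction is far more sophisticated than this, but the consequence is the same: the character stream must support cheap random access over everything read so far.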

There were discussions/attempts in the past to create a stream that buffers only part of the input, but I haven't heard of anything that actually works.

Mike Lischke
  • Suppose I know the structure of the file and can estimate the max token size, max token recursion, etc. I was looking at how `InputStream.GetText` gets called and noticed small jumps back and forth. I am thinking that if I can estimate how far back the parser would go, I should be able to maintain a buffer (loading into the buffer shouldn't be a problem, but holding on to "old" data seems to be the key). Would that be sufficient? What else might trip me up in this scenario? – David R. Sep 01 '20 at 15:22
  • You cannot make the parser use only a limited amount of lookahead, but you might be lucky and get away with unbuffered [character](https://www.antlr.org/api/Java/org/antlr/v4/runtime/UnbufferedCharStream.html) and [token](https://www.antlr.org/api/Java/org/antlr/v4/runtime/UnbufferedTokenStream.html) streams. – Mike Lischke Sep 01 '20 at 18:07
  • What I am thinking is that I can maintain a buffer. Lookahead will not be a problem, but to manage the memory I would need to shrink the buffer from the front from time to time. I would need to know at which point the front part of the buffer becomes irrelevant. Any suggestions? Would the token stream know when this is the case? – David R. Sep 01 '20 at 19:23
  • The char stream maintains an index into the original source. Nothing before that is relevant anymore (unless you do a lookback yourself, of course). But the unbuffered streams handle that for you, so I see no reason why you should maintain your own buffer. If the existing streams do not suffice for you (e.g. because you want to use a file mapping), then implement your own stream, modeled after the existing ones. – Mike Lischke Sep 02 '20 at 06:56
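The trimming idea discussed in these comments can be sketched as a sliding window that discards everything before the oldest outstanding mark. This is roughly what ANTLR's unbuffered streams do internally; all names below are hypothetical illustrations, not ANTLR API:

```python
class SlidingWindowBuffer:
    """Hypothetical sketch of the buffer-trimming idea: keep a window
    of recent characters and drop everything before the oldest mark
    still outstanding (or the current index, if there are no marks)."""

    def __init__(self):
        self._buf = ""      # window contents
        self._base = 0      # absolute index of _buf[0]
        self._marks = []    # absolute indexes that must stay reachable

    def append(self, text):
        self._buf += text

    def mark(self, index):
        # Pin `index`: nothing at or after it may be trimmed.
        self._marks.append(index)
        return index

    def release(self, index):
        self._marks.remove(index)

    def trim(self, current_index):
        # Everything before the oldest mark (or the current index, if
        # no marks are outstanding) can never be read again, so drop it.
        low = min(self._marks, default=current_index)
        drop = low - self._base
        if drop > 0:
            self._buf = self._buf[drop:]
            self._base = low

    def char_at(self, index):
        if index < self._base:
            raise IndexError("index %d was trimmed away" % index)
        return self._buf[index - self._base]
```

The mark/release pair is the answer to "at which point is the front irrelevant?": as long as the parser holds a mark, the window cannot shrink past it; once all marks are released, everything before the current index can go.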