1

Is it possible to parse only, say, the first half of a file with ANTLR4? I am parsing large files using UnbufferedCharStream and UnbufferedTokenStream.

I am not building a parse tree and I am using parse actions instead of visitor/listener patterns. With these I was able to save a significant amount of RAM and improve the parse speed.

However, it still takes around 15s to parse the whole file. The file is divided into two sections: the first contains metadata, the second the actual data. Most of the time is spent in the data section, which has more than 3 million lines to parse, while the metadata section has only around 20,000 lines. Is it possible to parse only the metadata section, which would improve parse speed significantly? Is it possible to inject EOF manually after the metadata section?

How about dividing the file into two?

ste-fu

3 Answers

0

How about programmatically extracting only the part you want to parse and writing it to a new tmp.extension file, which you then parse? It could look like this:

System.IO.File.WriteAllText(@"C:\Users\Path\tmp.extension", text);

After the parsing you can delete the tmp file and the original stays as it is.

System.IO.File.Delete(@"C:\Users\Path\tmp.extension");
Kiroul
  • I am using the parser to validate uploaded files in a web application, so creating additional files on disk is not really a nice option. I would like to find a better one, e.g. by modifying the input stream. Thx. – Janez Vratanar Jul 13 '18 at 08:30
  • You could try to extract only the part you want and convert it into a stream, see [Convert String to System.IO.Stream](https://stackoverflow.com/questions/8047064/convert-string-to-system-io-stream). It is a very big string, but if you can programmatically extract only the part you want, it might work. – Kiroul Jul 13 '18 at 09:51
  • Yes, I could, but then the Stream would be really memory-intensive, as the files are really big. This is why I am using UnbufferedCharStream and UnbufferedTokenStream. – Janez Vratanar Jul 13 '18 at 10:23
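
One possible middle ground for the concern in these comments is to cut the input off at the stream level without materializing the metadata as a string or a temp file. The sketch below is only an illustration: SectionLimitedReader is a hypothetical helper (not part of ANTLR or .NET), and it assumes the data section begins with a line that starts with a known marker such as "DATA". It buffers one line at a time, so memory use stays small as long as individual lines are short.

```csharp
using System;
using System.IO;

// Hypothetical helper: wraps a TextReader and reports end of input as soon
// as a line starting with the given marker is encountered.
// Assumption: the marker appears at the start of a line.
// Note: line endings are normalized to '\n', so character offsets can
// differ from the original file if it uses "\r\n".
public sealed class SectionLimitedReader : TextReader
{
    private readonly TextReader inner;
    private readonly string marker;
    private string line = "";   // current line, including a trailing '\n'
    private int pos = 0;        // next unread index in 'line'
    private bool done = false;  // true once the marker or real EOF was seen

    public SectionLimitedReader(TextReader inner, string marker)
    {
        this.inner = inner;
        this.marker = marker;
    }

    // Ensures 'line' has an unread character; returns false at end of section.
    private bool Fill()
    {
        if (done) return false;
        if (pos < line.Length) return true;

        var next = inner.ReadLine();
        if (next == null || next.StartsWith(marker, StringComparison.Ordinal))
        {
            done = true;  // stop here: the rest of the file is never read
            return false;
        }
        line = next + "\n";
        pos = 0;
        return true;
    }

    public override int Peek() => Fill() ? line[pos] : -1;

    public override int Read() => Fill() ? line[pos++] : -1;

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Dispose();
        base.Dispose(disposing);
    }
}
```

Wrapping the upload's reader in such a class and handing the wrapper to the char stream would make the lexer see a normal end of input right before the data section, so the full start rule with EOF could still be used. Whether this fits depends on how reliably the marker line can be recognized without lexing.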
0

ANTLR4 creates recursive-descent parsers, with parse functions that can be invoked directly. Assume you have a grammar like this:

grammar t;

start: meta data EOF;
meta: x y z;

data: a b c+;

Your natural entry point would be the start rule (in your case that would be the rule for the entire file). But it's also possible to only invoke rule meta, which in your case could be the header part of the file. If you don't end this rule with EOF, your parser will just consume enough input to parse this particular part of the entire file.

Mike Lischke
  • This was the first alternative I tried. The problem is that if you do not end the rule with EOF, ANTLR4 does not guarantee that it will consume enough tokens. There were cases where errors near the end of the metadata section were not reported. – Janez Vratanar Jul 13 '18 at 08:25
  • Well, that's the price you have to pay I guess, unless you split the file manually somehow and only feed the first part to the parser. Then you can of course also add EOF. – Mike Lischke Jul 14 '18 at 10:47
0

So, I was able to find a solution. I overrode the Emit method of the generated lexer so that it detects the beginning of the second section and manually injects an EOF token, like this:

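// Called by the lexer for each matched token.
public override IToken Emit()
{
    string tokenText = base.Text;
    // In metadata-only mode, report end of input as soon as the token that
    // opens the data section is matched.
    if (this.metaDataOnly && tokenText == "DATA")
        return base.EmitEOF();
    return base.Emit();
}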