
I'm using ANTLR for a simple CSV parser. I'd like to use it on a 29 GB file, but it runs out of memory on the `ANTLRInputStream` call:

    CharStream cs = new ANTLRInputStream(new BufferedInputStream(input, 8192));
    CSVLexer lexer = new CSVLexer(cs);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    CSVParser parser = new CSVParser(tokens);
    ParseTree tree = parser.file();
    ParseTreeWalker walker = new ParseTreeWalker();
    walker.walk(myListener, tree);

I tried changing it to an unbuffered stream:

    CharStream cs = new UnbufferedCharStream(input);
    CSVLexer lexer = new CSVLexer(cs);
    lexer.setTokenFactory(new CommonTokenFactory(true));
    TokenStream tokens = new UnbufferedTokenStream(lexer);
    CSVParser parser = new CSVParser(tokens);

When I run `walker.walk()`, it does not process any records. If I try something like

    parser.setBuildParseTree(false);
    parser.addParseListener(myListener);

it also fails. It seems like I have to parse the file differently if I don't build a parse tree, so I would like documentation or examples of how to do this.

If I use a buffered char stream together with the unbuffered token stream, it gives the error "Unbuffered stream cannot know its size". I have tried different permutations, but I usually end up with a Java heap error or "GC overhead limit exceeded".

I'm using this CSV grammar.

  • This is a Java problem, not an ANTLR problem. Use the Java CLI switches to greatly increase the amount of memory available to Java. BTW, setting `setBuildParseTree` to false means that your walker will have nothing to walk. – GRosenberg Apr 13 '16 at 17:29
  • I have tried allocating 15 GB with the CLI flag, but it still craps out. I'm not sure what the issue is. Do you know how to loop through the parse tree without a parse tree walker? – ForeverConfused Apr 13 '16 at 19:14
  • Absent an overriding reason, the best approach would be to subdivide the input text into manageable-sized chunks. Otherwise, you will need Java memory likely as large as the input text, if not well more. That is, CommonTokens nominally contain a lazy copy of their underlying text, backed by the input text, to support `Token#getText()`. If `getText()` support is desired, then the minimum Java memory requirement is the combined size of the input text, the token and parse-tree overheads, the ANTLR runtime, and your program. – GRosenberg Apr 13 '16 at 22:34
  • KvanTTT, thanks for the question. @GRosenberg, thanks for clarifying what I suspected, namely that not building a parse tree will prevent the walker from calling the listeners. I'm trying to come up with a good methodology to break the input text into chunks. This seems like a common problem, and I wonder/suspect that ANTLR4 could be instructed to "chuck the parse tree", free the associated memory, and start parsing at a new rule. Or perhaps the grammar AND the Java support code could be structured to do this. That's the solution I'm looking into now. – Ross Youngblood Feb 25 '17 at 01:36

1 Answer


I already answered a similar question here: https://stackoverflow.com/a/26120662/4094678

> It seems like I have to parse the file differently if I don't build a parse tree, so I would like documentation or examples of how to do this.

Look up grammar actions in the ANTLR book. As said in the linked answer, forget listeners, visitors, and building a parse tree; instead, embed actions in the grammar so that each record is processed during the parse itself (a sketch follows below). If even that is not enough, split the file into a number of smaller ones and parse each of them.
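
The general shape, as a minimal sketch only: a combined CSV grammar (modeled on the usual reference CSV grammar) with embedded Java actions, where `handleRow` is a hypothetical callback standing in for whatever per-record processing you need. Each row is handed over the moment it is parsed and can be garbage-collected right afterwards:

    grammar CSV;

    @parser::members {
        // Hypothetical hook: replace the body with your own per-record logic.
        public void handleRow(java.util.List<String> fields) {
            System.out.println(fields);
        }
    }

    file : hdr row* ;
    hdr  : row ;

    // Collect the fields of one row into a local list and hand it to
    // handleRow(); nothing is retained once the rule finishes.
    row
    locals [java.util.List<String> fields = new java.util.ArrayList<String>()]
        : f1=field {$fields.add($f1.txt);}
          (',' f2=field {$fields.add($f2.txt);})* '\r'? '\n'
          {handleRow($fields);}
        ;

    // Return the field text via the token itself (safe with copy-text
    // tokens) rather than via the surrounding char stream.
    field returns [String txt]
        : TEXT   {$txt = $TEXT.text;}
        | STRING {$txt = $STRING.text;}
        |        {$txt = "";}
        ;

    TEXT   : ~[,\n\r"]+ ;
    STRING : '"' ('""' | ~'"')* '"' ;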
And of course, as mentioned in the comments, increase the Java VM memory.
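
For completeness, here is a rough sketch of a driver that wires the unbuffered streams from the question together with `setBuildParseTree(false)`; the file name is a placeholder, and `CSVLexer`/`CSVParser` are the classes ANTLR generates from the grammar above:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import org.antlr.v4.runtime.*;

    public class BigCsvDriver {
        public static void main(String[] args) throws Exception {
            // Unbuffered char stream: only a sliding window of the 29 GB
            // file is held in memory, never the whole file.
            CharStream cs = new UnbufferedCharStream(
                    new BufferedInputStream(new FileInputStream("data.csv"), 8192));
            CSVLexer lexer = new CSVLexer(cs);
            // Tokens must copy their text, because the char-stream window
            // moves on and can no longer back Token#getText().
            lexer.setTokenFactory(new CommonTokenFactory(true));
            TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
            CSVParser parser = new CSVParser(tokens);
            // No parse tree is kept; the grammar actions fire as each row
            // is parsed, so memory use stays roughly constant.
            parser.setBuildParseTree(false);
            parser.file();
        }
    }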

  • Of course grammar actions would work... but if an implemented solution works with parse-tree listeners, and only during testing is it discovered that the solution breaks with LARGE files, a generic coding pattern for breaking up the input will likely be easier to implement than rewriting all of the listener methods as grammar actions. – Ross Youngblood Feb 25 '17 at 01:39