
Is the java parser generated by ANTLR capable of streaming arbitrarily large files?

I tried constructing a Lexer with an UnbufferedCharStream and passed that to the parser. I got an UnsupportedOperationException because of a call to size on the UnbufferedCharStream, and the exception contained an explanation that you can't call size on an UnbufferedCharStream.

    Lexer lexer = new Lexer(new UnbufferedCharStream(new CharArrayReader("".toCharArray())));
    CommonTokenStream stream = new CommonTokenStream(lexer);
    Parser parser = new Parser(stream);

I have a file I exported from Hadoop using Pig. It has a large number of rows separated by '\n', and the columns in each row are separated by '\t'. This is easy to parse in Java: I use a BufferedReader to read each line, then split on '\t' to get each column. But I also want some sort of schema validation. The first column should be a properly formatted date, followed by some price columns, followed by some hex columns.
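For reference, the kind of per-row check described above could be sketched in plain Java as follows. This is only an illustration: the class name `RowValidator`, the ISO date format, the price and hex patterns, and the fixed number of price columns are all assumptions, not details from the actual export.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

public class RowValidator {
    // Assumed column formats; adjust to the real schema.
    private static final Pattern PRICE = Pattern.compile("\\d+(\\.\\d+)?");
    private static final Pattern HEX = Pattern.compile("[0-9A-Fa-f]+");

    /** Returns true if the line is: one date, priceCount prices, then one or more hex columns. */
    public static boolean isValidRow(String line, int priceCount) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length < 1 + priceCount + 1) {
            return false; // need the date, the prices, and at least one hex column
        }
        try {
            LocalDate.parse(cols[0]); // assumes ISO-8601 dates such as 2017-02-25
        } catch (DateTimeParseException e) {
            return false;
        }
        for (int i = 1; i <= priceCount; i++) {
            if (!PRICE.matcher(cols[i]).matches()) {
                return false;
            }
        }
        for (int i = 1 + priceCount; i < cols.length; i++) {
            if (!HEX.matcher(cols[i]).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidRow("2017-02-25\t12.50\t3.99\t1A2F\tff", 2)); // true
        System.out.println(isValidRow("not-a-date\t12.50\t3.99\t1A2F", 2));     // false
    }
}
```

A hand-written check like this works one line at a time, which is the behaviour I would like to keep while moving the schema into a grammar.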

When I look at the generated parser code I could call it like so

    parser.lines().line()

This would give me a List which, conceptually, I could iterate over. But it seems the list would have a fixed size by the time I get it, which means the parser has probably already parsed the entire file.

Is there another part of the API that would allow you to stream really large files? Like some way of using the Visitor or Listener to get called as it is reading the file? But it can't keep the entire file in memory. It will not fit.
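To make the desired behaviour concrete, here is a plain-Java stand-in for what I mean by "getting called as it is reading the file". It does not use ANTLR at all; `LineStreamer` and its `stream` method are hypothetical names for illustration. The point is that only one line is held in memory at a time and the handler sees each row as soon as it is read.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

public class LineStreamer {
    /** Reads one line at a time, passes its columns to the handler, and returns the row count. */
    public static long stream(Reader source, Consumer<String[]> handler) throws IOException {
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line.split("\t", -1)); // the handler runs before the next line is read
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // A small in-memory sample; a FileReader would be used for a real file.
        Reader demo = new StringReader("2017-02-25\t12.50\t1A2F\n2017-02-26\t13.00\tff\n");
        long rows = stream(demo, cols -> System.out.println(cols[0] + " has " + cols.length + " columns"));
        System.out.println(rows + " rows");
    }
}
```

Ideally the generated parser would invoke a listener or visitor per `line` rule in the same incremental way.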

dodtsair

1 Answer


You could do it like this:

    InputStream is = new FileInputStream(inputFile); // inputFile is the path to your input file
    // UnbufferedCharStream reads incrementally; ANTLRInputStream would load the whole file into memory
    CharStream input = new UnbufferedCharStream(is);
    GeneratedLexer lex = new GeneratedLexer(input);
    lex.setTokenFactory(new CommonTokenFactory(true)); // copy token text out of the sliding char buffer
    TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
    GeneratedParser parser = new GeneratedParser(tokens);
    parser.setBuildParseTree(false); // do not build a parse tree
    parser.top_level_rule();

And if the file is quite big, forget about listeners and visitors; I would create the objects directly in the grammar. Just put them in some structure (e.g. a HashMap or Vector) and retrieve them as needed. This avoids building the parse tree, which is what really takes a lot of memory.

cantSleepNow
  • I have a solution implemented with parse tree listeners. It's not clear to me whether the above solution, where I don't generate a parse tree, will call the listeners. It seems that it won't. Creating objects in the grammar puts non-grammar stuff in the grammar definition files :(. – Ross Youngblood Feb 25 '17 at 01:03
  • @RossYoungblood You are right, there are neither listeners nor visitors. And yes, it's non-grammar stuff (it's called grammar actions) and it's perfectly fine. In the ANTLR book there is even an example of how one could build a calculator that way. – cantSleepNow Feb 25 '17 at 13:58
  • I know how to use grammar actions, I just don't want to. I want to solve the issue of big files with parse tree listeners. That's the path I'm investigating now. – Ross Youngblood Feb 26 '17 at 07:44