4

I am wondering how I can lazily read a large XML file that doesn't fit into memory in Java. Let's assume the file is well-formed, so we don't have to make a first pass to validate it. Does anyone know how to do this in Java?

Here is my fake file (the real file is a Wikipedia dump, which is 50+ GB):

<pages>
  <page>
    <text> some data ....... </text>
  </page>
  <page>
    <text> MORE DATA ........ </text>
  </page>
</pages>

I was trying this with Xerces' DOMParser, which is supposed to be able to do this, but it's loading the whole thing into memory >:O

DOMParser domParser = new DOMParser();
//This is supposed to make the parser defer node expansion, but it's not preventing the full load
domParser.setFeature("http://apache.org/xml/features/dom/defer-node-expansion", true);
//Library says this needs to be set to use defer-node-expansion
domParser.setProperty("http://apache.org/xml/properties/dom/document-class-name", "org.apache.xerces.dom.DocumentImpl");

//THIS IS LOADING THE WHOLE FILE
domParser.parse(new InputSource(wikiXMLBufferedReader));

Document doc = domParser.getDocument();
NodeList pages = doc.getElementsByTagName("page");

for(int i = 0; i < pages.getLength(); i++) {
    Node pageNode = pages.item(i);
    //do something with page nodes
}

Does anyone know how to do this? Or what am I doing wrong in my attempt with this particular Java XML library?

Thanks.

anthonybell
  • While defer-node-expansion delays expanding a node until you use it, it doesn't clear it from memory afterwards, so it doesn't do what you want. – Michael Kay Nov 18 '15 at 09:04

2 Answers

4

You should be looking at SAX parsers in Java. DOM parsers are built to read the entire XML document, load it into memory, and create Java objects from it. SAX parsers parse XML files serially and use an event-based mechanism to process the data. Look at the differences here.

Here's a link to a SAX tutorial. Hope it helps.
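To make the event-based idea concrete, here is a minimal sketch of a SAX handler for the file layout in the question. The class name `PageTextHandler` is my own; the point is that only one page's text is ever buffered at a time, no matter how large the file is:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects the contents of each <text> element, one page at a time,
// without ever holding the whole document in memory.
public class PageTextHandler extends DefaultHandler {
    private final List<String> texts = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <text> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("text".equals(qName)) current = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("text".equals(qName)) {
            // For a 50 GB dump you would process the text here instead of storing it.
            texts.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getTexts() { return texts; }

    public static void main(String[] args) throws Exception {
        String xml = "<pages><page><text>some data</text></page>"
                   + "<page><text>MORE DATA</text></page></pages>";
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(handler.getTexts()); // [some data, MORE DATA]
    }
}
```

For the real Wikipedia dump you would pass a `FileInputStream` instead of the in-memory byte stream, and do your per-page work inside `endElement` rather than accumulating results in a list.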

Vinay Rao
2

If you're prepared to buy a Saxon-EE license, then you can issue the simple query "copy-of(//page)", with execution options set to enable streaming, and it will return you an iterator over a sequence of trees each rooted at a page element; each of the trees will be fetched when you advance the iterator, and will be garbage-collected when you have finished with it. (That's assuming you really want to do the processing in Java; you could also do the processing in XQuery or XSLT, of course, which would probably save you many lines of code.)

If you have more time than money, and want a home-brew solution, then write a SAX filter which accepts parsing events from the XML parser and sends them on to a DocumentBuilder; every time you hit a startElement event for a page element, open a new DocumentBuilder; when the corresponding endElement event is notified, grab the tree that has been built by the DocumentBuilder, and pass it to your Java application for processing.

Michael Kay