I am wondering how I can lazily read a large XML file that doesn't fit into memory in Java. Let's assume the file is well-formed, so we don't need a first pass just to validate it.
Here is my fake file (the real file is a Wikipedia dump, 50+ GB):
<pages>
<page>
<text> some data ....... </text>
</page>
<page>
<text> MORE DATA ........ </text>
</page>
</pages>
I was trying this with Apache Xerces, which is supposed to support deferred (lazy) node expansion, but it's still loading the whole thing into memory >:O
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

DOMParser domParser = new DOMParser();
//This is supposed to make it lazy-load the file, but it's not working
domParser.setFeature("http://apache.org/xml/features/dom/defer-node-expansion", true);
//Library says this needs to be set to use defer-node-expansion
domParser.setProperty("http://apache.org/xml/properties/dom/document-class-name", "org.apache.xerces.dom.DocumentImpl");
//THIS IS LOADING THE WHOLE FILE
domParser.parse(new InputSource(wikiXMLBufferedReader));
Document doc = domParser.getDocument();
NodeList pages = doc.getElementsByTagName("page");
for(int i = 0; i < pages.getLength(); i++) {
Node pageNode = pages.item(i);
//do something with page nodes
}
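I also started sketching a StAX version, since I've read that it streams events instead of building a tree. This is just a minimal untested sketch against my fake file above (the class and method names are mine, and I'm parsing from a String here only to keep the example self-contained) — is this the right direction?

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamPages {

    // Pull the contents of every <text> element one at a time,
    // without ever materializing the whole document in memory.
    static List<String> extractTexts(Reader source) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(source);
        List<String> texts = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "text".equals(reader.getLocalName())) {
                // getElementText() consumes everything up to the matching </text>
                texts.add(reader.getElementText().trim());
            }
        }
        reader.close();
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real 50+ GB dump; in practice I'd pass a
        // BufferedReader over the file instead of a StringReader.
        String xml = "<pages><page><text> some data ....... </text></page>"
                   + "<page><text> MORE DATA ........ </text></page></pages>";
        for (String text : extractTexts(new StringReader(xml))) {
            System.out.println(text); // do something with each page's text
        }
    }
}
```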
Does anyone know how to do this? Or what am I doing wrong in my attempt with Xerces?
Thanks.