5

What is the fastest way to query a huge XML file in Java?

DOM with XPath: this takes a lot of time.

     DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
     docBuilderFactory.setNamespaceAware(true);

     DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
     Document document = docBuilder.parse(new File("test.xml"));

     XPath xpath = XPathFactory.newInstance().newXPath();

     String xPath = "/*/*[@id='ABCD']/*/*";

     XPathExpression expr = xpath.compile(xPath);
     // this line takes a lot of time
     NodeList result = (NodeList)expr.evaluate(document, XPathConstants.NODESET);

With the last line of the code, the program finishes in 40 seconds; without it, in 1 second.

SAX: I don't know whether this can be used for querying; on the internet I can only find examples of parsing.

What other options would make the query faster? My XML file is around 5 MB. Thanks.

Mahender Singh

5 Answers

4

If your id attributes are of type xs:ID and you have an XML schema for your document, you can use the Document.getElementById(String) method. I will demonstrate below with an example.

XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<schema 
    xmlns="http://www.w3.org/2001/XMLSchema" 
    targetNamespace="http://www.example.org/schema" 
    xmlns:tns="http://www.example.org/schema" 
    elementFormDefault="qualified">

    <element name="foo">
        <complexType>
            <sequence>
                <element ref="tns:bar" maxOccurs="unbounded"/>
            </sequence>
        </complexType>
    </element>

    <element name="bar">
        <complexType>
            <attribute name="id" type="ID"/>
        </complexType>
    </element>

</schema>

XML Input (input.xml)

<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns="http://www.example.org/schema">
    <bar id="ABCD"/>
    <bar id="EFGH"/>
    <bar id="IJK"/>
</foo>

Demo

You will need to set the instance of Schema on the DocumentBuilderFactory to get everything to work.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.*;
import javax.xml.validation.*;
import org.w3c.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new File("src/forum17250259/schema.xsd"));

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setSchema(schema);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document document = db.parse(new File("src/forum17250259/input.xml"));

        Element result = document.getElementById("EFGH");
        System.out.println(result);
    }

}
bdoughan
1

Have a look at the SAX API, because it is the fastest and least memory-intensive mechanism currently available for dealing with XML documents.

Suresh Atta
1

It depends on the type of query you want to perform.

If, for example, you just want to find a node by ID and then read out its textual contents, SAX will be very fast, but it will require a little coding to write a SAX handler (probably extended from this).

If, on the other hand, you want to perform a fairly complex query along the lines of "get the third ancestor node of foo where foo has a child bah", you're pretty much going to have to use XPath, as the SAX handler would be impossibly complex.
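
For the simple case described above (find an element by its id attribute and read out its text), a handler might look like the following. This is only a minimal sketch: the class and method names (SaxIdLookup, IdTextHandler, textOfId) and the sample XML are made up for illustration, and it assumes the attribute is literally named id and carries no namespace prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxIdLookup {

    /** Hypothetical handler: collects the text inside the first element whose id matches. */
    static class IdTextHandler extends DefaultHandler {
        private final String wantedId;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0;        // 0 = outside the matched element
        private boolean done = false; // true once the matched element has been closed

        IdTextHandler(String wantedId) {
            this.wantedId = wantedId;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (done) {
                return;               // only the first match is collected
            }
            if (depth > 0) {
                depth++;              // nested element inside the match
            } else if (wantedId.equals(atts.getValue("id"))) {
                depth = 1;            // entering the matched element
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0 && --depth == 0) {
                done = true;          // left the matched element
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Returns the concatenated text of the first element with the given id, or "" if none. */
    static String textOfId(String xml, String id) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        IdTextHandler handler = new IdTextHandler(id);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item id=\"ABCD\"><a><b>hello</b></a></item>"
                   + "<item id=\"EFGH\">other</item></root>";
        System.out.println(textOfId(xml, "ABCD")); // prints "hello"
    }
}
```

Note that this sketch still reads the whole document after the match; stopping early would need extra handling that is left out here.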

wobblycogs
1

The JDK's default XPath engine is notorious for its slow performance. You should consider Jaxen or vtd-xml. See the following article:

http://fahdshariff.blogspot.com/2010/08/faster-xpaths-with-vtd-xml.html
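
If adding a library is not an option, a different technique (not Jaxen or vtd-xml) is to skip the XPath engine and walk the DOM by hand. The sketch below is a hand-written equivalent of the question's expression /*/*[@id='ABCD']/*/* for illustration only; the class and method names (DomWalk, childElements, grandchildrenOfId) and the sample XML are assumptions, and whether it is actually faster on a given document should be measured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomWalk {

    /** Returns the direct child elements of a node, skipping text, comments, etc. */
    static List<Element> childElements(Node parent) {
        List<Element> out = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                out.add((Element) n);
            }
        }
        return out;
    }

    /** Hand-written equivalent of /*/*[@id='...']/*/* in document order. */
    static List<Element> grandchildrenOfId(Document doc, String id) {
        List<Element> result = new ArrayList<>();
        for (Element second : childElements(doc.getDocumentElement())) {
            if (!id.equals(second.getAttribute("id"))) {
                continue;             // getAttribute returns "" when the attribute is absent
            }
            for (Element third : childElements(second)) {
                result.addAll(childElements(third));
            }
        }
        return result;
    }

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<root><x id=\"ABCD\"><y><z1/><z2/></y></x>"
                           + "<x id=\"EFGH\"><y><z3/></y></x></root>");
        for (Element e : grandchildrenOfId(doc, "ABCD")) {
            System.out.println(e.getTagName()); // prints "z1" then "z2"
        }
    }
}
```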

vtd-xml-author
-2

Give the Jackson lib a try; it's one of the fastest XML/JSON p

Danny02