Fastest way to query XML in Java

Question

What is the fastest way to query a huge XML file in Java?

DOM with XPath: this takes a lot of time.

     DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
     docBuilderFactory.setNamespaceAware(true);

     DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
     Document document = docBuilder.parse(new File("test.xml"));

     XPath xpath = XPathFactory.newInstance().newXPath();

     String xPath = "/*/*[@id='ABCD']/*/*";

     XPathExpression expr = xpath.compile(xPath);
     // this line takes a lot of time
     NodeList result = (NodeList)expr.evaluate(document, XPathConstants.NODESET);

With the last line in the code, the program finishes in 40 seconds; without it, it finishes in 1 second.

SAX: I don't know whether this can be used for querying; on the internet I can only find examples of parsing.

What other options are there to make the query faster? My XML file is around 5 MB. Thanks.

The fastest (as has been repeatedly proven) is vtd-xml (http://vtd-xml.sf.net) – vtd-xml-author, Jul 18 '13 at 19:29

bdoughan · Answer 1 · 2013-06-25T10:52:27.583

If your id attributes are of type xs:ID and you have an XML schema for your document then you can use the Document.getElementById(String) method. I will demonstrate below with an example.

XML Schema

<?xml version="1.0" encoding="UTF-8"?>
<schema 
    xmlns="http://www.w3.org/2001/XMLSchema" 
    targetNamespace="http://www.example.org/schema" 
    xmlns:tns="http://www.example.org/schema" 
    elementFormDefault="qualified">

    <element name="foo">
        <complexType>
            <sequence>
                <element ref="tns:bar" maxOccurs="unbounded"/>
            </sequence>
        </complexType>
    </element>

    <element name="bar">
        <complexType>
            <attribute name="id" type="ID"/>
        </complexType>
    </element>

</schema>

XML Input (input.xml)

<?xml version="1.0" encoding="UTF-8"?>
<foo xmlns="http://www.example.org/schema">
    <bar id="ABCD"/>
    <bar id="EFGH"/>
    <bar id="IJK"/>
</foo>

Demo

You will need to set the instance of Schema on the DocumentBuilderFactory to get everything to work.

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.*;
import javax.xml.validation.*;
import org.w3c.dom.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new File("src/forum17250259/schema.xsd"));

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setSchema(schema);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document document = db.parse(new File("src/forum17250259/input.xml"));

        Element result = document.getElementById("EFGH");
        System.out.println(result);
    }

}

Suresh Atta · Answer 2 · 2013-06-22T11:50:47.833

1

Have a look at the SAX API, because it is the fastest and least memory-intensive mechanism currently available for dealing with XML documents.

edited Jun 22 '13 at 11:50

answered Jun 22 '13 at 11:37

SAX doesn't seem to support XPath. – vtd-xml-author Jul 18 '13 at 19:30

score 1 · Answer 3 · answered Jun 22 '13 at 12:09

It depends on the type of query you want to perform.

If, for example, you just want to find a node by ID and then read out its textual contents, SAX will be very fast, but it will require a little coding to write a SAX handler (probably extended from this).
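A minimal sketch of such a handler, assuming the ID lives in an unqualified attribute named id and that the text of nested elements should be included. The class and method names (SaxIdLookup, IdTextHandler, findTextById) are made up for illustration, and DefaultHandler is just one convenient base class to extend:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxIdLookup {

    /** Collects the text of the first element whose "id" attribute equals targetId. */
    static class IdTextHandler extends DefaultHandler {
        private final String targetId;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0;          // 0 = outside the matched element
        private boolean found = false;  // true once the first match has been seen

        IdTextHandler(String targetId) {
            this.targetId = targetId;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                depth++;                // a descendant of the matched element
            } else if (!found && targetId.equals(atts.getValue("id"))) {
                found = true;
                depth = 1;              // entering the matched element
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }

        /** Returns the collected text, or null if no element matched. */
        String result() {
            return found ? text.toString() : null;
        }
    }

    static String findTextById(String xml, String id) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        IdTextHandler handler = new IdTextHandler(id);
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.result();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item id='EFGH'>other</item>"
                   + "<item id='ABCD'>hello <b>world</b></item></root>";
        System.out.println(findTextById(xml, "ABCD"));
    }
}
```

This sketch still reads the whole document after the match. It never builds a tree, though, so memory use stays small regardless of file size.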

If, on the other hand, you want to perform a fairly complex query along the lines of "get the third ancestor node of foo where foo has a child bah", you are pretty much going to have to use XPath, as the SAX handler would be impossibly complex.
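For comparison, that kind of query is a one-liner in XPath. The sketch below uses the JDK's javax.xml.xpath API and assumes "third ancestor" means the third-nearest ancestor of foo; the class and method names are invented for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AncestorQuery {

    /** Returns the name of the third-nearest ancestor of the first foo that has a bah child, or null. */
    static String thirdAncestorName(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // foo[bah]        : only foo elements with a bah child
        // ancestor::*[3]  : ancestor is a reverse axis, so [3] is the third-nearest ancestor
        Node node = (Node) xpath.evaluate("//foo[bah]/ancestor::*[3]", doc, XPathConstants.NODE);
        return node == null ? null : node.getNodeName();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a><b><c><foo><bah/></foo></c><foo/></b></a>";
        System.out.println(thirdAncestorName(xml));
    }
}
```

Expressing the same thing in a SAX handler would mean tracking the element stack and deferring the decision until foo's children have been seen, which is where the complexity comes from.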

score 1 · Answer 4 · answered Jul 18 '13 at 19:39

The JDK's default XPath engine is notorious for its slow performance. You should consider Jaxen or vtd-xml. See, for example, the following article:

http://fahdshariff.blogspot.com/2010/08/faster-xpaths-with-vtd-xml.html

vtd-xml-author

You may want to mention that vtd-xml is a commercial product. – Thorbjørn Ravn Andersen Sep 22 '15 at 15:08
vtd-xml is licensed under the GPL, but it is also licensed commercially. – vtd-xml-author Sep 22 '15 at 22:39

score -2 · Answer 5 · answered Jun 22 '13 at 11:48

Give the Jackson library a try; it's one of the fastest XML/JSON p

Danny02
