
I have an XML document and I need to make it searchable via a webapp. The document is currently only 6 MB, but it could become extremely large, so from my research SAX seems the way to go.

So my question is, given a search term, do I:

  1. Load the document into memory once (into a list of beans) and then search that list whenever needed? Or

  2. Parse the document looking for the desired search term, add only the matches to the list of beans, and repeat this process for each search?
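For concreteness, option 1 as I understand it would look roughly like this (the `EntryCache`/`Entry` names are just illustrative, not real code from my app):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of option 1: parse the XML once at startup into beans,
// keep them in memory, and search the cached list on every request.
public class EntryCache {

    // Simple bean holding one searchable record from the XML.
    public static class Entry {
        final String text;
        Entry(String text) { this.text = text; }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Populated once at startup, e.g. from a SAX/StAX parse.
    public void add(String text) {
        entries.add(new Entry(text));
    }

    // Each search is a linear scan over the in-memory list;
    // the XML file is never touched again.
    public List<Entry> search(String term) {
        List<Entry> hits = new ArrayList<>();
        for (Entry e : entries) {
            if (e.text.contains(term)) {
                hits.add(e);
            }
        }
        return hits;
    }
}
```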

I am not very experienced with webapps, but I am trying to figure out the optimal way to approach this. Does anyone with Tomcat, SAX and Java webapp experience have suggestions as to which approach would be better?

Regards, Nate

Nate Uni
  • What do you want to search for in the XML file? Just wondering whether a simple regex could help. – Adi Aug 31 '14 at 08:25
  • "Extremely large" means preprocessing. In SQL terms this would be avoiding full table scans. – Thorbjørn Ravn Andersen Aug 31 '14 at 08:26
  • Can your XML be modified at runtime? How will you perform searches in your XML file? Will you use it as a (very) small database or similar? What kind of result do you expect from your query: the current value of the line searched, a small relevant part of the XML, or other data outside the XML but indexed through it? These are questions we need answered to get a concrete understanding of your real problem. I could even suggest loading the XML into a String and performing the searches on that String only, but maybe it's not the best idea. – Luiggi Mendoza Aug 31 '14 at 08:30
  • How often does the XML document change, and how often are reads/writes performed? – Jurica Krizanic Aug 31 '14 at 08:31
  • What is the purpose of this huge XML file? How would your application interface with it (what would you search in it and how often)? Is it possible to split the XML up into segmented parts? Do you have an XSD for the XML? Does the content of the XML change at runtime? – Filip Aug 31 '14 at 08:40

3 Answers


Assuming your search field is known to you in advance, for example, let the structure of the XML be:

<a>...</a>
<x>
  <y>search text1</y>
  <z>search text2</z>
</x>
<b>...</b>

and say the search has to be made on 'x' and its children: you can achieve this using a StAX parser and JAXB.

To understand the difference between STAX and SAX, please refer:

When should I choose SAX over StAX?

Using these APIs you avoid storing the entire document in memory. With the StAX parser you stream through the document, and when you encounter the 'x' tag you load just that fragment into memory (as Java beans) using JAXB.

Note: only x and its children will be loaded into memory, not the entire document parsed so far. Do not use any approach based on DOM parsers.

Sample code to load only the part of the document where the search field is present:

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stream.StreamSource;

XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("file");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);

// Advance the cursor until the <x> element is reached;
// everything before it is skipped, not stored.
xsr.nextTag();
while (!xsr.getLocalName().equals("x")) {
    xsr.nextTag();
}

// Unmarshal only the <x> fragment into a bean.
JAXBContext jc = JAXBContext.newInstance(X.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
JAXBElement<X> jb = unmarshaller.unmarshal(xsr, X.class);
xsr.close();

X x = jb.getValue();
System.out.println(x.y);

Now you have the field content to return as the appropriate result. When the user searches again for the same field under 'x', serve the result from memory and avoid parsing the XML again.
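The unmarshalling above assumes X is a JAXB-annotated class matching the <x> element. A minimal sketch (field names follow the sample XML; everything else is assumed):

```java
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Minimal JAXB bean for the <x> fragment shown above.
@XmlRootElement(name = "x")
@XmlAccessorType(XmlAccessType.FIELD)
public class X {

    @XmlElement
    public String y;   // "search text1" in the sample

    @XmlElement
    public String z;   // "search text2" in the sample
}
```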

BatScream

When you say that your XML file could be very large, I assume you do not want to keep it all in memory. If you want it to be searchable, I understand that you want indexed access, without a full read each time. IMHO, the only way to achieve that is to parse the file, load the data into a lightweight file database (Derby, HSQL or H2), and add the relevant indexes. Databases allow indexed search over data that does not fit in memory; XML files do not.
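A rough sketch of that preprocessing step, assuming a JDBC connection to one of those databases (the table and column names here are made up for illustration):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: stream the XML once, load each record into a lightweight
// file database, and index the searched column. Obtain the connection
// with e.g. DriverManager.getConnection("jdbc:h2:./searchdb") once the
// H2 driver is on the classpath.
public class XmlIndexer {

    static final String CREATE_TABLE =
        "CREATE TABLE entries (id BIGINT AUTO_INCREMENT PRIMARY KEY, content VARCHAR(4000))";

    // A plain index speeds exact and prefix lookups; a '%term%'
    // substring search would need the database's full-text support.
    static final String CREATE_INDEX =
        "CREATE INDEX idx_content ON entries(content)";

    static final String INSERT_ENTRY =
        "INSERT INTO entries (content) VALUES (?)";

    static final String SEARCH =
        "SELECT content FROM entries WHERE content LIKE ?";

    // Called once at startup, after streaming through the XML.
    public static void createSchema(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(CREATE_TABLE);
            st.execute(CREATE_INDEX);
        }
    }

    // Called for each record the XML parser emits.
    public static void insert(Connection conn, String content) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_ENTRY)) {
            ps.setString(1, content);
            ps.executeUpdate();
        }
    }
}
```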

Serge Ballesta

Searching the file using XPath or XQuery is likely to be very fast (quite fast enough unless you are talking thousands of transactions per second). What takes time is parsing the file - building a tree in memory so that XPath or XQuery can search it. So (as others have said) a lot depends on how frequently the contents of the file change. If changes are infrequent, you should be able to keep a copy of the file in shared memory, so the parsing cost is amortized over many searches. But if changes are frequent, things get more complicated. You could try keeping a copy of the raw XML on disk, and a copy of the parsed XML in memory, and keeping the two in sync. Or you could bite the bullet and move to using an XML database - the initial effort will pay off in the end.

Your comment that "SAX is the way to go" would only be true if you want to parse the file each time you search it. If you're doing that, then you want the fastest possible way to parse the file. But a much better way forward is to avoid parsing it afresh on each search.
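A minimal sketch of the parse-once approach, using only the JDK's built-in DOM and XPath APIs (the class name is illustrative, and the element names follow the sample XML in the first answer):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Parse the XML into a DOM tree once, keep the tree in memory,
// and run XPath queries against it on each search request.
public class XPathSearch {

    private final Document doc;   // cached, parsed tree
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public XPathSearch(String xml) throws Exception {
        doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the <y> elements whose text contains the term.
    // Note: the term is interpolated naively here, which is fine for
    // a sketch but not for untrusted input.
    public NodeList search(String term) throws Exception {
        String expr = "//y[contains(text(), '" + term + "')]";
        return (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
    }
}
```

Parsing happens once in the constructor; each `search` call only walks the in-memory tree, which is where the amortization described above comes from.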

Michael Kay