XML Parsing: Parsing the entire xml for one field

Question

I receive a very large XML document as input, and I need only a single child element from it. Parsing the entire document to retrieve one element seems like performance overkill. Is there a better approach?

One approach would be to parse the XML with the DocumentBuilder API and then use XPath to retrieve the desired field. But the parse method still parses the entire document unnecessarily. Is there an overloaded parse method in any parser implementation that takes an XPath expression and parses only as much of the XML as that expression requires?

Consider using StAX. http://stackoverflow.com/questions/7215931/reading-huge-xml-file-using-stax-and-xpath — Brett Okken, Apr 08 '14 at 12:58

score 1 · Answer 1 · answered Apr 08 '14 at 12:58

What you need is a SAX parser or a similar streaming parser. A SAX parser does not build the whole document in memory; it reports elements as events while it reads the input. Left alone it still reads to the end of the document, but your handler can abort the parse as soon as it has found the element it is looking for.

You can read about SAX parsers on Wikipedia. Also have a look at the Javadoc for the SAX parser.
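A pull parser such as StAX (suggested in the comment on the question) is one such streaming alternative: the caller asks for the next event, so the reading stops when the caller stops asking. The following is a minimal sketch, not a definitive implementation. The class name `StaxFirstElement`, the method `findFirst`, and the sample document are made up for illustration, and the sketch assumes the wanted element contains only text and that namespaces can be ignored.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: returns the text of the first element with the given name.
final class StaxFirstElement {

    static String findFirst(Reader in, String elementName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // Pull events one at a time; nothing past the match is requested.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // getElementText() requires a text-only element
                    // and throws if it meets a child element.
                    return reader.getElementText();
                }
            }
            return null; // element not present
        } finally {
            reader.close(); // does not close the underlying Reader
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch.
        String xml = "<order><id>42</id><items><item>a</item></items></order>";
        System.out.println(findFirst(new StringReader(xml), "id")); // prints 42
    }
}
```

For a real input you would pass a `Reader` or `InputStream` over the large file instead of a string, so that only the part of the file before the match is read.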

score 1 · Answer 2 · edited May 23 '17 at 11:57

Although there is no way around parsing for the proper treatment of your XML data, there is definitely a way around building an in-memory representation of the entire document. Java offers SAX parsing, which is event-based. You can implement an event handler for XML events, ignoring everything on the way to the content that you need, and stopping after retrieving the part that you are looking for.
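The handler described above could look roughly like the sketch below. SAX has no dedicated "stop" call, so a common workaround is to throw an exception from the handler once the value has been captured and catch it around `parse`. The names `SaxFirstElement`, `FoundException`, and `findFirst`, as well as the sample document, are invented for illustration. The sketch assumes a non-namespace-aware parser (the `SAXParserFactory` default) and that the wanted element does not contain a nested element of the same name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxFirstElement {

    // Hypothetical marker exception: thrown to abort parsing once the value is known.
    static final class FoundException extends SAXException {
        final String value;
        FoundException(String value) {
            super("element found");
            this.value = value;
        }
    }

    static final class Handler extends DefaultHandler {
        private final String target;
        private StringBuilder text; // non-null only while inside the target element

        Handler(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (text == null && qName.equals(target)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (text != null && qName.equals(target)) {
                throw new FoundException(text.toString()); // stop here
            }
        }
    }

    static String findFirst(InputSource in, String elementName) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(in, new Handler(elementName));
        } catch (FoundException e) {
            return e.value; // parsing was abandoned on purpose
        }
        return null; // reached the end of the document without a match
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch.
        String xml = "<order><id>42</id><items><item>a</item></items></order>";
        System.out.println(findFirst(new InputSource(new StringReader(xml)), "id")); // prints 42
    }
}
```

Using an exception for control flow is not elegant, but it keeps everything after the match from being processed.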

Here is a tutorial from Oracle showing how to use SAX APIs to retrieve counts of individual tags without building a document in memory.

Some XPath processors can also work from SAX events, so you could potentially feed events to an XPath processor and look for the desired tag that way, too. However, this may be overkill when you only need to fetch a single element.

score 0 · Answer 3 · answered Apr 08 '14 at 13:00

XPath operates over a document object model, so you need a DOM in order to evaluate an XPath expression. Otherwise, what would it evaluate against?

So XPath is out if you don't want to parse the whole document. Your other option is fast SAX parsing: ignore all SAX events until you reach the element that you want, extract the text that you want, and then abandon the rest of the parsing process.

The other option is to go way simpler: use grep.
