
I have an XML parser using StAX and I am using it to parse a huge file. However, I want to bring the time down as low as possible. I am reading the values into an array and sending them off to another function to evaluate. I am targeting the displayName tag, and the parser should move on to the next XML document as soon as it grabs the name instead of reading the whole file. I am looking for the fastest approach.

Java:


import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Iterator;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.*;

public class Driver {

    private static boolean bname;

    public static void main(String[] args) throws FileNotFoundException, XMLStreamException {

        File file = new File("C:\\Users\\Robert\\Desktop\\root\\SDKCode\\src\\main\\java\\com\\example\\xmlClass\\data.xml");


        parser(file);
    }

    public static void parser(File file) throws FileNotFoundException, XMLStreamException {

        bname = false;


        XMLInputFactory factory = XMLInputFactory.newInstance();


        XMLEventReader eventReader = factory.createXMLEventReader(new FileReader(file));


        while (eventReader.hasNext()) {

            XMLEvent event = eventReader.nextEvent();

            // This will trigger when the tag is of type <...>
            if (event.isStartElement()) {
                StartElement element = (StartElement) event;


                Iterator<Attribute> iterator = element.getAttributes();
                while (iterator.hasNext()) {
                    Attribute attribute = iterator.next();
                    QName name = attribute.getName();
                    String value = attribute.getValue();
                    System.out.println(name + " = " + value);
                }


                if (element.getName().toString().equalsIgnoreCase("displayName")) {
                    bname = true;
                }

            }


            if (event.isEndElement()) {
                EndElement element = (EndElement) event;


                if (element.getName().toString().equalsIgnoreCase("displayName")) {
                    bname = false;
                }


            }


            if (event.isCharacters()) {
                // Depending upon the tag opened the data is retrieved .
                Characters element = (Characters) event;

                if (bname) {
                    System.out.println(element.getData());
                }

            }
        }
    }
}

XML:

<?xml version="1.0" encoding="UTF-8"?>
<results
        xmlns="urn:www-collation-com:1.0"
        xmlns:coll="urn:www-collation-com:1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:www-collation-com:1.0
              urn:www-collation-com:1.0/results.xsd">

    <WebServiceImpl array="1"
        guid="FFVVRJ5618KJRHNFUIRV845NRUVHR" xsi:type="coll:com.model.topology.app.web.WebService">
        <isPlaceholder>false</isPlaceholder>
        <displayName>server.servername1.siqom.siqom.us.com</displayName>
        <hierarchyType>WebService</hierarchyType>
        <hierarchyDomain>app.web</hierarchyDomain>
    </WebServiceImpl>
</results>

<?xml version="1.0" encoding="UTF-8"?>
<results
        xmlns="urn:www-collation-com:1.0"
        xmlns:coll="urn:www-collation-com:1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:www-collation-com:1.0
              urn:www-collation-com:1.0/results.xsd">

    <WebServiceImpl array="1"
        guid="FFVVRJ5618KJRHNFUIRV845NRUVHR" xsi:type="coll:com.model.topology.app.web.WebService">
        <isPlaceholder>false</isPlaceholder>
        <displayName>server.servername2.siqom.siqom.us.com</displayName>
        <hierarchyType>WebService</hierarchyType>
        <hierarchyDomain>app.web</hierarchyDomain>
    </WebServiceImpl>
</results>

<?xml version="1.0" encoding="UTF-8"?>
<results
        xmlns="urn:www-collation-com:1.0"
        xmlns:coll="urn:www-collation-com:1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:www-collation-com:1.0
              urn:www-collation-com:1.0/results.xsd">

    <WebServiceImpl array="1"
        guid="FFVVRJ5618KJRHNFUIRV845NRUVHR" xsi:type="coll:com.model.topology.app.web.WebService">
        <isPlaceholder>false</isPlaceholder>
        <displayName>server.servername3.siqom.siqom.us.com</displayName>
        <hierarchyType>WebService</hierarchyType>
        <hierarchyDomain>app.web</hierarchyDomain>
    </WebServiceImpl>
</results>


etc...

kane_004

3 Answers


There are a few ways forward.

Splitting the file

First, if your huge file is actually several concatenated XML files (as in the sample you have shown), then this huge file is not a (valid) XML file, and I advise splitting it before handing it to a strict XML parsing library (StAX, DOM, SAX, XSLT, whatever...).

A valid XML file only has one prolog and one root element.

You could use the XML prolog as a split marker, using pure IO / byte level APIs (no XML involved).

Each one of the splits can then be treated as a single XML "file" (independently if need be, for multithreading purposes). I do not mean "file" literally; it could be a chunk of byte[] split from the original "huge file".
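
As a sketch of that splitting idea (class and method names are made up; the sample assumes the whole content fits in a String for simplicity — for a really huge file you would run the same scan over a buffered byte stream instead):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a concatenation of XML documents on the
// "<?xml" prolog marker, purely at the text level (no XML parser involved).
public class XmlSplitter {

    static List<String> split(String concatenated) {
        List<String> docs = new ArrayList<>();
        int start = concatenated.indexOf("<?xml");
        while (start >= 0) {
            // Each chunk runs from one prolog to the next (or to end of input).
            int next = concatenated.indexOf("<?xml", start + 1);
            docs.add(next >= 0
                    ? concatenated.substring(start, next)
                    : concatenated.substring(start));
            start = next;
        }
        return docs;
    }

    public static void main(String[] args) {
        String huge = "<?xml version=\"1.0\"?><results><displayName>a</displayName></results>\n"
                    + "<?xml version=\"1.0\"?><results><displayName>b</displayName></results>";
        // Each chunk can now be fed to a strict XML parser on its own.
        System.out.println(split(huge).size());
    }
}
```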

Speeding up XML Parsing

About your code

Using XMLEventReader, there are a few things in your sample code that stick out.

  1. You should not iterate over the attributes as you do. Unless I'm missing something, you are not doing anything with this iteration.
  2. Once you are at the START_ELEMENT whose localName is displayName, you should call getElementText, which, being internal to the parser, has a few optimization tricks for speed that your while loop cannot achieve. This call leaves the reader at the matching END_ELEMENT, so in effect you simplify your code quite a bit (only check for the displayName START_ELEMENT, that's all).
  3. Your XML seems well formed, so you can skip the rest of the parsing as soon as you have found a result.
  4. XMLInputFactories are meant to be reused, so do not create one per file; create one shared instance.
  5. XML(xxx)Readers are closeable, so close them.
  6. Some XML libraries have faster character decoding schemes than the ones the JDK provides (knowing the internals of XML encodings allows them that), so if you have a valid XML prolog declaring the encoding at the beginning of the file, you should feed your factory a File object or an InputStream, not a Reader.
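
A minimal sketch applying points 2 to 6 (the class name, method name, and inline sample document are made up for illustration; in your case you would pass a FileInputStream):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class DisplayNameReader {

    // Point 4: one shared factory — creating it is expensive, reusing it is cheap.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Returns the first <displayName> text, or null if none found.
    static String firstDisplayName(InputStream in) throws XMLStreamException {
        // Point 6: feed an InputStream, not a Reader.
        XMLEventReader reader = FACTORY.createXMLEventReader(in);
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "displayName".equals(event.asStartElement().getName().getLocalPart())) {
                    // Point 2: let the parser read the text-only element content;
                    // point 3: returning here skips the rest of the document.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close(); // point 5
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<results><isPlaceholder>false</isPlaceholder>"
                   + "<displayName>server.example.com</displayName></results>";
        System.out.println(firstDisplayName(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```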

Switching to XMLStreamReader

Other than that, you'd get faster performance out of XMLStreamReader than XMLEventReader. This is because XMLEvent instances are costly, thanks to their ability to stay usable even after the parser that created them has moved on. This means an XMLEvent is a relatively heavyweight object that holds every possible bit of information relevant at the time of its creation (the namespace context, all attributes, ...), which has a cost to build and a cost to hold in memory.

Events may be cached and referenced after the parse has completed.

XMLStreamReader does not emit any events, so it does not pay this price. Seeing that you only need to read a text value and have no use for the XMLEvent after parsing, the stream reader will yield better performance.
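
The same lookup with the cursor API might look like this (again a sketch with a made-up sample document; the cursor moves forward in place instead of allocating an event object per item):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CursorReader {

    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    static String firstDisplayName(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = FACTORY.createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                // next() advances the cursor; no event object is allocated.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "displayName".equals(reader.getLocalName())) {
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<results><displayName>server.example.com</displayName></results>";
        System.out.println(firstDisplayName(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```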

Switching to a faster XMLStreamReader

Last time I checked (a bit too long ago), Woodstox was quite a bit faster than the JDK's standard StAX implementation (derived from Apache Xerces). But there might be faster kids around.

Try something other than XML?

I highly doubt you'd get faster performance out of any other parsing technology (SAX is usually equivalent, but you do not really have the option to quit parsing as soon as you have found your relevant tag). XSLT is pretty fast, but the amount of power it offers comes with a performance price (usually some kind of lightweight DOM tree is built). The same goes for XPath: the expressiveness of the expressions usually implies some kind of complex structure being kept underneath. DOM is, of course, generally much slower.

What about not doing XML ?

It should probably be used only as a last resort, if every other bit of optimization has already been pulled, and you know for a fact that your XML processing is the bottleneck (not the IOs, not anything else, just the XML processing in and of itself).

As @MichaelKay states in the comments, not using XML tools may break at any point in the future because the way the files are created, while being completely equivalent in XML, might evolve and break a simple text based tool.

Using purely text-based tools, you might get fooled by a change in the namespace declarations, varying line breaks, entity encoding, external references, and many other XML-specific subtleties, all to get a fraction of extra performance.

Multi-threading your process

The use of multithreading could be a solution but it is not without caveats.

If your process runs in a typical EE server implementation, with advanced configurations and any kind of decent load, multithreading is not always a win, because the system may already be lacking the resources to spawn additional threads, and/or you may be defeating internal optimizations of the server by creating threads outside of its managed facilities.

If your process is a so-called lightweight application, or if its typical usage entails only a few users using it simultaneously, it is less likely that you would run into such issues and you might consider spawning an ExecutorService to do the XML parsing in parallel.

Another thing to consider is the IO. The XML processing of individual files, CPU-wise, should profit as much as possible from the parallelisation of the parsing. But you might be bottlenecked by other parts of the process, usually the IO. If you can parse XML faster on a single CPU than you can pull data off the disk, then parallelisation is of no use; you'd get many threads waiting for the disk, which might starve your system for not much (if any) gain. So you have to tune accordingly.
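
A hedged sketch of what that could look like (sample documents are inline strings for illustration; this assumes your XMLInputFactory implementation tolerates concurrent reader creation — if in doubt, use one factory per thread):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ParallelParse {

    // Assumed safe for concurrent createXMLStreamReader calls; otherwise
    // synchronize around it or hold one factory per worker thread.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    static String displayName(String xml) throws Exception {
        XMLStreamReader r = FACTORY.createXMLStreamReader(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "displayName".equals(r.getLocalName())) {
                    return r.getElementText();
                }
            }
            return null;
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = List.of(
                "<results><displayName>server1</displayName></results>",
                "<results><displayName>server2</displayName></results>");

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<String>> futures = new ArrayList<>();
        for (String doc : docs) {
            futures.add(pool.submit(() -> displayName(doc)));
        }
        for (Future<String> f : futures) {
            System.out.println(f.get()); // results come back in submission order
        }
        pool.shutdown();
    }
}
```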

Changing the process

If you're stuck at reading a "huge file" or thousands of small files in a single unit of work, it might be a good opportunity to step back and look at your process.

  1. Reading thousands of small files has a cost in terms of IO and system calls, which in effect are blocking calls. Your Java process has to wait for data coming out of the system-level stuff. If you have a way to minimise the number of system calls (open fewer files, use larger buffers...), this could be a win. I mean: reading a single tar file (containing 2000 small XML files of a few KBs each) can usually be achieved faster than reading 2000 individual files.

  2. Doing the work pre-emptively / on the fly. Why would you wait until the user asks for the data to parse the XMLs? Would it not be possible to parse them as soon as the data arrives in the system (maybe asynchronously?)? That would save you the trouble of reading the data from the disk, and might give you a chance to plug into a process that would have parsed the file anyway, saving time on both occasions. And then you'd only have to query for the results (in a database of sorts) when the user request arrives.

Going forward

You can not build performance without measuring stuff.

So : measure.

How much does the IO cost ?

How much does the XML processing cost? And what part of it? (In your sample code, just the needless initialization of an XMLInputFactory per file means there is a LOT to be gained, if you had just measured it with a profiler.)

How much does the other stuff in your service call cost ? (Do you connect to a DB before / after the call ? At each file ? Could that be done differently).

If you are still stuck, you may edit your question with those findings, to get further help.

GPI
  • -1 for the "try not doing XML" advice. You don't want to sacrifice the robustness, maintainability, and future-proofing of your application for a few grains of added performance unless you're totally desperate. We get so many SO questions from people with broken workflows caused by cutting corners like this. – Michael Kay Jan 02 '20 at 18:18
  • Thanks. I do agree with your comment to a large extent. I mentioned (meant to mention, rather than advise, but I'll edit) this only because the problem is presented by the OP as the reading of a single huge file that is not an XML but a « cat » of individual XMLs, so messing around with text-level inputs was already implied by the original format. I'll make this clearer. – GPI Jan 02 '20 at 20:30

As I can see multiple XML files for parsing, you can use multithreading to parse several XML files at a time and store the objects in a thread-safe list such as CopyOnWriteArrayList or a thread-safe map such as ConcurrentHashMap. If you are parsing with a StAX parser, it is already optimized and is meant for bigger XML files. Besides, if you do not require all the data from the XML, you can use XPath; note that XPath and streaming XML parsing are different things.
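
To address the namespace problem raised in the comments: XPath 1.0 has no notion of a default namespace, so an unprefixed step like //WebServiceImpl means "element in no namespace" and will not match the OP's documents. You have to bind your own prefix to the document's namespace URI. A minimal sketch (the inline document is shortened for illustration; the prefix "c" is arbitrary):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathNamespaceDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<results xmlns=\"urn:www-collation-com:1.0\">"
                   + "<WebServiceImpl><displayName>server1</displayName></WebServiceImpl>"
                   + "</results>";

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // without this, namespaced XPath cannot match
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "c" to the document's default namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "c".equals(prefix) ? "urn:www-collation-com:1.0"
                                          : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        // Every step that targets a namespaced element must carry the prefix.
        String name = (String) xpath.evaluate(
                "//c:WebServiceImpl/c:displayName/text()", doc, XPathConstants.STRING);
        System.out.println(name);
    }
}
```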

PythonLearner
  • So XPath looks like a very good option to achieve what I want, but for some reason it can't read my XML because I have the ```xmlns``` and ```xmlns:coll``` inside. I am not sure why, but when I delete all of this content it works. Not sure if there is a way to ignore it, without deleting it. – kane_004 Jan 02 '20 at 17:02
  • Yes, you can use XPath with xml namespace also. – PythonLearner Jan 02 '20 at 17:16
  • What do you mean by that? – kane_004 Jan 02 '20 at 17:18
  • This is the code I added ```XPathExpression expr = xpath.compile("//WebServiceImpl/displayName/text()");``` – kane_004 Jan 02 '20 at 17:19
  • If you want you can use XPath API with XML namespace. – PythonLearner Jan 02 '20 at 17:19
  • @kane_004 Search this site for "XPath default namespace"; you will find hundreds of answers telling you how to use XPath with namespaced XML. – Michael Kay Jan 02 '20 at 18:07
  • @kane_004, as suggested by Michael Sir, you can refer this SO as example. https://stackoverflow.com/questions/3939636/how-to-use-xpath-on-xml-docs-having-default-namespace – PythonLearner Jan 02 '20 at 18:16
  • @Deb So I found out I have this statement in my code ```factory.setNamespaceAware(true);``` but not sure if it is working? – kane_004 Jan 02 '20 at 21:22
  • If you have problems with DOM and namespaces that's a completely separate issue from performance and is best handled in a different question. – Michael Kay Jan 03 '20 at 00:50

Where are the numbers? You can't tackle performance problems without measurements. What performance are you achieving? Is it chronically bad, or is it already close to the best you can reasonably expect?

There's only one performance "blunder" I can see in your code, and that's creating a new parser factory for each file (creating the factory is very expensive; it involves examining every JAR on the classpath). But then you confuse me: you say you are parsing one huge file (what does "huge" mean, actually?), but what you've shown seems to be a concatenation of many small XML documents. The two use cases are quite different from a performance point of view: with lots of small documents, initialising the parser is often a large part of the total cost.
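
For instance, a quick way to eyeball the factory cost is to time it directly — the sketch below is not a rigorous benchmark (no warm-up, a single run, and the first newInstance call may also pay one-time classloading costs), just a way to see the order of magnitude:

```java
import javax.xml.stream.XMLInputFactory;

public class FactoryCost {
    public static void main(String[] args) {
        int docs = 50;

        // Simulate the "fresh factory per document" pattern from the question.
        long t0 = System.nanoTime();
        for (int i = 0; i < docs; i++) {
            XMLInputFactory.newInstance();
        }
        long perDoc = System.nanoTime() - t0;

        // Versus creating one shared factory a single time.
        long t1 = System.nanoTime();
        XMLInputFactory shared = XMLInputFactory.newInstance();
        long sharedCost = System.nanoTime() - t1;

        System.out.println("fresh factory per doc: " + perDoc / 1_000_000 + " ms");
        System.out.println("one shared factory:    " + sharedCost / 1_000_000 + " ms");
    }
}
```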

Michael Kay
  • So yes, the XML will be provided through an external process using an API. The parser will have to read about 2,400 XML files and parse the ```displayName``` out of each of them. Using the DOM parser I was able to achieve it in about 1 minute 30 seconds. Using StAX, which is the process above, I was able to do it in 19 seconds. I am trying to make it as close to 0 as possible because I need to post the results. – kane_004 Jan 02 '20 at 18:32
  • 120 files / second means you already spend a fair amount of time in IO and system calls. What's left if you factor it out (e.g. measure the IO wait time of your process)? What's left also if you factor out the parser initialization? (Or the other way around: measure how fast you can read - but not process at all - the chars of these files, because that is your true, unbeatable « zero seconds ».) On the other hand, by altering your process, don't you have an opportunity to read the files from an archive (tar?)? That would eliminate thousands of system calls, as you'd read only one file. – GPI Jan 02 '20 at 20:26
  • @GPI So I found out I have this statement in my code ```factory.setNamespaceAware(true);``` but not sure if it is working? – kane_004 Jan 02 '20 at 21:23
  • I'm a little surprised by the difference between DOM and StAX, but you still haven't answered my question about file sizes -- how big is "huge"? Is it big enough to cause memory thrashing? And really, you need to set a target based on business requirements. "As close as possible to zero" is like trying to build a skyscraper that reaches as high as you can get to the sky; that's not a practical engineering target that you can design towards. – Michael Kay Jan 03 '20 at 00:47
  • @MichaelKay So I am reading the XML from an API, parsing the data from the XML ```displayName``` tag, and then placing that value as a string in an array. Depending on how many of those items are in the array, it will affect how long it takes to run. The reason why I want it to be as short time-wise as possible is because I am prompting a screen in the front-end of my site to view the results in that array. I am trying to have as little waiting time as possible. – kane_004 Jan 03 '20 at 02:11