12

I have the following problem:

I've got an XML file (approx. 1 GB) and have to iterate up and down through it (i.e. not sequentially, one element after the other) in order to get the required data and perform some operations on it. Initially I used the Java DOM package, but obviously, while parsing the XML file, the JVM reaches its maximum heap space and halts.

To overcome this problem, one of the solutions I came up with was to find another parser that iterates over each element in the XML, and to store its contents in a temporary SQLite database on my hard disk. This way, the JVM's heap is not exceeded, and once all the data is loaded, I can ignore the XML file and continue my operations on the temporary SQLite database.

Is there another way to tackle the problem at hand?

cgval
  • 1,096
  • 5
  • 14
  • 31
  • 1
    use jaxb to parse xml – Biswajit Feb 28 '13 at 09:58
  • 1
    As others have said you need to use a SAX parser instead of a DOM parser, it will do exactly what you need. Read this: http://stackoverflow.com/questions/6828703/difference-about-sax-and-dom – cowls Feb 28 '13 at 10:05
  • If you cannot hold the whole DOM tree, you must find a way to do your processing sequentially. Is that possible? Can you show an XSLT which does what you need? – Thorbjørn Ravn Andersen Feb 28 '13 at 10:07
  • 1
    For parsing large xml files always use SAX Parser. Refer following link [StackOverflow](http://stackoverflow.com/questions/3825206/why-is-sax-parsing-faster-than-dom-parsing-and-how-does-stax-work) – Yogesh Kulkarni Feb 28 '13 at 10:08
  • What do you mean by non-sequential operation? Are there different data in your XML, and you have cross-references between them? Either XML parser you use, you have to store all the data in memory. Rather **try giving more `-Xmx` to the JVM**, it should easily handle 1G. – gaborsch Feb 28 '13 at 10:10
  • @GaborSch ... I've already tried increasing the Java heap space and the same exception occurred. Also, since the size might increase further, I would rather opt for a solution that will work irrespective of this limit. By non-sequential, I mean, for example, that I might need data from element 2 while in element 5. And yes, as you pointed out, there is different data in my XML and I need to cross-reference it. I think, as the other lads pointed out, it would be best to use a SAX parser, which will store ONLY the current element tag in memory (rather than the whole XML structure). – cgval Feb 28 '13 at 10:19
  • I agree with the SAX approach, but - since you have cross-references - you have to store the whole data in memory, so later (in a second round) you can interpret your data and resolve all references. With SAX you can store it in a more memory-efficient form: in a fixed structure you don't store the XML node names. That helps to a certain extent, but it's still limited by memory. If you need something not bound by memory limits, you should use the database approach you are already using. – gaborsch Feb 28 '13 at 10:39

4 Answers

13

SAX (Simple API for XML) will help you here.

Unlike a DOM parser, a SAX parser creates no in-memory representation of the XML document, and so is faster and uses less memory. Instead, the SAX parser informs clients of the XML document's structure by invoking callbacks, that is, by invoking methods on an org.xml.sax.helpers.DefaultHandler instance provided to the parser.

Here is an example implementation:

SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
DefaultHandler handler = new MyHandler();
parser.parse("file.xml", handler);

Where in MyHandler you define the actions to be taken when events like start/end of document/element are generated.

class MyHandler extends DefaultHandler {

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
    }

    // To take specific actions for each chunk of character data (such as
    // adding the data to a node or buffer, or printing it to a file).
    @Override
    public void characters(char ch[], int start, int length)
            throws SAXException {
    }

}
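As pointed out in the comments, `characters()` may be invoked several times for a single text node, so the character data should be buffered. A minimal sketch of that buffering (the element name `record` is just an illustrative placeholder):

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BufferingHandler extends DefaultHandler {

    // Accumulates text content; characters() gives no guarantee that
    // a text node arrives in a single call.
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        buffer.setLength(0); // reset for the new element's text content
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        buffer.append(ch, start, length); // may be called more than once
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("record".equals(qName)) {
            // The buffer now holds the complete text of <record>
            System.out.println(buffer.toString().trim());
        }
    }
}
```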
  • 2
    If you have ever done SAX parsing, you probably know that the `characters()` method is also very important, and you have to **buffer** the character data, because it is not guaranteed that the content is handled in one block (that is, two `characters()` calls can occur back to back). I think it is worth mentioning. – gaborsch Feb 28 '13 at 10:44
  • 2
    I didn't mean my solution to be complete. This was only an elementary implementation. Thanks for pointing out though. I'll update my answer with that. –  Feb 28 '13 at 11:04
3

If you don't want to be bound by memory limits, I certainly recommend using your current approach and storing everything in a database.

The parsing of the XML file should be done by a SAX parser, as everybody (including me) has recommended. This way you can create one object at a time and immediately persist it into the database.

For the post-processing (resolving cross-references), you can use SELECTs on the database, create primary keys, indexes, etc. You can also use an ORM (EclipseLink, Hibernate) if you feel comfortable with that.

Actually, I don't really recommend SQLite; it's easier to set up a MySQL server and store the data there. Later you can even reuse the XML data (if you don't delete it).
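The persist-as-you-parse idea can be sketched roughly as follows. To keep the sketch self-contained, the database is hidden behind a `Consumer` sink; in practice the sink would wrap a JDBC `PreparedStatement` and add each value to a batch. The element name `item` is a hypothetical placeholder:

```java
import java.util.function.Consumer;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PersistingHandler extends DefaultHandler {

    private final Consumer<String> sink; // e.g. wraps a PreparedStatement
    private final StringBuilder text = new StringBuilder();

    PersistingHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        text.setLength(0); // start collecting the new element's text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("item".equals(qName)) {       // placeholder element name
            sink.accept(text.toString()); // persist one row per element
        }
    }
}
```

With JDBC, the sink would call `PreparedStatement.addBatch()` and flush with `executeBatch()` every few thousand rows, so memory use stays flat regardless of the XML file's size.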

gaborsch
  • 15,408
  • 6
  • 37
  • 48
  • I wonder how someone can believe that it is easier to set up a whole database server instead of using an embedded database, where you only have to include a JAR file without installing anything. I think for this use case a separate database server would be overkill. Maybe there are some other good reasons to use a database server, but easier to set up? Really? – vanje Mar 01 '13 at 22:04
  • @vanje I didn't mean Oracle :) we're talking about MySQL. Seriously, I cannot believe that it would be a problem for any developer to set up a MySQL server. – gaborsch Mar 01 '13 at 22:16
  • I think every developer should be able to perform a basic installation of both Oracle and MySQL. And I agree with you that Oracle is far more complex than MySQL. But this is not the point. You compared MySQL with SQLite and stated that MySQL would be easier to set up. But you didn't mention what exactly is easier, in your opinion. – vanje Mar 04 '13 at 14:54
1

If you want to use a higher-level approach than SAX, which can be very tricky to program, you could look at streaming XSLT transformations using a recent Saxon-EE release. However, you've been too vague about the precise processing you are doing for me to know whether this will work in your particular case.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
0

If you require a resource-friendly approach to handling very large XML, try this: http://www.xml2java.net/xml-to-java-data-binding-for-big-data/ It allows you to process data in a SAX-like way, but with the advantage of getting high-level events (XML data mapped onto Java objects) and being able to work with these objects directly in your code. So it combines JAXB's convenience with SAX's resource friendliness.

dexter
  • 738
  • 2
  • 10
  • 19