8

I'm currently trying to use JAXB to unmarshal an XML file, but it seems the file is too large (~500 MB) for the unmarshaller to handle. I keep getting java.lang.OutOfMemoryError: Java heap space at:

JAXBContext jc = JAXBContext.newInstance("com.sample.xml");
Unmarshaller um = jc.createUnmarshaller();
Export e = (Export) um.unmarshal(new File("SAMPLE.XML"));

I'm guessing this is because it's trying to load the entire XML file into memory as an object graph, and the file is just too large for the Java heap space.

Is there a more memory-efficient way of parsing large (~500 MB) XML files? Or perhaps an Unmarshaller property that could help me handle a file this large?

Here's what my XML looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- ... -->
<Export xmlns="wwww.foo.com" xmlns:xsi="www.foo1.com" xsi:schemaLocation="www.foo2.com/.xsd">
   <!-- ... -->
   <Origin ID="foooo" />
   <!-- ... -->
   <WorkSets>
      <WorkSet>
         <Work>
            .....
         </Work>
         <Work>
            ....
         </Work>
         <Work>
            .....
         </Work>
      </WorkSet>
      <WorkSet>
         ....
      </WorkSet>
   </WorkSets>
</Export>

I'd like to unmarshal at the WorkSet level, while still being able to read through all of the Work elements in each WorkSet.

TyC

4 Answers

10

What does your XML look like? Typically for large documents I recommend people leverage a StAX XMLStreamReader so that the document can be unmarshalled by JAXB in chunks.

input.xml

In the document below there are many instances of the person element. We can use JAXB with a StAX XMLStreamReader to unmarshal the corresponding Person objects one at a time to avoid running out of memory.

<people>
   <person>
       <name>Jane Doe</name>
       <address>
           ...
       </address>
   </person>
   <person>
       <name>John Smith</name>
       <address>
           ...
       </address>
   </person>
   ....
</people>

Demo

import java.io.*;
import javax.xml.stream.*;
import javax.xml.bind.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        // Stream the document with StAX instead of loading it all at once
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
        xsr.nextTag(); // Advance to the root people element

        JAXBContext jc = JAXBContext.newInstance(Person.class);
        Unmarshaller unmarshaller = jc.createUnmarshaller();
        // Each unmarshal call consumes exactly one person element and leaves the
        // reader positioned after it, so only one Person is in memory at a time.
        while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            Person person = (Person) unmarshaller.unmarshal(xsr);
            // ... process person here ...
        }
    }

}

Person

Instead of matching on the root element of the XML document, we add an @XmlRootElement annotation to the class corresponding to the local root of the XML fragment that we will be unmarshalling (person in this example).

@XmlRootElement
public class Person {
}
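
For instance, a minimal sketch of what Person could look like for the sample document above (the field name follows the sample XML; getters/setters and the address mapping are omitted):

import javax.xml.bind.annotation.*;

@XmlRootElement
@XmlAccessorType(XmlAccessType.FIELD)
public class Person {

    // maps to the <name> child element
    private String name;

    // the <address> element would map to a corresponding Address class in the same way

}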
bdoughan
  • I was getting an error in your last line and was required to (in your example's case) cast `(Person) unmarshaller.unmarshal(xsr);`. Is this correct? – TyC Nov 01 '11 at 18:18
  • How does the XMLStreamReader distinguish between start elements? For example, does it try to create a new instance of Person when it comes around any start element? – TyC Nov 01 '11 at 18:47
  • 1
    @TyC - `XMLStreamReader` is just going to give us access to XML events in depth-first order. The trick is we need to recognize the start element states of portions of the XML we want JAXB to unmarshal. JAXB will then advance the `XMLStreamReader` to the end of that element. Then we look for the next fragment we want to unmarshal from. – bdoughan Nov 01 '11 at 19:03
  • my program isn't entering the `while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT)` loop. As soon as it gets there, the program outputs null. I've updated my XML above; is it because it's hitting other elements before getting to `WorkSet` (or `Person` in your case)? – TyC Nov 01 '11 at 19:19
  • 1
    @TyC - You'll need to play with advancing the `XMLStreamReader` to get things just right. You can ask the `XMLStreamReader` for the name of the current node to see where you are in the traversal (see the sketch below). – bdoughan Nov 01 '11 at 19:28
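
For illustration, a minimal sketch of that advancing for the Export/WorkSets/WorkSet document in the question (the WorkSet class, its JAXB mapping, and the exact element layout are assumptions based on the posted XML, not tested code):

import java.io.*;
import javax.xml.bind.*;
import javax.xml.stream.*;

public class WorkSetDemo {

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("SAMPLE.XML"));

        // Scan forward until the reader sits on the first WorkSet start element,
        // passing over Origin, the WorkSets wrapper, comments, whitespace, etc.
        advanceToWorkSet(xsr);

        JAXBContext jc = JAXBContext.newInstance(WorkSet.class); // WorkSet is an assumed @XmlRootElement class
        Unmarshaller unmarshaller = jc.createUnmarshaller();

        // Unmarshal one WorkSet at a time; each unmarshal call consumes exactly
        // one WorkSet element, so only one is ever held in memory.
        while (xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                && "WorkSet".equals(xsr.getLocalName())) {
            WorkSet workSet = (WorkSet) unmarshaller.unmarshal(xsr);
            // ... process workSet here ...
            advanceToWorkSet(xsr);
        }
    }

    // Moves the reader to the next WorkSet start element, or to the end of the document
    private static void advanceToWorkSet(XMLStreamReader xsr) throws XMLStreamException {
        while (xsr.hasNext() && !(xsr.getEventType() == XMLStreamConstants.START_ELEMENT
                && "WorkSet".equals(xsr.getLocalName()))) {
            xsr.next();
        }
    }
}

Calling xsr.getLocalName() at each START_ELEMENT is the "ask the XMLStreamReader for the name of the current node" part from the comment above; everything else just skips events that are not WorkSet start tags.
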
5

You could increase the heap space using the -Xmx startup argument.

For large files, SAX processing is more memory-efficient since it's event-driven and doesn't load the entire structure into memory.
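
As an illustration of the event-driven model, a minimal SAX sketch (the element name WorkSet comes from the question; the handler body is just a placeholder):

import java.io.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class SaxDemo {

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("SAMPLE.XML"), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // Called once per start tag as the parser streams through the file;
                // only the current event is held in memory, never the whole document.
                if ("WorkSet".equals(qName)) {
                    // start collecting data for one WorkSet here
                }
            }
        });
    }

}

The trade-off is that you have to assemble your objects yourself from the stream of events, whereas the StAX + JAXB approach above lets JAXB do that per fragment.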

Dave Newton
2

I've been doing a lot of research, in particular with regard to parsing very large input sets conveniently. It's true that you can combine StAX and JAXB to selectively parse XML fragments, but it's not always possible or preferable. If you're interested in reading more on the topic, please have a look at:

http://xml2java.net/documents/XMLParserTechnologyForProcessingHugeXMLfiles.pdf

In this document I describe an alternative approach that is very straightforward and convenient to use. It parses arbitrarily large input sets while giving you access to your data in a JavaBeans fashion.

1

Use SAX or StAX. But if the goal is to have an in-memory object representation of the whole file, you'll still need lots of memory to hold its contents. In that case, your only hope is to increase the heap size using the -Xmx1024m JVM option (which sets the maximum heap size to 1024 MB).

JB Nizet