13

I have a 1 GB XML file. How can I split it into smaller, well-formed XML files using Java?

Here is an example:

<records>
  <record id="001">
    <name>john</name>
  </record>
 ....
</records>

Thanks.

user534009

4 Answers

19

I would use a StAX parser for this situation. It will prevent the entire document from being read into memory at one time.

  1. Advance the XMLStreamReader to the local root element of the sub-fragment.
  2. You can then use the javax.xml.transform APIs to produce a new document from this XML fragment. This will advance the XMLStreamReader to the end of that fragment.
  3. Repeat step 1 for the next fragment.

Code Example

For the following XML, write each "statement" element to its own file, named after the value of its "account" attribute:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

This can be done with the following code:

import java.io.File;
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception  {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // Pass a stream rather than a FileReader so the parser detects the encoding itself
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream("input.xml"));
        xsr.nextTag(); // Advance to statements element

        new File("out").mkdirs(); // Make sure the output directory exists

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        // Each nextTag() call moves to the next statement start tag,
        // or to the closing statements tag, which ends the loop
        while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
            File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
            // Copies the current fragment to the file and advances the reader past it
            t.transform(new StAXSource(xsr), new StreamResult(file));
        }
        xsr.close();
    }

}
bdoughan
  • 3
    Why involve javax.xml.transform when we can pipe directly from XMLStreamReader to XMLStreamWriter, rolling to a new file between every nth record element? – Ron Mar 02 '11 at 16:37
  • 2
    Yea this is the hot tip, just "pipe" them together and occasionally "close" and reopen the XMLStreamWriter every N records. – Will Hartung Mar 02 '11 at 17:29
  • Can't transform a Source of type javax.xml.transform.stax.StAXSource ?? – Beta033 Jun 27 '12 at 17:42
  • @Beta033 - What version of the JDK are you using? I just reran the code as is and it worked perfectly fine. I am using Oracle JDK 1.7.0 for the Mac. – bdoughan Jun 27 '12 at 18:24
  • Beta033, you might need this: System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl"); – Flyhard Oct 22 '13 at 14:34
  • @BlaiseDoughan - `nextTag`, by definition, does not work if there is no whitespace or line break between a closing and starting `` tag, e.g. ``. Could you recommend how to go about if my XML has tags with no whitespaces in between? – Somu Apr 23 '14 at 11:50
  • @Somu What do you mean "by definition it does not work if there is no whitespace between a closing and starting tag"? The javadoc just states that the `nextTag()` method will skip over any whitespace there is, not that it *needs* to be there. – Frans Jan 03 '17 at 10:50
  • 2
    @Somu I did have to change the `while` loop to `while (xsr.isStartElement() || xsr.nextTag() == XMLStreamConstants.START_ELEMENT)` and add an extra `xsr.nextTag()` just before the `while` loop. Perhaps that will work for you as well? The problem is that the sub-fragment transformation also advances to the next element so that the `nextTag()` moves one level too deep. – Frans Jan 03 '17 at 11:02
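The "pipe and roll over every N records" idea from the comments above can be sketched roughly as follows. This is only a minimal sketch, not the code anyone in this thread posted: it uses the StAX event API (XMLEventReader/XMLEventWriter) instead of XMLStreamReader/XMLStreamWriter because copying events is shorter, and the class name ChunkSplitter, the method split, and the chunk-N.xml file names are made up for illustration. It assumes every direct child element of the root is one record; text and comments directly under the root are dropped. Because it tracks element depth itself rather than calling nextTag(), it does not depend on whitespace between records.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class ChunkSplitter {

    /** Writes chunk-0.xml, chunk-1.xml, ... into outDir and returns the number of files. */
    public static int split(InputStream in, Path outDir, int perFile)
            throws IOException, XMLStreamException {
        if (perFile < 1) {
            throw new IllegalArgumentException("perFile must be at least 1");
        }
        Files.createDirectories(outDir);
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);

        // Skip the prolog and remember the root start tag so every chunk can repeat it.
        XMLEvent first = reader.nextEvent();
        while (!first.isStartElement()) {
            first = reader.nextEvent();
        }
        StartElement root = first.asStartElement();

        Writer file = null;
        XMLEventWriter out = null;
        int depth = 0;    // element depth below the root
        int inChunk = 0;  // records written to the current chunk
        int files = 0;

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                if (out == null) { // first record of a new chunk: open a file, repeat the root tag
                    file = Files.newBufferedWriter(outDir.resolve("chunk-" + files + ".xml"));
                    files++;
                    out = xof.createXMLEventWriter(file);
                    out.add(root);
                }
                depth++;
                out.add(e);
            } else if (e.isEndElement()) {
                if (depth == 0) {
                    break; // closing tag of the root element
                }
                out.add(e);
                if (--depth == 0 && ++inChunk == perFile) {
                    finish(out, file, events, root.getName());
                    out = null;
                    inChunk = 0;
                }
            } else if (depth > 0) {
                out.add(e); // text, comments, etc. inside a record
            }
        }
        if (out != null) {
            finish(out, file, events, root.getName()); // last, partially filled chunk
        }
        reader.close();
        return files;
    }

    // Closes the repeated root element, then the event writer and the underlying file.
    private static void finish(XMLEventWriter out, Writer file, XMLEventFactory events, QName root)
            throws IOException, XMLStreamException {
        out.add(events.createEndElement(root.getPrefix(), root.getNamespaceURI(), root.getLocalPart()));
        out.flush();
        out.close();
        file.close();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: ChunkSplitter <input.xml> <outDir> <recordsPerFile>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(split(in, Paths.get(args[1]), Integer.parseInt(args[2])) + " file(s) written");
        }
    }
}
```

For example, `java ChunkSplitter input.xml out 10000` would put up to 10000 records in each output file. Only one event is held at a time, so memory use should stay small regardless of the input size. The chunks are written as UTF-8 without an XML declaration, which is still well-formed.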
4

Try this, using Saxon-EE 9.3.

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes"/>
    <xsl:template match="record">
      <xsl:result-document href="record-{@id}.xml">
        <xsl:copy-of select="."/>
      </xsl:result-document>
    </xsl:template>
</xsl:stylesheet>

The software isn't free, but if it saves you a day's coding you can easily justify the investment. (Apologies for the sales pitch).

Michael Kay
3

DOM, StAX, and SAX will all do, but each has its own pros and cons.

  1. With DOM, the whole document has to fit in memory, which is not practical for a file this size.
  2. Programming control is easiest with DOM, then StAX, and then SAX.
  3. A combination of SAX and DOM is a better option.
  4. Using a framework which already does this can be the best option. Have a look at Smooks: http://www.smooks.org

Hope this helps

Manish Singh
0

I respectfully disagree with Blaise Doughan. SAX is not only hard to use but also very slow. With VTD-XML, you can use XPath to simplify the processing logic (a 10x code reduction is very common), and it is also much faster because there is no redundant encoding/decoding conversion. Below is the Java code using VTD-XML:

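import java.io.FileOutputStream;
import com.ximpleware.*;

public class split {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        // parseFile reads a local file; the second argument turns on namespace awareness
        if (vg.parseFile("c:\\xml\\input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/records/record");
            byte[] xml = vn.getXML().getBytes();
            int j = 0;
            while (ap.evalXPath() != -1) {
                // Lower 32 bits: byte offset of the fragment; upper 32 bits: its length
                long l = vn.getElementFragment();
                // try-with-resources closes each output file after writing
                try (FileOutputStream out = new FileOutputStream("out" + j + ".xml")) {
                    out.write(xml, (int) l, (int) (l >> 32));
                }
                j++;
            }
        }
    }
}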
vtd-xml-author
  • 1
    My suggestion was to use StAX, not SAX. Also, according to the VTD-XML FAQ (http://vtd-xml.sourceforge.net/faq.html), the 1 GB file size mentioned in the question is the upper bound of VTD-XML's range for handling namespace-qualified XML. – bdoughan Mar 02 '11 at 21:39
  • 2
    There's no significant performance difference between StAX and SAX. Both are as fast as you will get. Some people might find StAX easier to use, however - using an event-based programming model like SAX requires more programming maturity. – Michael Kay Mar 02 '11 at 23:29
  • Without namespace support, VTD-XML supports files up to 2 GB in size. Extended VTD-XML has a file size limit of 256 GB, even with namespace support. – vtd-xml-author Mar 07 '11 at 01:56
  • 1
    This is a piece of your code (the `VTDGen.parseFile()` method): `fis = new FileInputStream(f); byte[] b = new byte[(int) f.length()];`. So you load the whole file into memory. This is really disgusting. – Andremoniy Aug 01 '14 at 11:37
  • @Andremoniy--loading everything in memory is not the issue, as long as it doesn't blow up like DOM that causes out of memory exception... nowadays, 64-bit machine with 4GB memory is so common, am I not right? – vtd-xml-author Aug 28 '16 at 02:21
  • 1
    @vtd-xml-author OP doesn't mention the type of environment this needs to run in. But if it is a multi-user environment and each user might be running this code, then a 4 GB machine will let at most 4 users split up a 1 GB file like this. That might not be enough. – Frans Jan 03 '17 at 10:43