Efficient merging of multiple, large xml files into one

Question

I searched the web and Stack Overflow up and down and found no solution, although I did find solutions for doing this in pure XSLT here.

The problem is that the resulting XML will be several hundred MB in size, so I must do this with SAX in Java. (Please, no XSLT solutions, although I tagged the question with xslt ;-))

Let me explain in more detail. I have multiple XML files (preferably InputStreams) which should be parsed. The files or InputStreams look like this:

inputstream1

<root>
  <doc>
    <tag>test1</tag>
  </doc>
  <doc>
    <tag>test2</tag>
  </doc>
  ...
</root>

inputstream2

<root>
  <doc>
    <tag>test3</tag>
  </doc>
  <doc>
    <tag>test4</tag>
  </doc>
  ...
</root>

inputstream1+inputstream2+...+inputstreamN = resulting xml. It will look like

<root>
  <doc>
    <tag>test1</tag>
  </doc>
  <doc>
    <tag>test2</tag>
  </doc>
  ...
  <doc>
    <tag>test3</tag>
  </doc>
  <doc>
    <tag>test4</tag>
  </doc>
  ...
</root>

Does anyone have a solution or a link for this? Is this possible by implementing a custom InputSource, or should I use a custom ContentHandler? Or is this possible with Joost/STX?

The nice thing about using a ContentHandler is that I could apply some minor transformations (I have already implemented this). The problem is that I don't know of a way to pass multiple files or InputStreams as a single InputSource:

XMLReader xmlReader = XMLReaderFactory.createXMLReader();
xmlReader.setContentHandler(customHandler);
xmlReader.parse(getInputSource()); // only one InputStream possible

Or should I parse the InputStreams directly within my ContentHandler?

score 2 · Answer 1 · answered Feb 16 '10 at 20:42

I haven't done this myself, but I recall seeing an IBM developerWorks article that made this look pretty easy.

It's a bit old now, but try http://www.ibm.com/developerworks/xml/library/x-tipstx5/index.html

This uses StAX instead of SAX. I'm not sure whether current JDKs include StAX; if not, you can probably get it from http://stax.codehaus.org/
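For illustration, here is a minimal sketch of how the StAX approach could look. It is not taken from the linked article. The class name StaxMerge, the merge method and the rootName parameter are made up for this example, and it assumes that every input has a single root element whose children should be copied under one shared root.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of any library.
class StaxMerge {

    // Streams the children of each input's root element into one output document.
    static void merge(List<? extends InputStream> inputs, OutputStream out, String rootName)
            throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");

        // Open the shared document and root element exactly once.
        writer.add(events.createStartDocument("UTF-8", "1.0"));
        writer.add(events.createStartElement("", "", rootName));

        for (InputStream in : inputs) {
            XMLEventReader reader = inputFactory.createXMLEventReader(in);
            int depth = 0; // 0 = outside the input's root, 1 = directly inside it
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    // Skip the input's own root start tag, copy everything nested in it.
                    if (depth++ > 0) writer.add(event);
                } else if (event.isEndElement()) {
                    // Skip the input's own root end tag.
                    if (--depth > 0) writer.add(event);
                } else if (depth > 0) {
                    // Text, comments etc. inside the root; document-level events are dropped.
                    writer.add(event);
                }
            }
            reader.close();
        }

        // Close the shared root and document exactly once.
        writer.add(events.createEndElement("", "", rootName));
        writer.add(events.createEndDocument());
        writer.close();
    }
}
```

Only one event at a time is held in memory, so this should scale to outputs of several hundred MB. It would be called with something like StaxMerge.merge(streams, new FileOutputStream("merged.xml"), "root"); the caller remains responsible for closing the streams.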

Don Roby

+1 The JDK has included StAX since 1.6, as far as I remember. It is much more convenient to use than SAX. – helpermethod Oct 03 '10 at 15:33

score 1 · Answer 2 · answered Feb 16 '10 at 20:08

You may want to look at the paid version of Saxon. It can run XSLT on the fly without holding the full DOM in memory.

Thorbjørn Ravn Andersen

Hmm, in XSLT you can look up the first node and the last node regardless of where you are, i.e. everything needs to be in memory, by definition of XSLT. Or what do you think? – Karussell Feb 16 '10 at 20:12
There is a fairly large subset of XSLT programs that do not need the full DOM trees in memory to execute. – Thorbjørn Ravn Andersen Feb 16 '10 at 21:59

Karussell · Accepted Answer · 2010-02-17T14:30:51.427

I finally managed this via the following snippet:

  // finalHandler is the sink that receives the merged output
  StreamResult finalHandler = new StreamResult(new OutputStreamWriter(System.out));
  // customHandler extends DefaultHandler
  CustomTransformerHandler customHandler = new CustomTransformerHandler(
         finalHandler);
  // open the output document once, before any input is parsed
  customHandler.startDocumentExplicitly();
  InputStream is;
  while ((is = customHandler.createNextInputStream()) != null) {
    // parse each input stream with the same handler
    XMLReader myReader = XMLReaderFactory.createXMLReader();
    myReader.setContentHandler(customHandler);
    myReader.parse(new InputSource(is));
  }
  // close the output document once, after the last input
  customHandler.endDocumentExplicitly();

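The important part was to leave the startDocument and endDocument methods empty, so that parsing an individual stream neither opens nor closes the output document; that happens only once, via startDocumentExplicitly and endDocumentExplicitly. All other methods (characters, startElement, endElement) are redirected to the finalHandler. The customHandler.createNextInputStream method returns null once all input streams have been read.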
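Since CustomTransformerHandler itself is not shown, here is a rough sketch of what such a handler might look like. It is a guess at the idea, not the actual class: MergingHandler, its constructor arguments and the static merge helper are hypothetical names. It also assumes that each input's own root element has to be dropped so that the output ends up with a single root, and that the inputs do not use namespaces (prefix mappings are not forwarded).

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for CustomTransformerHandler.
class MergingHandler extends DefaultHandler {
    private final TransformerHandler out; // identity serializer writing to the target
    private final String rootName;
    private int depth;                    // element depth within the current input

    MergingHandler(OutputStream target, String rootName) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        this.out = factory.newTransformerHandler();
        this.out.setResult(new StreamResult(target));
        this.rootName = rootName;
    }

    // Called once before the first input: opens the document and the shared root.
    void startDocumentExplicitly() throws SAXException {
        out.startDocument();
        out.startElement("", rootName, rootName, new AttributesImpl());
    }

    // Called once after the last input: closes the shared root and the document.
    void endDocumentExplicitly() throws SAXException {
        out.endElement("", rootName, rootName);
        out.endDocument();
    }

    // Deliberately empty: a single input must not open or close the output.
    @Override public void startDocument() { }
    @Override public void endDocument() { }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depth++ > 0) out.startElement(uri, local, qName, atts); // skip input root
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (--depth > 0) out.endElement(uri, local, qName);         // skip input root
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) out.characters(ch, start, length);
    }

    // Drives the handler over all inputs, mirroring the loop in the snippet above.
    static void merge(List<? extends InputStream> inputs, OutputStream target,
                      String rootName) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        MergingHandler handler = new MergingHandler(target, rootName);
        handler.startDocumentExplicitly();
        for (InputStream in : inputs) {
            XMLReader reader = parserFactory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(in));
        }
        handler.endDocumentExplicitly();
    }
}
```

Any minor transformations can go into startElement, endElement and characters before the events are forwarded. Because events are written out as they arrive, memory use should stay roughly constant regardless of the total size.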

vtd-xml-author · Answer 4 · 2016-05-01T07:33:24.970

The most efficient way to merge the files, AFAIK, is to use the byte-level cut-and-paste feature offered by VTD-XML. You parse both files into VTDNav objects, instantiate an XMLModifier object, grab the fragments from the second file, and insert them into the first file. That has got to be far more efficient than SAX. The resulting XML is also written directly to a file, so there is no need to store the result in memory. Below is the complete code in less than 20 lines:

import com.ximpleware.*;
import java.io.*;

public class merge {
    // merge second.xml into first.xml assuming the same encoding
    public static void main(String[] s) throws VTDException, IOException{
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("d:\\xml\\first.xml", false))
            return;
        VTDNav vn1=vg.getNav();
        if(!vg.parseFile("d:\\xml\\second.xml", false))
            return;
        VTDNav vn2 = vg.getNav();
        XMLModifier xm = new XMLModifier(vn1);
        long l = vn2.getContentFragment();
        xm.insertBeforeTail(vn2, l);
        xm.output("d:\\xml\\merged.xml");   
    }
}

Hmm, but I don't want to have them in memory; I just want to pipe them directly to disk. And I don't understand how that would be faster than SAX. — Karussell, Feb 18 '10 at 09:45
OK, thanks for the VTD-XML hint. It looks promising (from what I can read on the SourceForge website). But even if it is 100 times faster, I cannot use it if it needs 100% of the document's size in RAM (or even more): the resulting XML might not even fit into memory. — Karussell, Feb 18 '10 at 22:16
