I am trying to split a large XML file into smaller files using Java's SAXParser (specifically the Wikipedia dump, which is about 28 GB uncompressed).

I have a PageHandler class which extends DefaultHandler:

private class PageHandler extends DefaultHandler {

    private StringBuffer text;
    ...

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.append("<" + qName + ">");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</" + qName + ">");

        if (qName.equals("page")) {
            text.append("\n");
            pageCount++;
            writePage();
        }

        if (pageCount >= maxPages) {
            rollFile();
        }
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // Append the reported slice in one call instead of char by char.
        text.append(chars, start, length);
    }
}

So I can write out element content with no problem. My problem is how to get the element tags and attributes: these characters do not seem to be reported. At best I will have to reconstruct them from the arguments passed to startElement, which seems a bit of a pain. Or is there an easier way?

All I want to do is loop through the file and write it out, rolling the output file every so often. How hard can this be :)

Thanks

Richard H
  • VTD-XML is ideally suited for splitting large XML files: the extended edition supports XML up to 256 GB in size, and it also supports memory mapping and XPath. – vtd-xml-author Feb 27 '11 at 19:31

2 Answers

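I'm not quite sure I fully understand what you are trying to do, but the qName passed to startElement is already a String, so you can use it directly as the qualified element name. To get the attributes, loop from 0 to atts.getLength() and call atts.getQName(int index) for each name and atts.getValue(int index) for each value, where atts is the Attributes argument.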
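
As a rough illustration of the point above, here is a minimal sketch of a handler that rebuilds start tags together with their attributes. The class name TagEchoHandler and the echo helper are hypothetical and not from the question; the sketch assumes the default, non-namespace-aware parser, and it does not re-escape special characters in text or attribute values.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;

// Hypothetical handler (not from the question) that rebuilds tags as text.
class TagEchoHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.append('<').append(qName);
        // Rebuild every attribute as name="value".
        for (int i = 0; i < attributes.getLength(); i++) {
            text.append(' ')
                .append(attributes.getQName(i))
                .append("=\"")
                .append(attributes.getValue(i)) // assumption: value needs no escaping
                .append('"');
        }
        text.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        text.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        text.append(chars, start, length);
    }

    String result() {
        return text.toString();
    }

    // Convenience helper: parses an XML string and returns the rebuilt markup.
    static String echo(String xml) {
        try {
            TagEchoHandler handler = new TagEchoHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.result();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

For a real 28 GB dump the buffer would be flushed to the current output file at each closing page element rather than held in memory, as the writePage and rollFile calls in the question suggest.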

Octavian Helm
  • thanks for this. Now my problem is that elements contain XML character references which are being decoded by the parser - so I'm writing out ">" as opposed to "&gt;". Any idea how to work around this? – Richard H Oct 03 '10 at 16:19
  • @Richard: if you use dom4j, as I suggested in my answer, it will automatically encode these special characters for you. It's another benefit of using a library instead of writing XML documents out yourself. – Richard Fearn Oct 03 '10 at 16:46
  • @Richard - yes agreed. thanks for this and your answer to my other question. I'm trying to echo directly without decoding then recoding if possible. – Richard H Oct 04 '10 at 08:37
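
On the decoding issue raised in these comments: as far as I know, characters() only ever delivers the decoded text, so the usual workaround is to re-escape on the way out rather than to echo the original bytes. A minimal sketch, assuming a hypothetical helper class XmlEscaper (not part of SAX or dom4j) that covers only the characters that must be escaped in element text and in double-quoted attribute values:

```java
// Hypothetical helper (not part of SAX or dom4j) for re-escaping decoded text.
class XmlEscaper {

    // Escapes the slice reported by characters() for use as element content.
    static String escapeText(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder(length + 16);
        for (int i = start; i < start + length; i++) {
            char c = chars[i];
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes a value that will be written inside double quotes.
    static String escapeAttribute(String value) {
        char[] chars = value.toCharArray();
        return escapeText(chars, 0, chars.length).replace("\"", "&quot;");
    }
}
```

Inside the handler, characters() would then append XmlEscaper.escapeText(chars, start, length) instead of the raw slice. Note that this normalizes the output: a numeric reference such as &#62; in the input would come back out as &gt;, not in its original form.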

The problem here is that you're writing the XML elements out yourself. Have a look at the XMLWriter class of dom4j - while it's a little old, it makes it really easy to output XML documents by calling its startElement and endElement methods.
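
To illustrate the general idea of handing SAX events to a serializer instead of concatenating strings, here is a sketch that deliberately does not use dom4j: it swaps in the JDK's own TransformerHandler, which plays a comparable role and needs no extra dependency. The class name SaxSerializerSketch and the roundTrip helper are made up for this example, and writing to a StringWriter stands in for the real output file.

```java
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import java.io.StringReader;
import java.io.StringWriter;

// Hypothetical sketch: the JDK serializer receives SAX events and writes XML.
class SaxSerializerSketch {

    static String roundTrip(String xml) {
        try {
            // An identity TransformerHandler is a ContentHandler that serializes events.
            SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = tf.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));

            // Feed the parser's events straight into the serializer.
            SAXParserFactory pf = SAXParserFactory.newInstance();
            pf.setNamespaceAware(true);
            XMLReader reader = pf.newSAXParser().getXMLReader();
            reader.setContentHandler(serializer);
            reader.parse(new InputSource(new StringReader(xml)));

            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize input", e);
        }
    }
}
```

The serializer takes care of tags, attributes and escaping. A splitter built this way would presumably wrap the serializer in its own handler that counts closing page elements and starts a fresh result for each output file, which this sketch leaves out.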

Richard Fearn