2

I have an XML file (a GML file) that may be 1 GB or larger, and I need to split it into several XML files based on its content.

Basically, I need a parser that doesn't load the whole content into memory. It must run on a 32-bit platform; the target OS is Windows XP and up.

I am thinking of the following options:

  1. extending org.xml.sax.helpers.DefaultHandler

  2. use Xerces

  3. use VTD-XML (only if it doesn't load the content into memory. I know about the Huge classes of VTD-XML, but they can only be used on a 64-bit platform; is there a way to use VTD-XML on 32-bit with a 2 GB file?)

Any guidance in the right direction is appreciated.

Thanatos
eros
  • possible duplicate of [Split 1GB Xml file using Java](http://stackoverflow.com/questions/5169978/split-1gb-xml-file-using-java) – bdoughan May 30 '11 at 09:57
  • I've read the link, thanks. I will eliminate the 3rd option (VTD-XML) because of the author's comment there: "Without namespace support, vtd-xml supports file size up to 2GB in size. With extended VTD-XML has a file size limit of 256 GB, even with namespace support." My target size could be larger than 2 GB, and I have a 32-bit platform requirement. – eros May 31 '11 at 00:14

4 Answers

2

http://vtd-xml.sourceforge.net/

Aravind Yarram
1

See "Fastest XML parser for small, simple documents in Java". (That question is about small files and DOM processing, but the answers apply to big files as well.)

In general, you use a SAX/streaming parser for this kind of work (option 1).

Community
Jayan
  • the question you link to has a *requirement* that the result is a DOM. This is exactly what this question is **not** about. – Joachim Sauer May 30 '11 at 06:44
  • Thanks. The discussion there points to a SAX-based approach, so I thought it was appropriate. – Jayan May 30 '11 at 09:52
1

Use a SAX (or StAX) parser (Aalto?) and a writer at the same time.

I assume the document wrapper (root tree) is known.

  1. First read past the initial start (wrapper) elements.

  2. Then open a new writer and write the document start wrapper. Continue to read events and write the corresponding events until your stop criterion is met, then write the end document wrapper. Repeat n times.

  3. Stop when your reader hits the end document wrapper.

For 1 and 3: I find that keeping track of the node level is more useful than checking element names; it usually works and is quicker.

Obviously, you can forward wrapper details (if present) by storing them in variables in point 1 and applying them in point 2. Your stop criterion should be some number of nodes, because checking the file size all the time will slow things down.
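The steps above could be sketched roughly as follows with the JDK's built-in StAX event API. This is only a minimal illustration, assuming Java 8 or later: the class name `StaxSplitter`, the `maxPerChunk` parameter, and the `sink` callback that opens the output for each chunk are hypothetical names, and one "node" is taken to mean one direct child of the root element.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.function.IntFunction;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies the root's children into chunks of at most maxPerChunk.
class StaxSplitter {
    private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

    // sink.apply(i) must return the Writer for chunk i (e.g. a FileWriter). Returns the chunk count.
    static int split(Reader in, int maxPerChunk, IntFunction<Writer> sink)
            throws XMLStreamException, IOException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        StartElement root = null;   // remembered wrapper (step 1)
        Writer out = null;
        XMLEventWriter writer = null;
        int depth = 0, inChunk = 0, chunks = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
                if (depth == 1) {               // step 1: read past the wrapper start
                    root = event.asStartElement();
                    continue;
                }
                if (depth == 2 && writer == null) {   // step 2: open a new chunk
                    out = sink.apply(chunks++);
                    writer = outputs.createXMLEventWriter(out);
                    writer.add(EVENTS.createStartDocument());
                    writer.add(root);           // forward the wrapper with its attributes
                }
            }
            if (depth >= 2) {
                writer.add(event);              // copy everything inside a child of the root
            }
            if (event.isEndElement()) {
                depth--;
                // A direct child of the root just ended: check the stop criterion.
                if (depth == 1 && ++inChunk == maxPerChunk) {
                    finish(writer, out, root.getName());
                    writer = null;
                    inChunk = 0;
                }
            }
        }
        if (writer != null) {                   // step 3: close the last, partial chunk
            finish(writer, out, root.getName());
        }
        reader.close();
        return chunks;
    }

    private static void finish(XMLEventWriter writer, Writer out, QName rootName)
            throws XMLStreamException, IOException {
        writer.add(EVENTS.createEndElement(
                rootName.getPrefix(), rootName.getNamespaceURI(), rootName.getLocalPart()));
        writer.add(EVENTS.createEndDocument());
        writer.flush();
        writer.close();
        out.close();
    }
}
```

Only the depth counter and the remembered root start tag are kept between events, so memory use should not grow with the input size. Whitespace, comments, and other events directly under the root are dropped in this sketch; a real splitter might want to keep them.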

ThomasRS
0

If your splitting algorithm doesn't need much context (i.e. there's no need for a DOM or a partial DOM), then SAX (i.e. implementing a DefaultHandler) is certainly one of the simplest approaches and doesn't add an external dependency.
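As a minimal sketch of what implementing a `DefaultHandler` looks like, the following handler streams through a document and counts the direct children of the root by element name, which is the kind of bookkeeping a splitter could base its cut points on. The class name `FeatureCounter` and its `count` helper are hypothetical, and the code assumes Java 8 or later with the JDK's default (non-namespace-aware) SAX parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts the root's direct children without building a tree.
class FeatureCounter extends DefaultHandler {
    private int depth = 0;
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        if (depth == 2) {                       // a direct child of the root element
            counts.merge(qName, 1, Integer::sum);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    // Parses the source in streaming fashion and returns the per-name counts.
    static Map<String, Integer> count(InputSource source) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        FeatureCounter handler = new FeatureCounter();
        parser.parse(source, handler);
        return handler.counts;
    }
}
```

For a large file, the source could be created with `new InputSource(new FileInputStream(path))`; the parser then hands events to the handler one at a time, so the handler's own state is all that stays in memory.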

Joachim Sauer
  • This was the first answer that suggested the SAX approach to me. It's really fast, but I use Javolution classes. Thanks. – eros Jun 30 '11 at 01:16