What is the fastest way to parse and split a huge XML file (800 MB and up) into several XML files in Java?

Question

I have an XML file (a GML file) that may be 1 GB or larger and needs to be split into several XML files based on its content.

Basically, I need a parser that doesn't load the whole content into memory. It must run on a 32-bit platform; the target OS is Windows XP and up.

I am thinking of the following options:

1. Extend org.xml.sax.helpers.DefaultHandler
2. Use Xerces
3. Use VTD-XML (only if it doesn't load the content into memory; I know about the Huge classes of VTD-XML, but they can be used only on 64-bit platforms, so this depends on whether there is a way to use VTD-XML on 32-bit with a 2 GB file)

Any guidance on the right direction is appreciated.

Possible duplicate of [Split 1GB Xml file using Java](http://stackoverflow.com/questions/5169978/split-1gb-xml-file-using-java) – bdoughan, May 30 '11 at 09:57
I've read the link, thanks. I will eliminate the 3rd option (VTD-XML) because of the author's comment there: "Without namespace support, vtd-xml supports file size up to 2GB in size. With extended VTD-XML has a file size limit of 256 GB, even with namespace support." My target size could be larger than 2 GB, and I have a 32-bit platform requirement. – eros, May 31 '11 at 00:14

score 2 · Answer 1 · answered May 30 '11 at 04:18

http://vtd-xml.sourceforge.net/

Aravind Yarram

This answer is not useful. That's why I added my comment on the 3rd option. – eros May 31 '11 at 00:21

score 1 · Answer 2 · edited May 23 '17 at 11:50

See "Fastest XML parser for small, simple documents in Java". (That question is about small files and DOM processing, but the answers apply to big files as well.)

In general you use SAX/stream parsers to do the work. (option 1)

answered May 30 '11 at 02:24

Jayan

the question you link to has a *requirement* that the result is a DOM. This is exactly what this question is **not** about. – Joachim Sauer May 30 '11 at 06:44
Thanks. The discussion there points to a SAX-based approach, so I thought it was appropriate. – Jayan May 30 '11 at 09:52

ThomasRS · Answer 3 · 2011-07-01T11:24:16.913

Use a SAX (or StAX) parser (Aalto?) and writer at the same time.

I assume the document wrapper (root tree) is known.

1. First read past the initial start (wrapper) elements.
2. Then open a new writer and write the document start wrapper. Then continue to read and write corresponding events until your stop criteria. Then write the end document wrapper. Repeat n times.
3. Stop when your reader hits the end document wrapper.

For 1 and 3: I find keeping track of the node level is more useful than checking element names; it usually works and is quicker.

Obviously you can forward wrapper details, if present, by adding some variables in point 1 and applying them in point 2. Your stop criterion should be some number of nodes; checking the file size all the time will slow things down.
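The steps above could be sketched roughly as follows with the JDK's built-in StAX event API (`javax.xml.stream`) rather than Aalto. This is only a minimal sketch under several assumptions that are not from the answer: the class name `StaxSplitter`, the `part-N.xml` file names, and the `recordsPerFile` parameter are all hypothetical, and a "record" is taken to be any direct child of the root element (node level 2). Comments and whitespace between records are dropped.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: splits by counting depth-2 elements, not by file size.
class StaxSplitter {

    static List<Path> split(Path input, Path outDir, int recordsPerFile)
            throws IOException, XMLStreamException {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        List<Path> parts = new ArrayList<>();
        StartElement root = null;      // wrapper start tag, replayed into every part
        XMLEventWriter writer = null;  // null while no part file is open
        OutputStream out = null;
        int depth = 0;
        int records = 0;

        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = inFactory.createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    depth++;
                    if (depth == 1) {              // step 1: remember the wrapper
                        root = event.asStartElement();
                        continue;
                    }
                    if (depth == 2 && writer == null) {  // step 2: open a new part
                        Path part = outDir.resolve("part-" + (parts.size() + 1) + ".xml");
                        parts.add(part);
                        out = Files.newOutputStream(part);
                        writer = outFactory.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8", "1.0"));
                        writer.add(root);
                    }
                }
                if (writer != null && depth >= 2) {
                    writer.add(event);             // copy everything inside a record
                }
                if (event.isEndElement()) {
                    depth--;
                    if (depth == 1) {
                        records++;                 // a record just ended
                    }
                    boolean partFull = depth == 1 && records == recordsPerFile;
                    boolean inputDone = depth == 0;  // step 3: wrapper end reached
                    if (writer != null && (partFull || inputDone)) {
                        writer.add(events.createEndElement(root.getName(), root.getNamespaces()));
                        writer.add(events.createEndDocument());
                        writer.close();
                        out.close();
                        writer = null;
                        records = 0;
                    }
                }
            }
            reader.close();
        }
        return parts;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java StaxSplitter <input.xml> <outputDir> <recordsPerFile>
        List<Path> parts = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("Wrote " + parts.size() + " files");
    }
}
```

Only the current event is held in memory, so the input size should not matter much. Because the stop criterion is a record count tracked through the node level, there is no per-event file size check.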

score 0 · Accepted Answer · answered May 30 '11 at 06:46

If your splitting algorithm doesn't need much context (i.e. there's no need for a DOM or a partial DOM), then SAX (i.e. implementing a DefaultHandler) is certainly one of the simplest approaches and doesn't add an external dependency.
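As an illustration of the reading side only, a `DefaultHandler` can track the element depth to spot record boundaries without building any tree. The sketch below is an assumption-laden example, not the answerer's code: the class name `RecordBoundaryHandler`, the `recordsPerChunk` parameter, and the idea that each direct child of the root is one record are all hypothetical. It only counts records and chunk boundaries; a real splitter would open and close output files at the marked places.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts depth-2 elements and the chunks they would fill.
class RecordBoundaryHandler extends DefaultHandler {
    private final int recordsPerChunk;
    private int depth;    // 1 = root element, 2 = record
    private int records;  // records fully read so far
    private int chunks;   // chunks started so far

    RecordBoundaryHandler(int recordsPerChunk) {
        this.recordsPerChunk = recordsPerChunk;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        if (depth == 2 && records % recordsPerChunk == 0) {
            chunks++;  // a real splitter would open the next output file here
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 2) {
            records++;  // a record just ended; a full chunk could be closed here
        }
        depth--;
    }

    int records() { return records; }

    int chunks() { return chunks; }

    static RecordBoundaryHandler scan(InputStream in, int recordsPerChunk)
            throws ParserConfigurationException, SAXException, IOException {
        RecordBoundaryHandler handler = new RecordBoundaryHandler(recordsPerChunk);
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><item>a</item><item><sub/></item><item/></root>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        RecordBoundaryHandler h = scan(in, 2);
        System.out.println(h.records() + " records, " + h.chunks() + " chunks");
        // prints "3 records, 2 chunks"
    }
}
```

The handler keeps three integers of state regardless of input size, which is why this style suits a memory-constrained 32-bit JVM.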

Joachim Sauer

This was the first answer to suggest the SAX approach. It's really fast, but I use the Javolution classes. Thanks. – eros Jun 30 '11 at 01:16
