
I have to split an XML file that is at least 3 GB in size. We can provide only 1.5 GB of heap space on a 64-bit JVM on Windows. All the example code I have found on the Internet uses VTDNav only, not VTDNavHuge. The goal is to read this huge XML, extract a particular node from it using XPath, and create a new XML file with the extracted content. I always get an OutOfMemoryError, even though it is stated that VTD-XML extended (i.e. VTDNavHuge) can process files of up to 256 GB. Please help me with sample code to complete this task in the given environment: a file larger than 3 GB and 1.5 GB of heap space. I am trying to use memory-mapped mode while parsing the file with VTD-XML extended.

Jayanand
  • Anything that requires loading the document into memory will probably fail with files that big. Look into either coding something based on StAX or STX (Streaming Transformations for XML). I'll leave the actual code up to you (a rough StAX sketch follows these comments). – David Ehrmann Sep 22 '14 at 15:46
  • This might point you in the right direction: http://stackoverflow.com/questions/1134189/can-jaxb-parse-large-xml-files-in-chunks – spudone Sep 22 '14 at 22:04
  • Can you provide a sample XML so I can give it a try on my machine? – vtd-xml-author Sep 25 '14 at 18:49
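
A rough illustration of the StAX route suggested in the first comment: a streaming reader never holds the whole document in memory, so heap usage does not grow with file size. The element name "student", the attribute name "id", the id value "12345", and the file names below are placeholders, since the real structure of the XML is not shown in the question.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class stax_extract {
    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("test.xml"));
        XMLEventWriter writer = null;
        boolean copying = false;
        int depth = 0; // nesting depth inside the matched element

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (!copying && e.isStartElement()) {
                StartElement se = e.asStartElement();
                Attribute id = se.getAttributeByName(new QName("id"));
                // "student" / "id" / "12345" stand in for the real names and input value
                if ("student".equals(se.getName().getLocalPart())
                        && id != null && "12345".equals(id.getValue())) {
                    writer = XMLOutputFactory.newInstance()
                            .createXMLEventWriter(new FileOutputStream("out.xml"));
                    copying = true;
                    depth = 0;
                }
            }
            if (copying) {
                writer.add(e); // copy the matched element and everything inside it
                if (e.isStartElement()) depth++;
                if (e.isEndElement() && --depth == 0) {
                    writer.close(); // matched element closed; done
                    break;
                }
            }
        }
        reader.close();
    }
}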

1 Answer


This is a demonstration of how to use the extended VTD parser to process a large XML file. You need a 64-bit JVM to take full advantage of extended VTD.

import com.ximpleware.extended.*;

public class mem_mapped_read {
    public static void main(String[] s) throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        // Parse in memory-mapped mode so the whole document is not loaded into the heap
        if (vg.parseFile("test.xml", true, VTDGenHuge.MEM_MAPPED)) {
            VTDNavHuge vnh = vg.getNav();
            AutoPilotHuge aph = new AutoPilotHuge(vnh);
            aph.selectXPath("//*");
            int i = 0;
            // Walk every element the XPath expression matches and print its name
            while ((i = aph.evalXPath()) != -1) {
                System.out.println(" element name is " + vnh.toString(i));
            }
        }
    }
}
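
If you also need to write the matched node out as its own XML file (as described in the question), here is a minimal sketch extending the loop above. The XPath "//student[@id='12345']" and the output file names are placeholders, since the real document structure is not shown; getElementFragment() and writeToFileOutputStream() are the extended-VTD calls used for this kind of fragment copy.

import java.io.FileOutputStream;
import com.ximpleware.extended.*;

public class fragment_extract {
    public static void main(String[] s) throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("test.xml", true, VTDGenHuge.MEM_MAPPED)) {
            VTDNavHuge vnh = vg.getNav();
            AutoPilotHuge aph = new AutoPilotHuge(vnh);
            // Placeholder XPath: select the record whose id attribute matches the input
            aph.selectXPath("//student[@id='12345']");
            int i = 0;
            while ((i = aph.evalXPath()) != -1) {
                // Offset and length of the matched element within the parsed buffer
                long[] la = vnh.getElementFragment();
                // Copy just that fragment into a new, small output file
                try (FileOutputStream fos = new FileOutputStream("extracted_" + i + ".xml")) {
                    vnh.getXML().writeToFileOutputStream(fos, la[0], la[1]);
                }
            }
        }
    }
}

A variant of this appears in the asker's own code later in the comment thread.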
vtd-xml-author
  • Could you please specify how much heap space I need to set to read an XML this big (>3 GB, up to 5 GB)? I've read somewhere on an IBM site that there is a limit on heap size [http://publib.boulder.ibm.com/infocenter/javasdk/tools/index.jsp?topic=%2Fcom.ibm.java.doc.igaa%2F_1vg000139b8b453-11951f1e7ff-8000_1001.html], but as we are using a 64-bit JVM there is no limit on the heap space we can set. However, the requirement limits the maximum heap space to 2 GB, as I negotiated with the designer. I will provide the sample structure in a while. – Jayanand Sep 29 '14 at 06:29
  • Thanks for the input, **Mr. vtd-xml-author**. It worked well. I need some more info on this; I will post the queries in a couple of hours. – Jayanand Sep 30 '14 at 06:26
  • Hi **Mr. vtd-xml-author**. I am using the code provided in the specified link [http://blog.msbbc.co.uk/2011/04/java-handling-large-xml-documents-with.html]. I was able to parse an XML larger than 2 GB with 1 GB of RAM; it is taking almost 9 minutes to generate the output XML. The sample structure of the XML file is like this: 'somedatainside this so many child nodes'. – Jayanand Oct 01 '14 at 09:55
  • The inputs are the **id** value and the **physical path** of the XML file. The task is to read the input XML and extract the node data matching the given student id into a new XML. For your information, the output XML can be at most 80 KB. Please help me optimize the code. – Jayanand Oct 01 '14 at 09:56
  • Hi **Mr. vtd-xml-author**, any suggestions on this? – Jayanand Oct 06 '14 at 19:32
  • Sorry, a 2 GB file on a 1 GB machine? Lack of memory could be the reason why it is so slow... I saw that your file is quite simple structurally, so XPath is well suited for this... I think if you can afford to put more memory in, the performance would improve 10x easily... – vtd-xml-author Oct 06 '14 at 21:40
  • Hi **Mr. vtd-xml-author**, thanks for the suggestion. The business requirement is to provide 1 GB heap space only. It is taking approximately 6-8 minutes to process a 2.1 GB file using VTD extended. Could you please suggest any code tuning for the code provided at **http://blog.msbbc.co.uk/2011/04/java-handling-large-xml-documents-with.html**? I am using this code for creating the new XML; please suggest if any fine tuning is required. – Jayanand Oct 07 '14 at 18:30
  • I looked at the code on that website; it looks good. Without any file-specific info it would be hard to offer anything concrete for code tuning. Can you package up a test example and send it to me? – vtd-xml-author Oct 08 '14 at 01:11
  • Hi **Mr. vtd-xml-author**, the task details are mentioned below. The sample structure of the input XML file is like this: 'somedatainside this so many child nodes'. The inputs are the **id** value and the **physical path** of the XML file. We need to copy the node data related to the given **id** from the source file and create a new XML. The environment provided is Windows 7, a 64-bit JVM, 1 GB heap space, and an input file larger than 2 GB. – Jayanand Oct 08 '14 at 09:53
  • Code used is 'VTDGenHuge vg = new VTDGenHuge(); if (vg.parseFile(BASE + SRC_FILE, true, VTDGenHuge.MEM_MAPPED)) { VTDNavHuge vnh = vg.getNav(); AutoPilotHuge aph = new AutoPilotHuge(vnh); aph.selectXPath(XPATH); int i = 0; while ((i = aph.evalXPath()) != -1) { long[] la = vnh.getElementFragment(); if (la != null) { vnh.getXML().writeToFileOutputStream( new FileOutputStream(BASE + OUTPUT_FOLDER + vnh.getCurrentIndex() + ".xml"), la[0], la[1]); } } } } }' – Jayanand Oct 08 '14 at 09:58
  • As I mentioned above, the business requirement is to provide 1 GB heap space only. It's taking approximately 6-8 minutes to process a 2.1 GB file using VTD extended with the above code. Could you please suggest any code tuning to reduce the time taken and improve the performance? Please respond quickly (it should process the file in 1-2 minutes); I've already crossed the deadline by 2 weeks. – Jayanand Oct 08 '14 at 10:07
  • The code actually looks pretty good... I think you are running into the fundamental limitation of memory mapping (i.e. swapping file content in and out of memory from disk)... if you really have to do it quickly, buy some more memory and increase your heap size; the performance requirement of 1-2 minutes is achievable... – vtd-xml-author Oct 09 '14 at 03:16
  • In that case, how much heap space do we need to provide to process a file larger than 2 GB? Please suggest. As I am using Windows 7 and a 64-bit JVM, there is no limitation on how much heap space can be set, but the business requirement limits it to 1 GB. For information: **http://publib.boulder.ibm.com/infocenter/javasdk/tools/index.jsp?topic=%2Fcom.ibm.java.doc.igaa%2F_1vg000139b8b453-11951f1e7ff-8000_1001.html** – Jayanand Oct 09 '14 at 14:35
  • Can we achieve better performance by using .NET with VTD-XML extended? Please suggest. – Jayanand Oct 09 '14 at 14:36
  • I think that 3 GB is needed to avoid any disk-I/O-based file swapping. I don't think there is a version of extended vtd-xml in .NET. Also, since your OS is 64-bit, there is no such 1 GB heap limit as far as I know. – vtd-xml-author Oct 09 '14 at 19:21
  • I didn't understand the term "disk-I/O-based file swapping" in the above comment. I am using memory-mapped parsing with the VTD-XML jar. – Jayanand Oct 12 '14 at 14:12
  • Hi **Mr. vtd-xml-author**, if I need to process a 4 GB XML, which is the maximum size of the input file, how much heap space do I need to set to avoid an OutOfMemoryError, given that it should be processed in roughly 1 minute? – Jayanand Oct 12 '14 at 14:16
  • To be safe, get 8 GB of memory, but really the more the better (a quick way to check the heap the JVM actually received is sketched below). – vtd-xml-author Oct 13 '14 at 06:35
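
Regarding the heap figures discussed above (1 GB vs. 3 GB vs. 8 GB): the maximum heap is set with the JVM's -Xmx launch flag, and a tiny check like the one below reports what the running JVM actually received. The class name is illustrative, not part of the original post.

public class heap_check {
    // Launch with, for example:  java -Xmx3g heap_check
    public static void main(String[] args) {
        // Report the maximum heap this JVM may use, in megabytes
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Maximum heap available: " + maxMb + " MB");
    }
}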