My requirement: I have a 1 GB XML file and want to remove a few nodes from it. The nodes to remove can be anywhere in the file, depending on the input. What is the best parser for this in Java? I'm currently using a DOM parser and it works fine for 100 MB files, but for a 1 GB file it throws an out-of-memory error (heap space). Can anyone suggest a better approach for my code below?

    public static void main(String[] args) {
        DocumentBuilder docBuilder = null;
        File inputFile = new File("/scratch/bigfile/final.txt");
        try {
            // Parse the XML file using a DOM parser
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
            docBuilderFactory.setExpandEntityReferences(false);
            docBuilderFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            docBuilder = docBuilderFactory.newDocumentBuilder();
            Document doc = docBuilder.parse(inputFile);

            // Remove unwanted nodes from the document
            Element element1 = (Element) doc.getElementsByTagName("G_SUMMARY_ROWSET").item(0);
            element1.getParentNode().removeChild(element1);
            Element element2 = (Element) doc.getElementsByTagName("G_JRNLSOURCE_ROWSET").item(0);
            element2.getParentNode().removeChild(element2);
            Element element3 = (Element) doc.getElementsByTagName("G_JRNLSOURCE_UNMATCHED_ROWSET").item(0);
            element3.getParentNode().removeChild(element3);
            Element element4 = (Element) doc.getElementsByTagName("G_JRNLDETAILS_UNMATCHED_ROWSET").item(0);
            element4.getParentNode().removeChild(element4);

            // Convert the DOM document to a byte array
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();
            DOMSource source = new DOMSource(doc);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            StreamResult result = new StreamResult(bos);
            transformer.transform(source, result);
            byte[] array = bos.toByteArray();
            System.out.println(array.length);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
user2155454

1 Answer

Consider using a SAX parser. It is generally better for larger files because the document is not kept in memory: most elements are discarded as soon as they have been processed. This would solve your issue of running out of memory.

This contrasts with a DOM (Document Object Model) parser, which loads the entire document into memory.
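For illustration, here is a minimal sketch of that idea (not from the original answer): it uses an `org.xml.sax.helpers.XMLFilterImpl` to swallow the four ROWSET elements from the question while the document streams through an identity `Transformer`, so no DOM tree is ever built. The class name and output path are placeholder assumptions, and the filter assumes the elements are in no namespace.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamResult;

    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLFilterImpl;

    public class RemoveNodesWithSax {

        // SAX filter that swallows the named elements and their whole subtrees
        static class ElementRemover extends XMLFilterImpl {
            private final Set<String> toRemove;
            private int depth = 0; // > 0 while inside an element being removed

            ElementRemover(XMLReader parent, Set<String> toRemove) {
                super(parent);
                this.toRemove = toRemove;
            }

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) throws SAXException {
                // localName equals the plain tag name when no namespaces are used
                if (depth > 0 || toRemove.contains(localName)) {
                    depth++;                  // entering a removed subtree
                } else {
                    super.startElement(uri, localName, qName, atts);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName)
                    throws SAXException {
                if (depth > 0) {
                    depth--;                  // leaving a removed subtree
                } else {
                    super.endElement(uri, localName, qName);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length)
                    throws SAXException {
                if (depth == 0) {             // drop text inside removed elements
                    super.characters(ch, start, length);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            Set<String> unwanted = new HashSet<>(Arrays.asList(
                    "G_SUMMARY_ROWSET",
                    "G_JRNLSOURCE_ROWSET",
                    "G_JRNLSOURCE_UNMATCHED_ROWSET",
                    "G_JRNLDETAILS_UNMATCHED_ROWSET"));

            // Stream the input through the filter straight to an output file;
            // no DOM tree is ever built, so memory use stays roughly constant.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.transform(
                    new SAXSource(new ElementRemover(reader, unwanted),
                            new InputSource(new FileInputStream("/scratch/bigfile/final.txt"))),
                    new StreamResult(new FileOutputStream("/scratch/bigfile/filtered.xml")));
        }
    }

The depth counter is what makes whole subtrees disappear: once an unwanted element starts, every nested start/end event just moves the counter, and nothing is forwarded to the output until the matching end tag closes it again.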

Nick
  • I forgot to mention one point here. We are storing the 1 GB of data in a byte array (not in a file) using an input stream from the server. In that case, can we still use a SAX parser, since the byte array is held in memory? Please suggest. – user2155454 Jul 31 '18 at 05:25
  • @user2155454 your issue then is that you can't store all 1 GB in a byte array because you don't have that much memory. Your options are either to avoid holding it in memory by using a SAX parser, or to increase the heap size. You can increase the heap size with the command `java -Xmx3g your_program`, where 3g means 3 GB. You can use more or less based on your machine's specifications. – Nick Jul 31 '18 at 13:45
  • @user2155454 - Yes you can. You can hand a ByteArrayInputStream to the SAXParser. – Alohci Jul 31 '18 at 14:01
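To make Alohci's point concrete, a small sketch under the same assumptions as the earlier example (`ElementRemover` is the hypothetical filter class from that sketch, and `data` stands in for the byte array received from the server):

    import java.io.ByteArrayInputStream;

    static void filterInMemoryXml(byte[] data, Set<String> unwanted) throws Exception {
        // Wrap the in-memory bytes in an InputSource; the SAX pipeline
        // then streams them exactly as it would stream a file.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(
                new SAXSource(new ElementRemover(reader, unwanted),
                        new InputSource(new ByteArrayInputStream(data))),
                new StreamResult(new FileOutputStream("filtered.xml")));
    }

Note that this still requires the whole 1 GB byte[] to fit in the heap; streaming directly from the server's InputStream into the SAX filter would avoid even that, per Nick's comment above.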