I have an XML file whose tags I want to manipulate using the Java DOM API, but the file is 25 GB, so parsing fails with this error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

    public Frwiki() {
        filePath = "D:\\compressed\\frwiki-latest-pages-articles.xml";
    }

    public void deletingTag() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().parse(filePath);
        NodeList nodes = doc.getElementsByTagName("*");

        for (int j = 0; j < 3; j++) {
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                if (!node.getNodeName().equals("id") && !node.getNodeName().equals("title")
                        && !node.getNodeName().equals("text") && !node.getNodeName().equals("mediawiki")
                        && !node.getNodeName().equals("revision") && !node.getNodeName().equals("page"))
                    node.getParentNode().removeChild(node);
            }
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(filePath));
    }
Robert
Moharach
  • When do you get the java.lang.OutOfMemoryError: while reading the file or in the for loop? Which line throws it? – Gatusko Jan 19 '23 at 14:14
  • Unless you have a huge machine, you won't be able to create a DOM tree of a 25 GB XML file. Best guess is that it would require something close to 250 GB of RAM. See if you can use one of the streaming XML APIs instead, such as SAX or StAX. – ewramner Jan 19 '23 at 14:31
  • I had no error number when the exception occurred. I cannot read a 25 GB file, so I am looking for a way to read it line by line. – Moharach Jan 22 '23 at 08:19
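
The streaming approach suggested in the comments could look roughly like the following StAX sketch. It reads the file event by event and copies only the elements the question's loop keeps (`id`, `title`, `text`, `mediawiki`, `revision`, `page`), dropping every other element together with its subtree. The class name `StreamingTagFilter` and the command-line arguments are hypothetical, and the sketch assumes a UTF-8 input and a separate output file, since a stream cannot be rewritten in place.

```java
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of the original question.
final class StreamingTagFilter {

    // Element names to keep, taken from the condition in the question's loop.
    private static final Set<String> KEEP =
            Set.of("id", "title", "text", "mediawiki", "revision", "page");

    // Copies the XML from in to out, skipping every element (and its
    // whole subtree) whose local name is not in KEEP.
    static void filter(Reader in, Writer out) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int skipDepth = 0; // greater than 0 while inside a dropped subtree
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (skipDepth > 0 || !KEEP.contains(name)) {
                    skipDepth++;
                    continue;
                }
            } else if (event.isEndElement() && skipDepth > 0) {
                skipDepth--;
                continue;
            }
            if (skipDepth == 0) {
                writer.add(event);
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: StreamingTagFilter <input.xml> <output.xml>");
            return;
        }
        // Files.newBufferedReader/newBufferedWriter default to UTF-8.
        try (Reader in = Files.newBufferedReader(Path.of(args[0]));
             Writer out = Files.newBufferedWriter(Path.of(args[1]))) {
            filter(in, out);
        }
    }
}
```

Only the current event is held at a time, so memory use should stay roughly flat regardless of file size. If the JDK's built-in parser stops with a JAXP limit error on an input this large, the limits can be adjusted through system properties such as `jdk.xml.totalEntitySizeLimit`.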

2 Answers

You can split a large file into smaller files using XSLT 3.0 streaming, like this:

<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
    <xsl:template name="xsl:initial-template">
      <xsl:source-document streamable="yes" href="frwiki-latest-pages-articles.xml">
        <xsl:for-each-group ....>
           <xsl:result-document href="......">
              <part><xsl:copy-of select="current-group()"/></part>
           </xsl:result-document>
        </xsl:for-each-group>
      </xsl:source-document>
    </xsl:template>
    
</xsl:transform>

The "..." parts depend on how you want to split the document and name the result files.

Although XSLT 3.0 streaming is a W3C specification, the only implementation available at the moment is my company's Saxon-EE processor.

Michael Kay

Split the large XML file into smaller chunks and process them separately.
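
One possible way to do the splitting in plain Java, without loading the file, is a StAX pass that writes every N record elements into a new chunk. The sketch below is only an illustration under stated assumptions: the class `XmlChunkSplitter`, the `ChunkSink` interface, and the parameter names are hypothetical, the `part` wrapper element is borrowed from the XSLT answer above, and the record element is assumed to be `page`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of the original answer.
final class XmlChunkSplitter {

    // Hypothetical callback that supplies the Writer for chunk number index.
    interface ChunkSink {
        Writer open(int index) throws IOException;
    }

    // Copies each recordName element into chunks of at most perChunk records,
    // wrapping every chunk in a <part> element. Returns the number of chunks.
    static int split(Reader in, String recordName, int perChunk, ChunkSink sink)
            throws XMLStreamException, IOException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        XMLEventWriter writer = null;
        Writer target = null;
        int chunks = 0;
        int inChunk = 0; // records written to the current chunk
        int depth = 0;   // greater than 0 while inside a record
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (depth == 0) {
                // Outside a record: ignore everything except the next record start.
                if (!event.isStartElement()
                        || !event.asStartElement().getName().getLocalPart().equals(recordName)) {
                    continue;
                }
                if (writer == null) {
                    target = sink.open(chunks++);
                    writer = outputs.createXMLEventWriter(target);
                    writer.add(events.createStartElement("", "", "part"));
                }
            }
            if (event.isStartElement()) depth++;
            if (event.isEndElement()) depth--;
            writer.add(event);
            if (depth == 0 && ++inChunk == perChunk) {
                closeChunk(writer, target, events);
                writer = null;
                inChunk = 0;
            }
        }
        if (writer != null) {
            closeChunk(writer, target, events); // last, possibly partial chunk
        }
        reader.close();
        return chunks;
    }

    private static void closeChunk(XMLEventWriter writer, Writer target, XMLEventFactory events)
            throws XMLStreamException, IOException {
        writer.add(events.createEndElement("", "", "part"));
        writer.flush();
        writer.close();
        target.close();
    }
}
```

A caller would pass a `ChunkSink` that opens one file per index, for example with `Files.newBufferedWriter`. Content outside the record elements is discarded, and namespace declarations that live only on the root element may not be repeated in the chunks, so this fits best when each chunk is processed by code that looks at local names.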