How to strip whitespace-only text nodes from a DOM before serialization?

Question

I have some Java (5.0) code that constructs a DOM from various (cached) data sources, then removes certain element nodes that are not required, then serializes the result into an XML string using:

// Serialize DOM back into a string
Writer out = new StringWriter();
Transformer tf = TransformerFactory.newInstance().newTransformer();
tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
tf.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
tf.setOutputProperty(OutputKeys.INDENT, "no");
tf.transform(new DOMSource(doc), new StreamResult(out));
return out.toString();

However, since I'm removing several element nodes, I end up with a lot of extra whitespace in the final serialized document.

Is there a simple way to remove/collapse the extraneous whitespace from the DOM before (or while) it's serialized into a String?

score 39 · Accepted Answer · answered Jun 11 '09 at 06:18

You can find empty text nodes using XPath, then remove them programmatically like so:

XPathFactory xpathFactory = XPathFactory.newInstance();
// XPath to find empty text nodes.
XPathExpression xpathExp = xpathFactory.newXPath().compile(
        "//text()[normalize-space(.) = '']");  
NodeList emptyTextNodes = (NodeList) 
        xpathExp.evaluate(doc, XPathConstants.NODESET);

// Remove each empty text node from document.
for (int i = 0; i < emptyTextNodes.getLength(); i++) {
    Node emptyTextNode = emptyTextNodes.item(i);
    emptyTextNode.getParentNode().removeChild(emptyTextNode);
}

This approach might be useful if you want more control over node removal than is easily achieved with an XSL template.
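
For reference, here is a minimal end-to-end sketch of this approach: parse, strip, then serialize with the same Transformer settings as in the question. The class name `StripEmptyTextDemo` and the helper `strip` are hypothetical names used only for illustration, and the call to `doc.normalize()` is an assumption that some readers found necessary after removing nodes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class StripEmptyTextDemo {

    // Hypothetical helper: parse a string, drop whitespace-only text nodes, serialize.
    static String strip(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Merge adjacent text nodes first (assumption: needed on some setups).
        doc.normalize();

        // Select every text node that is empty once whitespace is normalized.
        NodeList empty = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//text()[normalize-space(.) = '']", doc, XPathConstants.NODESET);
        for (int i = 0; i < empty.getLength(); i++) {
            Node n = empty.item(i);
            n.getParentNode().removeChild(n);
        }

        // Serialize without declaration or indentation, as in the question.
        StringWriter out = new StringWriter();
        Transformer tf = TransformerFactory.newInstance().newTransformer();
        tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        tf.setOutputProperty(OutputKeys.INDENT, "no");
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(strip("<r>\n  <a>x</a>\n  <b/>\n</r>"));
    }
}
```

Text nodes that contain anything other than whitespace, such as the `x` inside `<a>`, are left untouched.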

James Murty

I like this "code only" solution even better than the XSL solution, and like you said there is a bit more control over node removal, if required. – Marc Novakowski Jun 11 '09 at 16:30

By the way, this method only seems to work if I first call doc.normalize() before doing the node removal. I'm not sure why that makes a difference. – Marc Novakowski Jun 11 '09 at 19:20

Excellent answer. Works for me even without normalize(). – james.garriss Feb 20 '12 at 14:09

@MarcNovakowski Here is a sample case that needs a call to `normalize()`. Load an XML string into a DOM object. Call `removeChild()` to take some nodes out of the DOM. Then try to strip whitespace as in this answer (`//text()[normalize-space(.) = '']`). Blank lines appear where the nodes were removed. This won't happen if `normalize()` is called first. – Stephan Feb 27 '17 at 11:30
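
The effect of `removeChild()` described in these comments can be observed directly: removing an element leaves the whitespace text nodes on either side of it adjacent to each other, and `normalize()` merges them. A plausible explanation for the behaviour reported above (an assumption, not confirmed in this thread) is that the XPath data model has no notion of adjacent text nodes, so an XPath implementation may hand back only one node of such a run. The class name `NormalizeDemo` and its helpers are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class NormalizeDemo {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the root's child count before removal, after removal, and after normalize().
    static int[] childCounts() throws Exception {
        Document doc = parse("<r>\n  <x/>\n  <y/>\n</r>");
        Element root = doc.getDocumentElement();

        int before = root.getChildNodes().getLength();        // text, x, text, y, text

        root.removeChild(root.getElementsByTagName("x").item(0));
        int afterRemoval = root.getChildNodes().getLength();   // text, text, y, text

        root.normalize();                                       // merges the two adjacent text nodes
        int afterNormalize = root.getChildNodes().getLength(); // text, y, text

        return new int[] { before, afterRemoval, afterNormalize };
    }

    public static void main(String[] args) throws Exception {
        int[] c = childCounts();
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // prints "5 4 3"
    }
}
```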

score 8 · Answer 2 · edited Mar 02 '16 at 12:52

Try using the following XSL and the strip-space element to serialize your DOM:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
     <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

http://helpdesk.objects.com.au/java/how-do-i-remove-whitespace-from-an-xml-document
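
To use this stylesheet from Java, it can be passed to `TransformerFactory.newTransformer(Source)` in place of the identity transformer used in the question. The following is a minimal sketch, assuming the stylesheet is kept in a string constant; the class name `StripSpaceXslDemo` and the helpers `parse` and `serialize` are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class StripSpaceXslDemo {

    // The stylesheet from this answer, inlined as a string for the sketch.
    static final String XSL =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:strip-space elements=\"*\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Serializes the DOM through the stylesheet instead of the identity transform.
    static String serialize(Document doc) throws Exception {
        Transformer tf = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        tf.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(parse("<r>\n  <a>x</a>\n</r>")));
    }
}
```

Note that the DOM itself is not modified here; only the serialized output has the whitespace-only text nodes stripped.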

Venkata Raju · Answer 3 · 2013-04-29T18:40:00.980

The code below deletes comment nodes and whitespace-only text nodes. If a text node has other content, its value is trimmed.

public static void clean(Node node)
{
  NodeList childNodes = node.getChildNodes();

  for (int n = childNodes.getLength() - 1; n >= 0; n--)
  {
     Node child = childNodes.item(n);
     short nodeType = child.getNodeType();

     if (nodeType == Node.ELEMENT_NODE)
        clean(child);
     else if (nodeType == Node.TEXT_NODE)
     {
        String trimmedNodeVal = child.getNodeValue().trim();
        if (trimmedNodeVal.length() == 0)
           node.removeChild(child);
        else
           child.setNodeValue(trimmedNodeVal);
     }
     else if (nodeType == Node.COMMENT_NODE)
        node.removeChild(child);
  }
}

Ref: http://www.sitepoint.com/removing-useless-nodes-from-the-dom/

This method is useful for small XML documents but not for large ones with many nested nodes. For 4K records, it took around 30 seconds to process. I would suggest reading the XML as a string and then using `xmlString.replaceAll("\\p{javaWhitespace}+", "")`, which will be quicker. – NIGAGA Nov 04 '20 at 09:57
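
A caution on the string-based suggestion in the comment above: that regular expression removes every whitespace character, including the spaces between a tag name and its attributes and the spaces inside text content. The sketch below compares it with a narrower pattern that only targets whitespace between a `>` and the next `<`. The narrower pattern is a swapped-in alternative, not something proposed in this thread, and it is still only a heuristic (it would also touch CDATA sections and whitespace-sensitive mixed content). The class and method names are hypothetical.

```java
public class WhitespaceRegexDemo {

    // The pattern from the comment: strips ALL whitespace, wherever it occurs.
    static String stripAll(String xml) {
        return xml.replaceAll("\\p{javaWhitespace}+", "");
    }

    // Narrower heuristic: strips only whitespace runs sitting between two tags.
    static String stripBetweenTags(String xml) {
        return xml.replaceAll(">\\s+<", "><");
    }

    public static void main(String[] args) {
        String xml = "<a b=\"1\">\n  <c>x y</c>\n</a>";
        System.out.println(stripAll(xml));         // prints <ab="1"><c>xy</c></a>
        System.out.println(stripBetweenTags(xml)); // prints <a b="1"><c>x y</c></a>
    }
}
```

The first result is no longer the same document: the element name and attribute have run together, and the text `x y` has lost its space.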

pimlottc · Answer 4 · 2015-01-23T19:14:37.417

Another possible approach is to remove neighboring whitespace at the same time as you're removing the target nodes:

private void removeNodeAndTrailingWhitespace(Node node) {
    List<Node> exiles = new ArrayList<Node>();

    exiles.add(node);
    for (Node whitespace = node.getNextSibling();
            whitespace != null && whitespace.getNodeType() == Node.TEXT_NODE && whitespace.getTextContent().matches("\\s*");
            whitespace = whitespace.getNextSibling()) {
        exiles.add(whitespace);
    }

    for (Node exile: exiles) {
        exile.getParentNode().removeChild(exile);
    }
}

This has the benefit of keeping the rest of the existing formatting intact.
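
A small usage sketch of the method above, assuming it is made `static` for the demo and wrapped in a hypothetical class `RemoveWithWhitespaceDemo`. It removes `<x/>` together with the whitespace that follows it, leaving the indentation in front of `<y/>` in place.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RemoveWithWhitespaceDemo {

    // Same logic as the method in this answer, made static for the sketch.
    static void removeNodeAndTrailingWhitespace(Node node) {
        List<Node> exiles = new ArrayList<Node>();
        exiles.add(node);
        // Collect the whitespace-only text nodes that directly follow the target.
        for (Node ws = node.getNextSibling();
                ws != null && ws.getNodeType() == Node.TEXT_NODE
                        && ws.getTextContent().matches("\\s*");
                ws = ws.getNextSibling()) {
            exiles.add(ws);
        }
        // Remove only after collecting, so sibling links are not disturbed mid-walk.
        for (Node exile : exiles) {
            exile.getParentNode().removeChild(exile);
        }
    }

    // Builds a small document, removes <x/>, and returns the root for inspection.
    static Element demo() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<r>\n  <x/>\n  <y/>\n</r>")));
        Element root = doc.getDocumentElement();
        removeNodeAndTrailingWhitespace(root.getElementsByTagName("x").item(0));
        return root;
    }

    public static void main(String[] args) throws Exception {
        Element root = demo();
        // Remaining children: leading whitespace, <y/>, trailing whitespace.
        System.out.println(root.getChildNodes().getLength()); // prints 3
    }
}
```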

score 0 · Answer 5 · answered Jul 20 '16 at 17:54

The following code works:

public String getSoapXmlFormatted(String pXml) {
    try {
        if (pXml != null) {
            DocumentBuilderFactory tDbFactory = DocumentBuilderFactory
                    .newInstance();
            DocumentBuilder tDBuilder;
            tDBuilder = tDbFactory.newDocumentBuilder();
            Document tDoc = tDBuilder.parse(new InputSource(
                    new StringReader(pXml)));
            removeWhitespaces(tDoc);
            final DOMImplementationRegistry tRegistry = DOMImplementationRegistry
                    .newInstance();
            final DOMImplementationLS tImpl = (DOMImplementationLS) tRegistry
                    .getDOMImplementation("LS");
            final LSSerializer tWriter = tImpl.createLSSerializer();
            tWriter.getDomConfig().setParameter("format-pretty-print",
                    Boolean.FALSE);
            tWriter.getDomConfig().setParameter(
                    "element-content-whitespace", Boolean.TRUE);
            pXml = tWriter.writeToString(tDoc);
        }
    } catch (RuntimeException | ParserConfigurationException | SAXException
            | IOException | ClassNotFoundException | InstantiationException
            | IllegalAccessException tE) {
        tE.printStackTrace();
    }
    return pXml;
}

public void removeWhitespaces(Node pRootNode) {
    if (pRootNode != null) {
        NodeList tList = pRootNode.getChildNodes();
        if (tList != null && tList.getLength() > 0) {
            ArrayList<Node> tRemoveNodeList = new ArrayList<Node>();
            for (int i = 0; i < tList.getLength(); i++) {
                Node tChildNode = tList.item(i);
                if (tChildNode.getNodeType() == Node.TEXT_NODE) {
                    if (tChildNode.getTextContent() == null
                            || "".equals(tChildNode.getTextContent().trim()))
                        tRemoveNodeList.add(tChildNode);
                } else
                    removeWhitespaces(tChildNode);
            }
            for (Node tRemoveNode : tRemoveNodeList) {
                pRootNode.removeChild(tRemoveNode);
            }
        }
    }
}

This answer would benefit from some explanation. – Eiko Jul 20 '16 at 17:57

score 0 · Answer 6 · answered Aug 25 '20 at 11:18

I did it like this:

    private static final Pattern WHITESPACE_PATTERN = Pattern.compile("\\s*", Pattern.DOTALL);

    private void removeWhitespace(Document doc) {
        LinkedList<NodeList> stack = new LinkedList<>();
        stack.add(doc.getDocumentElement().getChildNodes());
        while (!stack.isEmpty()) {
            NodeList nodeList = stack.removeFirst();
            for (int i = nodeList.getLength() - 1; i >= 0; --i) {
                Node node = nodeList.item(i);
                if (node.getNodeType() == Node.TEXT_NODE) {
                    if (WHITESPACE_PATTERN.matcher(node.getTextContent()).matches()) {
                        node.getParentNode().removeChild(node);
                    }
                } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                    stack.add(node.getChildNodes());
                }
            }
        }
    }

score -3 · Answer 7 · edited Sep 14 '11 at 20:24

transformer.setOutputProperty(OutputKeys.INDENT, "yes");

This will retain XML indentation.

answered Jan 05 '11 at 08:10 by Swapna Kasula; edited Sep 14 '11 at 20:24 by Jérôme Verstrynge

It does not strip superfluous spaces. – Thorbjørn Ravn Andersen Sep 01 '15 at 15:18
