I have an XML string from which I want to remove the empty elements, along with the lines that contain them.

For example:

XML:

<ct>
   <c>http://192.168.105.213</c>
   <l>http://192.168.105.213</l>
   <o></o>
   <l>http://192.168.105.213</l>
   <o>http://192.168.105.213</o>
</ct>

Here <o></o> is the empty element, so after removing it I want:

   <ct>
       <c>http://192.168.105.213</c>
       <l>http://192.168.105.213</l>
       <l>http://192.168.105.213</l>
       <o>http://192.168.105.213</o>
   </ct>

So the whole line must be removed, leaving the remaining lines with their indentation intact.

I tried: xml.replaceAll("<(\\w+)></\\1>", "");

This leaves an empty line in between:

<ct>
   <c>http://192.168.105.213</c>
   <l>http://192.168.105.213</l>

   <l>http://192.168.105.213</l>
   <o>http://192.168.105.213</o>
</ct>

How do I remove the whitespace (spaces, \n, \t, \r) correctly so that the indentation stays intact?

Siddharth Trikha
  • Please, do not use regular expressions to parse XML. Never. See http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – vanje Sep 30 '16 at 10:37
  • @vanje I like this answer better: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – online Thomas Sep 30 '16 at 10:44
  • @Thomas: Yes, you're right. – vanje Sep 30 '16 at 10:58

3 Answers

This would work:

xml.replaceAll("<(\\w+)></\\1>\n\\s+", "");

It matches your pattern followed by a newline and one or more whitespace characters (including tabs), so the line break after the empty element is removed along with it.

EDIT: xml.replaceAll("\n\\s+<(\\w+)></\\1>", "") should work for deeper levels as well.

If the root element may itself be empty, or any of the child elements may be unindented, you might need to make the newline and spaces optional:

xml.replaceAll("\n?\\s*<(\\w+)></\\1>", "")
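
For reference, here is a minimal sketch of the last pattern applied to the sample from the question. The class and method names (EmptyElementRegexDemo, removeEmptyElements) are made up for illustration. Note that replaceAll returns a new string, so the result has to be assigned or returned.

```java
public class EmptyElementRegexDemo {

    // Removes each <x></x> pair together with the optional newline and
    // whitespace in front of it. Hypothetical helper, not a library method.
    static String removeEmptyElements(String xml) {
        // String is immutable: replaceAll returns the modified copy.
        return xml.replaceAll("\n?\\s*<(\\w+)></\\1>", "");
    }

    public static void main(String[] args) {
        String xml = "<ct>\n"
                + "   <c>http://192.168.105.213</c>\n"
                + "   <l>http://192.168.105.213</l>\n"
                + "   <o></o>\n"
                + "   <l>http://192.168.105.213</l>\n"
                + "   <o>http://192.168.105.213</o>\n"
                + "</ct>";
        // Prints the document without the empty <o></o> line.
        System.out.println(removeEmptyElements(xml));
    }
}
```

Because \w+ only matches plain names, this will not catch self-closing tags such as <o/>, elements with attributes, or prefixed names such as <p:o></p:o>.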
Naveed S
  • It works for a one level indentation, but for a deeply nested empty element will this remove proper spaces to maintain the indentation ? – Siddharth Trikha Sep 30 '16 at 11:06
  • @SiddharthTrikha Please have the newline+spaces combination before the tags as in the edit. It should work for deeper ones. – Naveed S Sep 30 '16 at 11:28

This should solve it for you, provided the empty element is indented with a single tab:

xml = xml.replaceAll("\n\t<(\\w+)></\\1>", "");
noned

As advised in the comments, reconsider using regex directly on HTML/XML documents, as these are not regular languages. Use regex on parsed text/value content instead, not to transform the documents themselves.

One great tool for manipulating XML is XSLT, the transformation language and sibling to XPath. Java ships with a built-in XSLT 1.0 processor and can also call external processors (Xalan, Saxon, etc.). Consider the following setup:

XSLT Script (save as .xsl file used below; script removes empty nodes)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- Identity Transform to Copy Document as is -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty Template to Remove Such Nodes -->
  <xsl:template match="*[.='']"/>

</xsl:transform>

Java Code

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import java.io.File;
import java.io.IOException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XMLTransform {
    public static void main(String[] args) throws IOException, SAXException,
                                                  ParserConfigurationException,
                                                  TransformerException {
        // Paths to the input XML, the XSLT script, and the output file
        String inputXML = "path/to/Input.xml";
        String xslFile = "path/to/XSLT/Script.xsl";
        String outputXML = "path/to/Output.xml";

        // Load the XSLT script and parse the input XML into a DOM document
        Source xslt = new StreamSource(new File(xslFile));
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        // The factory is not namespace-aware by default; XSLT expects a namespace-aware DOM
        docFactory.setNamespaceAware(true);
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(new File(inputXML));

        // XSLT transformation with pretty print
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer(xslt);

        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.STANDALONE, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // Implementation-specific (Xalan-style) property that sets the indent width
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

        // Run the transformation and write the result to the output file
        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(new File(outputXML));
        transformer.transform(source, result);
    }
}

Output

<ct>
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</ct>
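
Since the question starts from an XML string rather than a file, the same transformation can probably be run entirely in memory. The following is only a sketch under that assumption: the class and method names (StringXsltDemo, removeEmptyElements) are hypothetical, the stylesheet is the one above inlined as a string, and the exact indentation width of the output depends on the XSLT processor in use.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StringXsltDemo {

    // Same stylesheet as above: identity transform plus an empty template
    // that drops every element whose string value is empty.
    private static final String XSLT =
          "<xsl:transform xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
        + "<xsl:output omit-xml-declaration=\"yes\" indent=\"yes\"/>"
        + "<xsl:strip-space elements=\"*\"/>"
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"*[.='']\"/>"
        + "</xsl:transform>";

    // Hypothetical helper: transforms an XML string and returns the result as a string.
    static String removeEmptyElements(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers do not need a checked exception in this sketch.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ct>\n"
                + "   <c>http://192.168.105.213</c>\n"
                + "   <l>http://192.168.105.213</l>\n"
                + "   <o></o>\n"
                + "   <l>http://192.168.105.213</l>\n"
                + "   <o>http://192.168.105.213</o>\n"
                + "</ct>";
        System.out.println(removeEmptyElements(xml));
    }
}
```

Reading the input through a StreamSource lets the processor parse the XML itself, so no separate DocumentBuilder setup is needed here.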

NAMESPACES

When working with namespaces such as the below XML:

<prefix:ct xmlns:prefix="http://www.example.com">
   <c>http://192.168.105.213</c>
   <l>http://192.168.105.213</l>
   <o></o>
   <l>http://192.168.105.213</l>
   <o>http://192.168.105.213</o>
</prefix:ct>

Use the following XSLT with declaration in header and added template:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:prefix="http://www.example.com">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Retain Namespace Prefix -->
  <xsl:template match="prefix:ct">
    <xsl:element name='prefix:{local-name()}' namespace='http://www.example.com'>
      <xsl:copy-of select="namespace::*"/>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:element>
  </xsl:template>

  <!-- Remove Empty Nodes -->
  <xsl:template match="*[.='']"/>

</xsl:transform>

Output

<prefix:ct xmlns:prefix="http://www.example.com">
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</prefix:ct>
Parfait
  • I initially tried XSLT with the same template as yours but without this part ` `, and with that the empty space was still showing. I will try again with this last part added. Will this remove the spaces? – Siddharth Trikha Sep 30 '16 at 17:27
  • Basically the white space of the indentation was also getting stripped. – Siddharth Trikha Sep 30 '16 at 17:33
  • Yes, as shown. In fact, that template match is the key item of script. The Identity Transform copies the entire document as is so changes nothing if you leave this empty template out. Also using `` removes unneeded whitespaces. – Parfait Sep 30 '16 at 17:34
  • Ok.. If we want to remove only particular empty elements with particular names, , , Are three empty elements an I want to remove only one and three element: would work?? – Siddharth Trikha Sep 30 '16 at 20:13
  • Almost. Do this: `` – Parfait Sep 30 '16 at 21:43
  • When I tried your example, after transformation my XML string is modified such that the prefix of my root element is getting removed. EG: XML before: ` THA true ` then after transformation XML: ` THA true ` so the prefix is getting stripped from my original XML – Siddharth Trikha Oct 03 '16 at 04:31
  • See updated section for *Namespaces* where you declare namespace in XSLT's header and add the new template. Example included. Aside - always include namespaces when asking XML questions. Also, with namespaces, this would be insane to do with regex! – Parfait Oct 03 '16 at 17:27
  • OK. If I use this template for many different XML strings, there will be many elements with a namespace prefix. Here you matched only the `ct` element with a prefix; how can this be generalized so that the prefixes of all elements are retained? – Siddharth Trikha Oct 04 '16 at 11:50
  • No two XML files are the same, so the XSLT will have to be customized. For other elements with prefixes, simply walk down the tree with different templates. Usually there are a handful of namespaces, not hundreds. – Parfait Oct 04 '16 at 12:21