We fetch XML from one source and pass it on to another component for further processing. However, the fetched XML contains special characters in attribute values, which the next stage does not accept. For example:

Sample Input:

<Message text="<html>Welcome User, <br> Happy to have you. <br>.</html>">

Expected Output:

<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;">

Sample Input : <Message text="<html>Welcome User, <br> Happy to have you. </html>" Multi="false"> <Meta source="system" dest="any"></Meta></Message>

Output: <Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;/html&gt;" Multi="false"> <Meta source="system" dest="any"></Meta></Message>

However, the <br> tags are not replaced when the input contains more than one of them.

We are using the following code:

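// Wrapper class added so the snippet compiles; the name is arbitrary.
public class Main {
    public static void main(String[] args) {
        String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"></Meta></Message>";
        System.out.println("ORG:" + xml);
        xml = replaceChars(xml);
        System.out.println("NEW:" + xml);
    }

    private static String replaceChars(String xml) {
        xml = xml.replace("&", "&amp;");
        // Escape a tag that directly follows an opening quote, e.g. "<html>
        xml = xml.replaceAll("\"<([^<]*)>", "\"&lt;$1&gt;");
        // Escape a closing tag that directly precedes a closing quote, e.g. </html>"
        xml = xml.replaceAll("</([^<]*)>\"", "&lt;/$1&gt;\"");
        // Escape one remaining tag between a pair of quotes; this pattern allows only a single <...> there
        xml = xml.replaceAll("\"([^<]*)<([^<]*)>([^<]*)\"", "\"$1&lt;$2&gt;$3\"");
        return xml;
    }
}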
diginoise
Chota Bheem
  • We are not parsing the xml. We just want to remove those characters due to which it's not parsing by SAX parser in the next stage. – Chota Bheem Jul 05 '18 at 12:20
  • Does this answer your question? [removing invalid XML characters from a string in java](https://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java) – Martin Schröder Sep 01 '20 at 11:58

3 Answers

To match the tags, you can use the following regular expression:

(?:<)(?<=<)(\/?\w*)(?=.*(?<=<\/html))(?:>)

  • (?:<) Match but don't capture <.
  • (?<=<) Positive lookbehind for <.
  • (\/?\w*) Capture tag name. Optional / and word characters.
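  • (?=.*(?<=<\/html)) Positive lookahead with a nested positive lookbehind: requires that </html ends at this position or somewhere later on the line, so only tags up to and including the closing </html> match.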
  • (?:>) Match but don't capture >.

To replace you can use:

  • &lt;$1&gt;

Where $1 is the result of the capture group in the regular expression. You can test the regular expression interactively here.

Using the following Java code:

 // Wrapper class added so the snippet compiles; the name is arbitrary.
 public class Main {
    public static void main(String[] args) {
        String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\"><Meta source=\"system\" dest=\"any\"></Meta></Message>";
        String newxml = replaceChars(xml);
        System.out.println(newxml);
    }

    private static String replaceChars(String xml) {
        // Backslashes are doubled because the regex is written as a Java string literal.
        xml = xml.replaceAll("(?:<)(?<=<)(\\/?\\w*)(?=.*(?<=<\\/html))(?:>)", "&lt;$1&gt;");
        return xml;
    }
 }

For the input above, the output is:

<Message text="&lt;html&gt;Welcome User, &lt;br&gt; Happy to have you. &lt;br&gt;.&lt;/html&gt;" Multi="false"><Meta source="system" dest="any"></Meta></Message>

Paolo
  • It is partially correct. The whole output is : ` </Meta></Message>` Observe the closing tags for `Meta` and `Message`. Basically, we would want to consider only those content which is between `""`(double quotes). – Chota Bheem Jul 05 '18 at 13:19
  • @Chota Right, I get you. Please try `(?:<)(?<=<)(\/?\w*)(?=.*(?<=<\/html))(?:>)` [here](https://regex101.com/r/vvUyB2/1/). Let me know and I will update my answer. – Paolo Jul 05 '18 at 13:49
  • Yes this is better but this seems to expect that it would always end with ` – Chota Bheem Jul 05 '18 at 13:52
  • 1
    It is trivial to add additional cases to the second lookbehind for tags you know you will want to match, i.e. `(?<=<\/html|\/br)` – Paolo Jul 05 '18 at 15:12
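
The comments above ask for only the text between double quotes to be touched, without relying on the value ending in `</html>`. One possible way to do that, sketched here under assumptions, is to locate each attribute value first and escape inside it. The class name, method name, and pattern are hypothetical and not part of the original answer. The pattern is a heuristic: it treats a quote as the end of a value only when it is followed by another attribute or by the end of the tag, so it breaks if a value itself contains such a sequence, and it double-escapes entities that are already escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeValueEscaper {
    // Heuristic (assumption): a value runs from =" to the first quote that is
    // followed by another attribute (name=") or by the end of the tag (> or />).
    private static final Pattern ATTR_VALUE =
            Pattern.compile("=\"(.*?)\"(?=\\s+[\\w:.-]+=\"|\\s*/?>)", Pattern.DOTALL);

    static String escapeAttributeValues(String xml) {
        Matcher m = ATTR_VALUE.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Escape & first so the entities added next are not escaped again.
            String value = m.group(1)
                    .replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            // quoteReplacement keeps $ and \ in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement("=\"" + value + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<Message text=\"<html>Welcome User, <br> Happy to have you. <br>.</html>\" Multi=\"false\">"
                + "<Meta source=\"system\" dest=\"any\"></Meta></Message>";
        System.out.println(escapeAttributeValues(xml));
    }
}
```

Markup outside attribute values, such as `<Meta ...>` and the closing tags, is left untouched because only the captured value is rewritten.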

Please do not use regular expressions to escape special characters in XML.

Can you guarantee that this will work for all possible HTML input, with all of the quirks of HTML and XML (both have very extensive specs)?

Just use one of the many utilities out there to escape XML strings.

Apache Commons is quite popular; please see this example.
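
As a rough illustration of what such a utility does, the sketch below escapes the five predefined XML entities by hand using only the JDK. The class and method names are hypothetical. In Apache Commons Text the corresponding call is `StringEscapeUtils.escapeXml10`, which additionally drops characters that XML 1.0 cannot represent; this sketch does not.

```java
class XmlEscapeSketch {
    // Escapes the five predefined XML entities in a raw (unescaped) value.
    // Assumption: the input has not been escaped before, otherwise "&amp;"
    // would become "&amp;amp;".
    static String escapeXml(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "<html>Welcome User, <br> Happy to have you. <br>.</html>";
        System.out.println("<Message text=\"" + escapeXml(value) + "\"/>");
    }
}
```

Such a function has to be applied to the raw attribute value before the XML is assembled; it cannot repair a document in which markup and unescaped values are already mixed.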

diginoise

XML is not plain text. In practice, XML documents are best treated as a binary format.

Processing XML as text is the wrong approach, and only works in simple cases. Things to consider:

  • An XML document declares its encoding inside the document itself rather than relying on a file encoding (so it must be read by an XML parser, which handles this correctly).
  • XML documents use XML entities (built-ins such as &amp;, &lt;, &gt; and &quot;; others can be defined arbitrarily in a DTD, see https://www.w3resource.com/xml/entities.php).
  • XML documents can contain CDATA sections.
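
To make the entity and CDATA points concrete, here is a small sketch using the JDK's built-in DOM parser. The class name, helper name, and sample document are made up for illustration. The parser hands back the decoded values, which a text-level search over the raw document would not see.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ParserViewSketch {
    // Parses a well-formed XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // Thrown, for example, when an attribute value contains a raw '<'.
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Message text=\"&lt;br&gt; A &amp; B\"><![CDATA[<b>raw</b>]]></Message>";
        Element root = parseRoot(xml);
        System.out.println(root.getAttribute("text")); // prints: <br> A & B
        System.out.println(root.getTextContent());     // prints: <b>raw</b>
    }
}
```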

Therefore:

  • use a proper XML parser to read documents
  • perform manipulations (text replacement, add/remove nodes) on the DOM (document object model) or streaming model.
  • use a proper XML processor to write documents
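
A minimal sketch of the writing side, assuming you control the code that produces the message: build the document through the DOM API and let the JDK serializer do the escaping. The class and method names are hypothetical. This does not repair XML that is already malformed when it arrives; it only shows that a proper writer never emits a raw `<` inside an attribute value.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DomWriterSketch {
    // Builds <Message text="..."/> from a raw value; the serializer escapes it.
    static String buildMessage(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element message = doc.createElement("Message");
            message.setAttribute("text", text); // raw value, no manual escaping
            doc.appendChild(message);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the serialized message again and returns the decoded attribute value.
    static String readTextAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("text");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<html>Welcome User, <br> Happy to have you. <br>.</html>";
        String xml = buildMessage(raw);
        System.out.println(xml);                    // escaped form, safe for the next stage
        System.out.println(readTextAttribute(xml)); // round-trips to the raw value
    }
}
```

The round trip through `readTextAttribute` is the useful check here: whatever escaping the serializer chooses, a conforming parser in the next stage recovers the original value.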

By the way, the XML in your example is NOT well-formed XML (no entities are used for <, > and " inside the attribute value).

Peter Walser