
I understand that XML has five special characters that must be escaped (", ', <, >, &). I am trying to implement the following:

Input xml:

<?xml version = "1.0"?>
<class>
  <student id = "999">
  <firstname>Tes"Ting</firstname>
  <lastname>He'llo</lastname>
  <nickname1>W<or>ld</nickname1>
  <nickname2>star&wars</nickname2>
  </student>
</class>

Output XML:

<?xml version = "1.0"?>
<class>
  <student id = "999">
  <firstname>Tes&quot;Ting</firstname>
  <lastname>He&apos;llo</lastname>
  <nickname1>W&lt;or&gt;ld</nickname1>
  <nickname2>star&amp;wars</nickname2>
  </student>
</class>

The following code works fine if the text contains only a single quote (') or a double quote ("). When the input contains &, < or >, the XML parser throws an error. Can anyone suggest how to implement this?

import org.xml.sax.SAXException;
import org.w3c.dom.*;
import javax.xml.parsers.*;
import java.io.IOException;
import java.io.StringReader;
import com.vordel.trace.Trace;
import org.xml.sax.InputSource;
import org.apache.commons.lang.StringEscapeUtils;

========Logic=====
    def input = <input xml in string>
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    // The parser rejects the input here if the text contains a raw & or <
    Document doc = db.parse(new InputSource(new StringReader(input)));
    doc.getDocumentElement().normalize();
    NodeList nList = doc.getElementsByTagName("student");

    for (int temp = 0; temp < nList.getLength(); temp++)
    {
      Node nNode = nList.item(temp);
      if (nNode.getNodeType() == Node.ELEMENT_NODE) {
        Element eElement = (Element) nNode;
        // getTextContent() returns the parsed (unescaped) text of the element
        def escapedfirstname = StringEscapeUtils.escapeXml(eElement.getElementsByTagName("firstname").item(0).getTextContent());
        def escapedlastname = StringEscapeUtils.escapeXml(eElement.getElementsByTagName("lastname").item(0).getTextContent());
      }
    }
user3384231
  • You have a large string that looks a bit like XML, and you're trying to convert it into valid XML. This is not how escaping is done. Normally, XML is generated somewhat procedurally. In the simplest case, it can be as primitive as `"<nickname>" + nickname + "</nickname>"`. Escaping should be done this way: `"<nickname>" + escapeText(nickname) + "</nickname>"`. By the time you've merged everything into one long string, it's too late. –  Nov 08 '17 at 18:37
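The approach described in this comment, escaping each value before it is joined with the tags, could be sketched roughly as follows. `escapeText` and `element` are hypothetical helpers written for illustration, not part of any library, and the sketch only handles element text (not attribute values).

```java
final class EscapeWhileBuilding {

    // Hypothetical helper: escapes the characters that are unsafe in element text.
    static String escapeText(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Wraps a trusted tag name around the escaped content.
    // The tag itself is never escaped, only the value inside it.
    static String element(String tag, String rawContent) {
        return "<" + tag + ">" + escapeText(rawContent) + "</" + tag + ">";
    }

    public static void main(String[] args) {
        System.out.println(element("nickname1", "W<or>ld"));
        System.out.println(element("nickname2", "star&wars"));
    }
}
```

The point of the design is that escaping happens while the tags and the values are still separate, so there is never any doubt about which `<` belongs to markup.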

1 Answer

It's not possible. It's not a matter of whether those characters "can" be escaped: they must be escaped in certain circumstances. For instance, how would a parser distinguish the text <or> from the tag <or>? The solution the designers of XML came up with is that, in regular text, some characters must be escaped if they are meant to be text content. In this case, the opening bracket < needs to be represented as &lt;.

  • In regular text, < and & must be escaped to avoid confusion with tags and escape codes.
  • In attributes, quotes matching the opening quote must also be escaped to avoid confusion with the closing quote.

Any character that is legal in XML can also be written as a numeric character reference, such as &#8364; for the euro sign.
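To see these rules from the parser's side, here is a small sketch that parses a correctly escaped document and prints the decoded values. The class name, the `parseRoot` helper and the sample XML are made up for illustration; the parser calls are the standard JAXP ones already used in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class EscapeRoundTrip {

    // Hypothetical helper: parses a string and returns the root element.
    // Checked exceptions are wrapped so callers do not have to declare them.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // In text: < and & are escaped. In the double-quoted attribute: the " is escaped,
        // while the apostrophe needs no escaping because it does not match the opening quote.
        String xml = "<student note=\"say &quot;hi&quot; it's\">a &lt; b &amp; c &#8364;</student>";
        Element root = parseRoot(xml);
        System.out.println(root.getTextContent());      // decoded element text
        System.out.println(root.getAttribute("note"));  // decoded attribute value
    }
}
```

The parser hands back the original characters, which is why code that reads values through the DOM never sees the escape sequences.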

Stefan Haustein
  • So what's the solution you suggest? OK, my understanding is now clear: I need to replace all five special characters. Is there any way in Java to pass in the file directly and have the special characters replaced automatically, instead of iterating over the elements one by one? Thank you for your time. – user3384231 Nov 08 '17 at 18:33
  • How could that possibly work? It would replace the characters in the tags, too. Your file would become `&lt;?xml version = &quot;1.0&quot;?&gt; &lt;class&gt;`. You'll need to write the tags and content separately to avoid this problem. – Stefan Haustein Nov 08 '17 at 18:38
  • @user3384231: Stefan is correct. See the duplicate link, where options and references are presented. (The #1 option by far is to fix the problem at the source. What you have is not XML as it stands. There are no guaranteed automatic repair methods in the general case.) Thanks. – kjhughes Nov 08 '17 at 18:47
  • @stefan: Yes, that's what I thought, since it would replace the XML tags as well, but I wondered whether I was missing anything else. Writing tags and content separately is one option. kjhughes: I saw the other related post, and it mentions a preprocessing cleanup filter that checks the input stream. Would that not work in my scenario? Do you know? – user3384231 Nov 08 '17 at 19:34
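One way to "write the tags and content separately", as suggested in the comments, is to build the document through the DOM API and let the serializer do the escaping. This is only a sketch under the assumption that the raw values are available individually before the XML is assembled; the class and method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class BuildStudentXml {

    // Builds the <class>/<student> document from raw, unescaped values.
    static String studentXml(String id, String firstname, String lastname,
                             String nickname1, String nickname2) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element clazz = doc.createElement("class");
            doc.appendChild(clazz);
            Element student = doc.createElement("student");
            student.setAttribute("id", id);
            clazz.appendChild(student);
            appendText(doc, student, "firstname", firstname);
            appendText(doc, student, "lastname", lastname);
            appendText(doc, student, "nickname1", nickname1);
            appendText(doc, student, "nickname2", nickname2);

            // The serializer escapes markup characters in text and attribute values.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build XML", e);
        }
    }

    // Adds <tag>raw</tag> under parent; raw is stored as text, never parsed as markup.
    private static void appendText(Document doc, Element parent, String tag, String raw) {
        Element child = doc.createElement(tag);
        child.setTextContent(raw);
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        System.out.println(studentXml("999", "Tes\"Ting", "He'llo", "W<or>ld", "star&wars"));
    }
}
```

Note that a serializer is only required to escape what is necessary for well-formedness, so quotes inside element text may be written unescaped; the result is still valid XML that parses back to the original values.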