
I have the following XML:

<root><super-head>Text ​☆ and "more" ♥?</super-head></root>

And some entities (actually over 400 of them):

☆ = &star;
♥ = &heart;
" = &quot;
? = &quest;
- = &hyphen;

Now I want to replace every character in the list with its entity. Initially I tried to do this with a regular expression, but it didn't work. So I assume that Java or XSLT (I can only use 1.0 here) must be used.

In Java I tried the following:

public void replaceStringForNode(Node node, Map<String, String> map) {
    // replace for all attributes
    NamedNodeMap attributes = node.getAttributes();
    for (int i = 0, l = attributes.getLength(); i < l; i++) {
        Node attr = attributes.item(i);
        String content = attr.getNodeValue();
        for (Entry<String, String> entry : map.entrySet()) {
            content = content.replace(entry.getKey(), entry.getValue());
        }
        attr.setNodeValue(content);
    }

    // check all child nodes
    NodeList nodeList = node.getChildNodes();
    for (int i = 0; i < nodeList.getLength(); i++) {
        Node currentNode = nodeList.item(i);
        int type = currentNode.getNodeType();
        if (type == Node.ELEMENT_NODE) {
            this.replaceStringForNode(currentNode, map);
        } else if (type == Node.TEXT_NODE) {
            String content = currentNode.getNodeValue();
            for (Entry<String, String> entry : map.entrySet()) {
                content = content.replace(entry.getKey(), entry.getValue());
            }
            currentNode.setNodeValue(content);
        }
    }
}

But in this case I get the following XML (with escaped & characters):

<root><super-head>Text &amp;star; and &amp;quot;more&amp;quot; &amp;heart;&amp;quest;</super-head></root>

What is the best way to do this conversion, or how can I fix my approach?

kpalatzky
  • Can you save your entities without the ampersand? `♥ = heart;` – achAmháin Apr 19 '18 at 12:57
  • @notyou No, it has to be a valid entity – kpalatzky Apr 19 '18 at 13:30
  • Why that restriction to XSLT 1.0 if you are also using Java where Saxon 9 exists as a kind of reference implementation of XSLT 2 or 3 and using them you could easily solve the task using a character map (see https://xsltfiddle.liberty-development.net/eiZQaEW/1)? – Martin Honnen Apr 19 '18 at 13:38
  • @MartinHonnen Because I am not allowed to use it :( – kpalatzky Apr 19 '18 at 13:39
  • Possible duplicate of [Parsing XML file containing HTML entities in Java without changing the XML](https://stackoverflow.com/questions/36026353/parsing-xml-file-containing-html-entities-in-java-without-changing-the-xml) – Stavr00 Apr 19 '18 at 14:07
  • `&#9829;` is a valid character reference in XML. `&heart;` is not. You could apply a DTD overlay to the XML and then use it. But why? Why not `&#9829;`? – Tom Blodget Apr 19 '18 at 16:29

1 Answer


If you set the output encoding to US-ASCII, the serializer is forced to write every non-ASCII character as a numeric character reference of the form &#nnnn;, where nnnn is the character's code point.

transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.US_ASCII.name());
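
For context, here is a minimal sketch of where that line sits in a full parse-and-serialize round trip. The class and method names (`AsciiSerializerSketch`, `toAsciiXml`) are made up for illustration, and it assumes the JDK's built-in JAXP parser and transformer; other implementations may format the references differently.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class AsciiSerializerSketch {

    // Parses the XML string into a DOM and serializes it again as US-ASCII.
    static String toAsciiXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Identity transformer: copies the DOM to the output unchanged.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Characters outside US-ASCII cannot be written literally,
            // so the serializer falls back to numeric character references.
            transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.US_ASCII.name());
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML round trip failed", e);
        }
    }
}
```

Calling `AsciiSerializerSketch.toAsciiXml(...)` on the XML from the question should write ☆ as `&#9734;` and ♥ as `&#9829;`. ASCII characters such as `"` and `?` are left as they are in text content.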

Your entities don't work because XML predefines only five entities (&lt;, &gt;, &amp;, &apos; and &quot;). Any others have to be declared in a DTD, for example in the internal subset at the beginning of your XML document.

<!ENTITY star     "&#9734;"> 
<!ENTITY hearts   "&#9829;"> 
      . . . 
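
As a rough illustration of what such declarations do on the parsing side, the sketch below parses a small document whose internal DTD subset declares two entities and reads back the expanded text. The class name `InternalSubsetSketch` and the sample document are assumptions for the example (the entity names are taken from the question's list); it relies on the default JDK `DocumentBuilderFactory`, which expands entity references unless configured otherwise.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
final class InternalSubsetSketch {

    // Sample document: the DOCTYPE's internal subset declares the entities
    // before the root element uses them.
    static final String SAMPLE =
            "<!DOCTYPE root ["
            + "<!ENTITY star \"&#9734;\">"
            + "<!ENTITY heart \"&#9829;\">"
            + "]>"
            + "<root>&star; and &heart;</root>";

    // Parses the XML and returns the text content of the root element,
    // with declared entity references already expanded by the parser.
    static String parseText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

`InternalSubsetSketch.parseText(InternalSubsetSketch.SAMPLE)` returns the text with the real characters in place of `&star;` and `&heart;`. Without the declarations, the same document would be rejected as not well-formed.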

You may have to use the Apache Commons Text utility class, which understands HTML entities:

String org.apache.commons.text.StringEscapeUtils.escapeHtml4(String input) 
String org.apache.commons.text.StringEscapeUtils.escapeXml10(String input)

and incorporate them in your own customized EntityResolver class. The entity mapping should not happen inside the DOM objects but rather at the Transformation step where the DOM is serialized to a stream, writer, string or byte array.
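
One possible way to do the mapping at the serialization step, sketched under several assumptions: the document has already been written out with US-ASCII encoding, so non-ASCII characters appear as decimal references like `&#9734;`, and the serialized string is then post-processed. The class `NamedEntityWriterSketch` and its code-point-to-name map are hypothetical, not an existing API. The sketch does not cover ASCII characters such as `"` or `?`, and it would also rewrite matching text inside comments or CDATA sections.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processor for already-serialized XML text.
final class NamedEntityWriterSketch {

    // Matches decimal character references such as &#9734;
    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d+);");

    // Replaces each decimal reference whose code point has a name in the map
    // with the named entity; unknown code points are left untouched.
    static String toNamedEntities(String serializedXml, Map<Integer, String> names) {
        Matcher m = NUMERIC_REF.matcher(serializedXml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = names.get(Integer.parseInt(m.group(1)));
            String replacement = (name != null) ? "&" + name + ";" : m.group();
            // quoteReplacement keeps the replacement text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `toNamedEntities("<a>&#9734; &#9829;</a>", Map.of(9734, "star", 9829, "heart"))` gives `<a>&star; &heart;</a>`. Because the replacement runs on the output string and not on the DOM, the `&` is not escaped again. The result is only well-formed for consumers that have the matching entity declarations in their DTD.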


Ok, now for the editorial part of the answer.

Don't.

Just don't use external DTD entities or special parsing hacks. Let the XML transformer use its default behavior to parse or write out the XML. Let it write out numeric entities in the XML output. Every browser or XML parser will know what to do with them.

Stavr00
  • Thanks for the answer. The problem I have here is that `&#9734;` is used and not `&star;`. How can I use `&star;` instead of `&#9734;`? P.S. All entities are defined in my DTD file – kpalatzky Apr 19 '18 at 13:32
  • Using non-standard entities is problematic if your parser does not handle mapping the DTD entities. – Stavr00 Apr 19 '18 at 13:57
  • Could you give an example of how I can use the `EntityResolver` in this example? – kpalatzky Apr 19 '18 at 13:58
  • Some parsers expressly forbid external entity declarations to block attacks such as the _billion laughs_ – Stavr00 Apr 19 '18 at 14:05