Unescaping HTML without touching XML characters

Question

I have an XML file containing escaped HTML characters and escaped XML characters as seen here:

<question description="How can I unescape only HTML characters such as: &Atilde; and &#48;,but not special characters such as &amp;">

How can I unescape all HTML entities while leaving the following XML entities escaped:

- &amp;
- &gt;
- &lt;
- &quot;
- &apos;

When I used StringEscapeUtils.unescapeHtml(), it also unescaped the XML characters.

Does this answer your question? [How to unescape HTML entities but leave XML entities untouched?](https://stackoverflow.com/questions/16347441/how-to-unescape-html-entities-but-leave-xml-entities-untouched) — Majid Hajibaba, Jul 13 '21 at 11:10

Majid Hajibaba · Answer 1 · 2021-07-13T13:55:47.923

From @Roman's post:

Create a class and name it HtmlEscapeUtils:

import org.apache.commons.text.translate.AggregateTranslator;
import org.apache.commons.text.translate.CharSequenceTranslator;
import org.apache.commons.text.translate.EntityArrays;
import org.apache.commons.text.translate.LookupTranslator;
import org.apache.commons.text.translate.NumericEntityUnescaper;


public class HtmlEscapeUtils {

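  /**
   * Same translators as StringEscapeUtils.UNESCAPE_HTML4, minus the
   * EntityArrays.BASIC_UNESCAPE lookup (&quot;, &amp;, &lt;, &gt;).
   *
   * @see org.apache.commons.text.StringEscapeUtils#UNESCAPE_HTML4
   */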
  public static final CharSequenceTranslator UNESCAPE_HTML_SPECIFIC =
      new AggregateTranslator(
          new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
          new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
          new NumericEntityUnescaper());


  /**
   * Unescapes HTML4 entities but leaves the XML entities escaped.
   *
   * @param input HTML string containing e.g. &quot; &amp; &auml;
   * @return XML string with HTML4 entities replaced and XML entities kept (e.g. &quot; and &amp;)
   * @see org.apache.commons.text.StringEscapeUtils#unescapeHtml4(String)
   */
  public static final String unescapeHtmlToXml(final String input) {
    return UNESCAPE_HTML_SPECIFIC.translate(input);
  }

}

Then use it in your program:

public static void main(String[] args) {
    String source = "How can I unescape only HTML characters such as: &Atilde; and &#48;,but not special characters such as &amp; or &gt;";
    // &Atilde; and &#48; are decoded; &amp; and &gt; are left as they are.
    String unescaped = HtmlEscapeUtils.unescapeHtmlToXml(source);
    System.out.println(unescaped);
}

You need the following Maven dependency in your project:

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-text</artifactId>
  <version>1.9</version>
</dependency>
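
One caveat: NumericEntityUnescaper decodes every numeric reference, so input such as &#38; or &#60; would still come out as a raw & or <. If that matters for your XML, the idea can be sketched without Commons Text as below. This is only a minimal illustration under stated assumptions: the class name XmlSafeHtmlUnescaper and its four-entry entity table are hypothetical, a real table would need the full HTML4 entity set, and only the five XML-reserved characters are guarded.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Apache Commons Text.
final class XmlSafeHtmlUnescaper {

    // Tiny illustrative subset of HTML named entities (assumption: a real
    // table would be far larger). The five XML entities are deliberately absent.
    private static final Map<String, String> NAMED = Map.of(
            "Atilde", "\u00C3",
            "auml", "\u00E4",
            "nbsp", "\u00A0",
            "euro", "\u20AC");

    // Matches &name;, &#123; and &#x1F; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = decode(m.group(1));
            // Unknown or XML-reserved references are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference must stay escaped.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body); // null for amp, lt, gt, quot, apos and unknown names
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        int codePoint;
        try {
            codePoint = hex ? Integer.parseInt(body.substring(2), 16)
                            : Integer.parseInt(body.substring(1), 10);
        } catch (NumberFormatException e) {
            return null; // number too large to parse
        }
        // Keep numeric references to & < > " ' escaped so the XML stays well-formed.
        if (!Character.isValidCodePoint(codePoint) || "&<>\"'".indexOf(codePoint) >= 0) {
            return null;
        }
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(unescape("&Atilde; and &#48; but not &amp;, &gt;, &#38; or &#x3C;"));
    }
}
```

Here &Atilde; and &#48; are decoded, while &amp;, &gt;, &#38; and &#x3C; pass through untouched. With Commons Text itself, a similar effect could probably be reached by putting a custom CharSequenceTranslator in place of NumericEntityUnescaper, but that is not shown here.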
